Re: [HACKERS] Issues with Quorum Commit

2010-10-20 Thread Bruce Momjian
Tom Lane wrote:
 Greg Smith g...@2ndquadrant.com writes:
  I don't see this as needing any implementation any more complicated than 
  the usual way such timeouts are handled.  Note how long you've been 
  trying to reach the standby.  Default to -1 for forever.  And if you hit 
  the timeout, mark the standby as degraded and force them to do a proper 
  resync when they disconnect.  Once that's done, then they can re-enter 
  sync rep mode again, via the same process a new node would have done so.
 
 Well, actually, that's *considerably* more complicated than just a
 timeout.  How are you going to mark the standby as degraded?  The
 standby can't keep that information, because it's not even connected
 when the master makes the decision.  ISTM that this requires
 
 1. a unique identifier for each standby (not just role names that
 multiple standbys might share);
 
 2. state on the master associated with each possible standby -- not just
 the ones currently connected.
 
 Both of those are perhaps possible, but the sense I have of the
 discussion is that people want to avoid them.
 
 Actually, #2 seems rather difficult even if you want it.  Presumably
 you'd like to keep that state in reliable storage, so it survives master
 crashes.  But how you gonna commit a change to that state, if you just
 lost every standby (suppose master's ethernet cable got unplugged)?
 Looks to me like it has to be reliable non-replicated storage.  Leaving
 aside the question of how reliable it can really be if not replicated,
 it's still the case that we have noplace to put such information given
 the WAL-is-across-the-whole-cluster design.

I assumed we would have a parameter called sync_rep_failure that would
take a command, and the command would be called when communication to the
slave was lost.  If you restart, it tries again and might call the
command again.
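
To make the idea concrete, a rough sketch of that behaviour (Python, purely
illustrative; sync_rep_failure is only a proposed parameter, not an existing
GUC, and the command path is made up):

    # Sketch of the proposed behaviour: after a configurable timeout with no
    # contact from the synchronous standby, run the configured command and
    # stop waiting.  Neither parameter exists today.
    import subprocess
    import time

    SYNC_REP_TIMEOUT = 30                                 # seconds; -1 means wait forever
    SYNC_REP_FAILURE = "/usr/local/bin/standby-lost.sh"   # hypothetical command

    def wait_for_standby_ack(ack_received, last_contact):
        """Loop a commit would sit in while waiting for the sync standby."""
        while not ack_received():
            waited = time.time() - last_contact
            if SYNC_REP_TIMEOUT >= 0 and waited > SYNC_REP_TIMEOUT:
                subprocess.call([SYNC_REP_FAILURE])   # notify, then give up waiting
                return False
            time.sleep(0.1)
        return True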

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] Issues with Quorum Commit

2010-10-14 Thread Greg Stark
On Tue, Oct 12, 2010 at 11:50 PM, Robert Haas robertmh...@gmail.com wrote:
 There's another problem here we should think about, too.  Suppose you
 have a master and two standbys.  The master dies.  You promote one of
 the standbys, which turns out to be behind the other.  You then
 repoint the other standby at the one you promoted.  Congratulations,
 your database is now very possibly corrupt, and you may very well get
 no warning of that fact.  It seems to me that we would be well-advised
 to install some kind of bullet-proof safeguard against this kind of
 problem, so that you will KNOW that the standby needs to be re-synced.
  I mention this because I have a vague feeling that timelines are
 supposed to prevent you from getting different WAL histories confused
 with each other, but they don't actually cover all the cases that can
 happen.


Why don't the usual protections kick in here? The new record read from
the location the xlog reader is expecting to find it has to have a
valid CRC and a correct back pointer to the previous record. If the
new wal sender is behind the old one then the new record it's sent
won't match up at all.
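
For reference, those two checks can be sketched roughly like this (illustrative
Python, not the actual xlog reader code; the field names are invented):

    # Each record must carry a CRC that matches its contents and a back
    # pointer that matches the location of the record read just before it.
    import zlib

    def validate_next_record(record, prev_record_lsn):
        if zlib.crc32(record["payload"]) != record["crc"]:
            return False, "bad CRC"
        if record["xl_prev"] != prev_record_lsn:
            return False, "back pointer does not match the previous record"
        return True, "ok"

    # A record streamed by a walsender that is behind the standby fails the
    # back-pointer check, so replay stops rather than applying bogus WAL.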

-- 
greg



Re: [HACKERS] Issues with Quorum Commit

2010-10-14 Thread Robert Haas
On Wed, Oct 13, 2010 at 5:22 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Wed, Oct 13, 2010 at 3:50 PM, Robert Haas robertmh...@gmail.com wrote:
 There's another problem here we should think about, too.  Suppose you
 have a master and two standbys.  The master dies.  You promote one of
 the standbys, which turns out to be behind the other.  You then
 repoint the other standby at the one you promoted.  Congratulations,
 your database is now very possibly corrupt, and you may very well get
 no warning of that fact.  It seems to me that we would be well-advised
 to install some kind of bullet-proof safeguard against this kind of
 problem, so that you will KNOW that the standby needs to be re-synced.

 Yep. This is why I said it's not easy to implement that.

 To start the standby without taking a base backup from new master after
 failover, the user basically has to promote the standby which is ahead
 of the other standbys (e.g., by comparing pg_last_xlog_replay_location
 on each standby).

 As the safeguard, we seem to need to compare the location at the switch
 of the timeline on the master with the last replay location on the standby.
 If the latter location is ahead AND the timeline ID of the standby is not
 the same as that of the master, we should emit warning and terminate the
 replication connection.

That doesn't seem very bullet-proof.  You can accidentally corrupt a
standby even when only one timeline is involved.  AFAIK, stopping a
standby, removing recovery.conf, and starting it up again does not
change timelines.  You can even shut down the standby, bring it up as
a master, generate a little WAL, shut it back down, and bring it back
up as a standby pointing to the same master.  It would be nice to
embed in each checkpoint record an identifier that changes randomly on
each transition to normal running, so that if you do something like
this we can notice and complain loudly.
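
A rough sketch of that idea (illustrative only; no such identifier exists in
checkpoint records today):

    # Generate a fresh random "run id" at every transition to normal running,
    # stamp it into checkpoint records, and have the standby refuse to follow
    # an upstream whose run id has changed unexpectedly.
    import os

    def enter_normal_running(control):
        control["run_id"] = os.urandom(8).hex()

    def standby_check_checkpoint(expected_run_id, checkpoint_run_id):
        if expected_run_id is not None and checkpoint_run_id != expected_run_id:
            raise RuntimeError("upstream was independently brought up as a "
                               "master; this standby must be re-synced")
        return checkpoint_run_id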

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Issues with Quorum Commit

2010-10-14 Thread Fujii Masao
On Thu, Oct 14, 2010 at 11:18 AM, Greg Stark gsst...@mit.edu wrote:
 Why don't the usual protections kick in here? The new record read from
 the location the xlog reader is expecting to find it has to have a
 valid CRC and a correct back pointer to the previous record.

Yep. In most cases, those protections seem to be able to make the standby
notice the inconsistency of the WAL and then give up continuing replication.
But not in all cases. Can we really regard those protections as a
bullet-proof safeguard?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Issues with Quorum Commit

2010-10-14 Thread Robert Haas
On Wed, Oct 13, 2010 at 10:18 PM, Greg Stark gsst...@mit.edu wrote:
 On Tue, Oct 12, 2010 at 11:50 PM, Robert Haas robertmh...@gmail.com wrote:
 There's another problem here we should think about, too.  Suppose you
 have a master and two standbys.  The master dies.  You promote one of
 the standbys, which turns out to be behind the other.  You then
 repoint the other standby at the one you promoted.  Congratulations,
 your database is now very possibly corrupt, and you may very well get
 no warning of that fact.  It seems to me that we would be well-advised
 to install some kind of bullet-proof safeguard against this kind of
 problem, so that you will KNOW that the standby needs to be re-synced.
  I mention this because I have a vague feeling that timelines are
 supposed to prevent you from getting different WAL histories confused
 with each other, but they don't actually cover all the cases that can
 happen.


 Why don't the usual protections kick in here? The new record read from
 the location the xlog reader is expecting to find it has to have a
 valid CRC and a correct back pointer to the previous record. If the
 new wal sender is behind the old one then the new record it's sent
 won't match up at all.

There's some kind of logic that rewinds to the beginning of the WAL
segment and tries to replay from there.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Issues with Quorum Commit

2010-10-13 Thread Heikki Linnakangas

On 13.10.2010 08:21, Fujii Masao wrote:

On Sat, Oct 9, 2010 at 4:31 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

It shouldn't be too hard to fix. Walsender needs to be able to read WAL from
preceding timelines, like recovery does, and walreceiver needs to write the
incoming WAL to the right file.


And walsender seems to need to transfer the current timeline history to
the standby. Otherwise, the standby cannot recover the WAL file with new
timeline. And the standby might need to create the timeline history file
in order to recover the WAL file with new timeline even after it's restarted.


Yes, true, you need that too.

It might be good to divide this work into two phases, teaching archive 
recovery to notice new timelines appearing in the archive first, and 
doing the walsender/walreceiver changes after that.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Issues with Quorum Commit

2010-10-13 Thread Robert Haas
On Wed, Oct 13, 2010 at 2:43 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 13.10.2010 08:21, Fujii Masao wrote:

 On Sat, Oct 9, 2010 at 4:31 AM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com  wrote:

 It shouldn't be too hard to fix. Walsender needs to be able to read WAL
 from
 preceding timelines, like recovery does, and walreceiver needs to write
 the
 incoming WAL to the right file.

 And walsender seems to need to transfer the current timeline history to
 the standby. Otherwise, the standby cannot recover the WAL file with new
 timeline. And the standby might need to create the timeline history file
 in order to recover the WAL file with new timeline even after it's
 restarted.

 Yes, true, you need that too.

 It might be good to divide this work into two phases, teaching archive
 recovery to notice new timelines appearing in the archive first, and doing
 the walsender/walreceiver changes after that.

There's another problem here we should think about, too.  Suppose you
have a master and two standbys.  The master dies.  You promote one of
the standbys, which turns out to be behind the other.  You then
repoint the other standby at the one you promoted.  Congratulations,
your database is now very possibly corrupt, and you may very well get
no warning of that fact.  It seems to me that we would be well-advised
to install some kind of bullet-proof safeguard against this kind of
problem, so that you will KNOW that the standby needs to be re-synced.
 I mention this because I have a vague feeling that timelines are
supposed to prevent you from getting different WAL histories confused
with each other, but they don't actually cover all the cases that can
happen.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Issues with Quorum Commit

2010-10-13 Thread Markus Wanner
On 10/13/2010 06:43 AM, Fujii Masao wrote:
 Unfortunately even enough standbys don't increase write-availability
 unless you choose wait-forever. Because, after promoting one of
 standbys to new master, you must keep all the transactions waiting
 until at least one standby has connected to and caught up with new
 master. Currently this wait time is not short.

Why is that? Don't the standbies just have to switch from one walsender
to another? If there's any significant delay in switching, this either
hurts availability or robustness, yes.

 Hmm.. that increases the number of procedures which the users must
 perform at the failover.

I only consider fully automated failover. However, you seem to be
worried about the initial setup of sync rep.

 At least, the users seem to have to wait
 until the standby has caught up with new master, increase quorum_commit
 and then reload the configuration file.

For switching from a single node to a sync replication setup with one or
more standbies, that seems reasonable. There are way more components you
need to set up or adjust in such a case (network, load balancer, alerting
system and maybe even the application itself).

There's really no other option, if you want the kind of robustness
guarantee that sync rep with wait forever provides. OTOH, if you just
replicate to whatever standby is there and don't care much if it isn't,
the admin doesn't need to worry much about quorum_commit - it doesn't
have much of an effect anyway.

Regards

Markus Wanner



Re: [HACKERS] Issues with Quorum Commit

2010-10-13 Thread Fujii Masao
On Wed, Oct 13, 2010 at 3:43 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 13.10.2010 08:21, Fujii Masao wrote:

 On Sat, Oct 9, 2010 at 4:31 AM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com  wrote:

 It shouldn't be too hard to fix. Walsender needs to be able to read WAL
 from
 preceding timelines, like recovery does, and walreceiver needs to write
 the
 incoming WAL to the right file.

 And walsender seems to need to transfer the current timeline history to
 the standby. Otherwise, the standby cannot recover the WAL file with new
 timeline. And the standby might need to create the timeline history file
 in order to recover the WAL file with new timeline even after it's
 restarted.

 Yes, true, you need that too.

 It might be good to divide this work into two phases, teaching archive
 recovery to notice new timelines appearing in the archive first, and doing
 the walsender/walreceiver changes after that.

OK. In detail,

1. After failover, when the standby connects to new master, walsender transfers
   the current timeline history in the handshake processing.

2. If the timeline history in the master is inconsistent with that in the
   standby, walreceiver terminates the replication connection (see the
   sketch after this list).

3. Walreceiver creates the timeline history file.

4. Walreceiver signals the change of timeline history to the startup process
   and makes it read the timeline history file. After this, the startup
   process tries to recover WAL files even with the new timeline ID.

5. After the handshake, walsender sends the WAL from preceding timelines,
   like recovery does, and walreceiver writes the incoming WAL to the right
   file.
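
A sketch of the consistency check in step 2 (illustrative Python; a timeline
history is modelled here as a list of (timeline_id, switch_lsn) pairs):

    def histories_consistent(master_history, standby_history):
        # The standby's history must be a prefix of the master's history:
        # the same timeline IDs, switching at the same locations.
        return (len(standby_history) <= len(master_history)
                and master_history[:len(standby_history)] == standby_history)

    histories_consistent([(1, "0/5000000"), (2, "0/9000000")],
                         [(1, "0/5000000")])                  # True
    histories_consistent([(1, "0/5000000"), (2, "0/9000000")],
                         [(1, "0/6000000")])                  # False: diverged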

Am I missing something?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Issues with Quorum Commit

2010-10-13 Thread Fujii Masao
On Wed, Oct 13, 2010 at 3:50 PM, Robert Haas robertmh...@gmail.com wrote:
 There's another problem here we should think about, too.  Suppose you
 have a master and two standbys.  The master dies.  You promote one of
 the standbys, which turns out to be behind the other.  You then
 repoint the other standby at the one you promoted.  Congratulations,
 your database is now very possibly corrupt, and you may very well get
 no warning of that fact.  It seems to me that we would be well-advised
 to install some kind of bullet-proof safeguard against this kind of
 problem, so that you will KNOW that the standby needs to be re-synced.

Yep. This is why I said it's not easy to implement that.

To start the standby without taking a base backup from new master after
failover, the user basically has to promote the standby which is ahead
of the other standbys (e.g., by comparing pg_last_xlog_replay_location
on each standby).

As the safeguard, we seem to need to compare the location at the switch
of the timeline on the master with the last replay location on the standby.
If the latter location is ahead AND the timeline ID of the standby is not
the same as that of the master, we should emit a warning and terminate the
replication connection.
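
As a rough illustration of both points -- picking the most advanced standby and
the proposed safeguard -- with pg_last_xlog_replay_location() values in their
usual 'X/Y' text form (the helper names are made up):

    def parse_lsn(text):
        hi, lo = text.split("/")
        return (int(hi, 16) << 32) | int(lo, 16)

    def most_advanced(replay_locations):
        # e.g. {"standby1": "0/3000060", "standby2": "0/3000148"} -> "standby2"
        return max(replay_locations, key=lambda s: parse_lsn(replay_locations[s]))

    def safeguard_ok(master_tli, master_switch_lsn, standby_tli, standby_replay_lsn):
        # Refuse replication if the standby has replayed past the master's
        # timeline switch point but is still on a different timeline.
        diverged = (parse_lsn(standby_replay_lsn) > parse_lsn(master_switch_lsn)
                    and standby_tli != master_tli)
        return not diverged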

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Issues with Quorum Commit

2010-10-12 Thread Fujii Masao
On Sat, Oct 9, 2010 at 12:12 AM, Markus Wanner mar...@bluegap.ch wrote:
 On 10/08/2010 04:48 PM, Fujii Masao wrote:
 I believe many systems require write-availability.

 Sure. Make sure you have enough standbies to fail over to.

Unfortunately even enough standbys don't increase write-availability
unless you choose wait-forever. Because, after promoting one of
standbys to new master, you must keep all the transactions waiting
until at least one standby has connected to and caught up with new
master. Currently this wait time is not short.

 (I think there are even more situations where read-availability is much
 more important, though).

Even so, we should not ignore the write-availability aspect.

 Start with 0 (i.e. replication off), then add standbies, then increase
 quorum_commit to your new requirements.

 No. This only makes the procedure of failover more complex.

 Huh? This doesn't affect fail-over at all. Quite the opposite, the
 guarantees and requirements remain the same even after a fail-over.

Hmm.. that increases the number of procedures which the users must
perform at the failover. At least, the users seem to have to wait
until the standby has caught up with new master, increase quorum_commit
and then reload the configuration file.

 What is a full-cluster crash?

 The event that all of your cluster nodes are down (most probably due to
 power failure, but fires or other catastrophic events can be other
 causes). Chances for that to happen can certainly be reduced by
 distributing to distant locations, but that equally certainly increases
 latency, which isn't always an option.

Yep.

 Why does it cause a split-brain?

 First, master node A fails, a standby B takes over, but then fails as
 well. Let node C take over. Then the power aggregate catches fire, the
 infamous full-cluster crash (where lights out management gets a
 completely new meaning ;-) ).

 Split brain would be the situation that arises if all three nodes (A, B
 and C) start up again and think they have been the former master, so
 they can now continue to apply new transactions. Their data diverges,
 leading to what could be seen as a split-brain from the outside.

 Obviously, you must disallow A and B to take the role of the master
 after recovery.

Yep. Something like STONITH would be required.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Issues with Quorum Commit

2010-10-12 Thread Fujii Masao
On Sat, Oct 9, 2010 at 1:41 AM, Josh Berkus j...@agliodbs.com wrote:

 And, I'd like to know whether the master waits forever because of the
 standby failure in other solutions such as Oracle DataGuard, MySQL
 semi-synchronous replication.

 MySQL used to be fond of simply failing silently.  Not sure what 5.4 does,
 or Oracle.  In any case MySQL's replication has always really been async
 (except Cluster, which is a very different database), so it's not really a
 comparison.

IIRC, MySQL *semi-synchronous* replication is not async, so it can be a
fair comparison. Of course, MySQL's default replication is async.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Issues with Quorum Commit

2010-10-12 Thread Fujii Masao
On Sat, Oct 9, 2010 at 4:31 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Yes. But if there is no unsent WAL when the master goes down,
 we can start new standby without new backup by copying the
 timeline history file from new master to new standby and
 setting recovery_target_timeline to 'latest'.

 .. and restart the standby.

Yes.

 It's a pretty severe shortcoming at the moment. For starters, it means that
 you need a shared archive, even if you set wal_keep_segments to a high
 number. Secondly, it's a lot of scripting to get it working, I don't like
 the thought of testing failovers in synchronous replication if I have to do
 all that. Frankly, this seems more important to me than synchronous
 replication.

There seems to be a difference in outlook between us. I prefer sync rep.
But I'm OK with addressing that first if it's not hard.

 It shouldn't be too hard to fix. Walsender needs to be able to read WAL from
 preceding timelines, like recovery does, and walreceiver needs to write the
 incoming WAL to the right file.

And walsender seems to need to transfer the current timeline history to
the standby. Otherwise, the standby cannot recover the WAL file with new
timeline. And the standby might need to create the timeline history file
in order to recover the WAL file with new timeline even after it's restarted.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Issues with Quorum Commit

2010-10-11 Thread Markus Wanner
Greg,

to me it looks like we have very similar goals, but start from different
preconditions. I absolutely agree with you given the preconditions you
named.

On 10/08/2010 10:04 PM, Greg Smith wrote:
 How is that a new problem?  It's already possible to end up with a
 standby pair that has suffered through some bizarre failure chain such
 that it's not necessarily obvious which of the two systems has the most
 recent set of data on it.  And that's not this project's problem to
 solve.

Thanks for pointing that out. I think that might not have been clear to
me. This limitation of scope certainly makes sense for the Postgres
project in general.

Regards

Markus Wanner



Re: [HACKERS] Issues with Quorum Commit

2010-10-08 Thread Dimitri Fontaine
Greg Smith g...@2ndquadrant.com writes:
[…]
 I don't see this as needing any implementation any more complicated than the
 usual way such timeouts are handled.  Note how long you've been trying to
 reach the standby.  Default to -1 for forever.  And if you hit the timeout,
 mark the standby as degraded and force them to do a proper resync when they
 disconnect.  Once that's done, then they can re-enter sync rep mode again,
 via the same process a new node would have done so.

Thank you for this post, which is so much better than anything I could
achieve.

Just wanted to add that it should be possible in lots of cases to have a
standby rejoin the party without getting as far back as taking a new
base backup. Depends on wal_keep_segments and standby's degraded state,
among other parameters (archives, etc).

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support



Re: [HACKERS] Issues with Quorum Commit

2010-10-08 Thread Markus Wanner
On 10/08/2010 12:30 AM, Simon Riggs wrote:
 I do, but it's not a parameter. The k = 1 behaviour is hardcoded and
 considerably simplifies the design. Moving to k > 1 is additional work,
 slows things down and seems likely to be fragile.

Perfect! So I'm all in favor of committing that, but leaving away the
timeout thing, which I think is just adding unneeded complexity and
fragility.

Regards

Markus Wanner





Re: [HACKERS] Issues with Quorum Commit

2010-10-08 Thread Markus Wanner
Simon,

On 10/08/2010 12:25 AM, Simon Riggs wrote:
 Asking for k > 1 does *not* mean those servers are time synchronised.

Yes, it's technically impossible to create a fully synchronized cluster
(on the basis of shared-nothing nodes we are aiming for, that is). There
always is some kind of lag on either side.

Maybe the use case for a no-lag cluster doesn't exist, because it's
technically not feasible.

 In a bad case, those 3 acknowledgements might happen say 5 seconds apart
 on the worst and best of the 3 servers. So the first standby to receive
 the data could have applied the changes ~4.8 seconds prior to the 3rd
 standby. There is still a chance of reading stale data on one standby,
 but reading fresh data on another server. In most cases the time window
 is small, but still exists.

Well, the transaction isn't committed on the master, so one could argue
it shouldn't matter. The guarantee just needs to be one way: as soon as
confirmed committed to the client, all k standbies need to have it
committed, too. (At least for the apply replication level).

 So standbys are eventually consistent whether or not the master relies
 on them to provide an acknowledgement. The only place where you can
 guarantee non-stale data is on the master.

That's formulated a bit too strong. With apply replication level, you
should be able to rely on the guarantee that a committed transaction is
visible on at least k standbies. Maybe in advance of the commit on the
master, but I wouldn't call that stale data.

Given the current proposals, the master is the one that's lagging the
most, compared to the k standbies.

 High values of k reduce the possibility of data loss, whereas expected
 cluster availability is reduced as N - k gets smaller.

Exactly. One addendum: a timeout increases availability at the cost of
increased danger of data loss and higher complexity. Don't use it, just
increase (N - k) instead.

Regards

Markus Wanner



Re: [HACKERS] Issues with Quorum Commit

2010-10-08 Thread Heikki Linnakangas

On 07.10.2010 21:38, Markus Wanner wrote:

On 10/07/2010 03:19 PM, Dimitri Fontaine wrote:

I think you're all into durability, and that's good. The extra cost is
service downtime


It's just *reduced* availability. That doesn't necessarily mean
downtime, if you combine cleverly with async replication.


if that's not what you're after: there's also
availability and load balancing read queries on a system with no lag (no
stale data servicing) when all is working right.


All I'm saying is that those use cases are much better served with async
replication. Maybe together with something that warns and takes action
in case the standby's lag gets too big.

Or what kind of customers do you think really need a no-lag solution for
read-only queries? In the LAN case, the lag of async rep is negligible
and in the WAN case the latencies of sync rep are prohibitive.


There is a very good use case for that particular set up, actually. If 
your hot standby is guaranteed to be up-to-date with any transaction 
that has been committed in the master, you can use the standby 
interchangeably with the master for read-only queries. Very useful for 
load balancing. Imagine a web application that's mostly read-only, but a 
user can modify his own personal details like name and address, for 
example. Imagine that the user changes his street address and clicks 
'save', causing an UPDATE, and the next query fetches that information 
again to display to the user. If you use load balancing, the query can 
be routed to the hot standby server, and if it lags even 1-2 seconds 
behind it's quite possible that it will still return the old address. 
The user will go "WTF, I just changed that!".


That's the load balancing use case, which is quite different from the 
zero data loss on server failure use case that most people here seem 
to be interested in.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Issues with Quorum Commit

2010-10-08 Thread Markus Wanner
On 10/08/2010 04:01 AM, Fujii Masao wrote:
 Really? I don't think that ko-count=0 means wait-forever.

Judging from the documentation, I'd also say it doesn't wait forever by
default. However, please note that there are different parameters for
the initial wait for connection during boot-up (wfc-timeout and
degr-wfc-timeout). So you might want to test what happens on a node failure,
not just in the absence of a standby.

Regards

Markus Wanner



Re: [HACKERS] Issues with Quorum Commit

2010-10-08 Thread Heikki Linnakangas

On 08.10.2010 06:41, Fujii Masao wrote:

On Thu, Oct 7, 2010 at 3:01 AM, Markus Wannermar...@bluegap.ch  wrote:

Of course, it doesn't make sense to wait-forever on *every* standby that
ever gets added. Quorum commit is required, yes (and that's what this
thread is about, IIRC). But with quorum commit, adding a standby only
improves availability, but certainly doesn't block the master in any
way.


But, even with quorum commit, if you choose wait-forever option,
failover would decrease availability. Right after the failover,
no standby has connected to new master, so if quorum >= 1, all
the transactions must wait for a while.


Sure, the new master can't proceed with commits until enough standbys 
have connected to it.



Basically we need to take a base backup from new master to start
the standbys and make them connect to new master.


Do we really need that? I don't think that's acceptable, we'll need to 
fix that if that's the case.


I think you're right, streaming replication doesn't work across timeline 
changes. We left that out of 9.0, to keep things simple, but it seems 
that we really should fix that for 9.1.


You can cross timelines with the archive, though. But IIRC there was 
some issue with that too, you needed to restart the standbys because the 
standby scans what timelines exist at the beginning of recovery, and 
won't notice new timelines that appear after that?


We need to address that, apart from any of the other things discussed 
wrt. synchronous replication. It will benefit asynchronous replication 
too. IMHO *that* is the next thing we should do, the next patch we commit.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Issues with Quorum Commit

2010-10-08 Thread Simon Riggs
On Fri, 2010-10-08 at 09:52 +0200, Markus Wanner wrote:

 One addendum: a timeout increases availability at the cost of
 increased danger of data loss and higher complexity. Don't use it,
 just increase (N - k) instead. 

Completely agree.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services





Re: [HACKERS] Issues with Quorum Commit

2010-10-08 Thread Markus Wanner
On 10/08/2010 05:41 AM, Fujii Masao wrote:
 But, even with quorum commit, if you choose wait-forever option,
 failover would decrease availability. Right after the failover,
 no standby has connected to new master, so if quorum >= 1, all
 the transactions must wait for a while.

That's a point, yes. But again, this is just write-availability, you can
happily read from all active standbies. And connection time is certainly
negligible compared to any kind of timeout (which certainly needs to be
way bigger than a couple of network round-trips).

 Basically we need to take a base backup from new master to start
 the standbys and make them connect to new master. This might take
 a long time. Since transaction commits cannot advance for that time,
 availability would go down.

Just don't increase your quorum_commit to unreasonable values which your
hardware cannot possibly satisfy. It doesn't make sense to set a
quorum_commit of 1 or even bigger, if you don't already have a standby
attached.

Start with 0 (i.e. replication off), then add standbies, then increase
quorum_commit to your new requirements.

 Or you think that wait-forever option is applied only when the
 standby goes down?

That wouldn't work in case of a full-cluster crash, where the
wait-forever option is required again. Otherwise you risk a split-brain
situation.

Regards

Markus Wanner



Re: [HACKERS] Issues with Quorum Commit

2010-10-08 Thread Heikki Linnakangas

On 08.10.2010 01:25, Simon Riggs wrote:

On Thu, 2010-10-07 at 13:44 -0400, Aidan Van Dyk wrote:


To get non-stale responses, you can only query those k=3 servers.
But you've shot yourself in the foot because you don't know which
3/10 those will be.  The other 7 *are* stale (by definition).  They
talk about picking the caught up slave when the master fails, but
you actually need to do that for *every query*.


There is a big confusion around that point and I need to point out that
statement isn't accurate. It's taken me a long while to understand this.

Asking for k > 1 does *not* mean those servers are time synchronised.
All it means is that the master will stop waiting after 3
acknowledgements. There is no connection between the master receiving
acknowledgements and the standby applying changes received from master;
the standbys are all independent of one another.

In a bad case, those 3 acknowledgements might happen say 5 seconds apart
on the worst and best of the 3 servers. So the first standby to receive
the data could have applied the changes ~4.8 seconds prior to the 3rd
standby. There is still a chance of reading stale data on one standby,
but reading fresh data on another server. In most cases the time window
is small, but still exists.

The other 7 are stale with respect to the first 3. But then so are the
last 9 compared with the first one. The value of k has nothing
whatsoever to do with the time difference between the master and the
last standby to receive/apply the changes. The gap between first and
last standby (i.e. N, not k) is the time window during which a query
might/might not see a particular committed result.

So standbys are eventually consistent whether or not the master relies
on them to provide an acknowledgement. The only place where you can
guarantee non-stale data is on the master.


Yes, that's a good point. Synchronous replication for load-balancing 
purposes guarantees that when *you* perform a commit, after it finishes 
it will be visible in all standbys. But if you run the same query across 
different standbys, you're not guaranteed to get the same results. If you just 
pick a random server for every query, you might even see time moving 
backwards. Affinity is definitely a good idea for the load-balancing 
scenario, but even then the anomaly is possible if you get re-routed to 
a different server because the one you were bound to dies.
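
To put the timing argument in concrete terms, a toy sketch (Python, numbers
invented): the master returns from commit once k acknowledgements are in, while
the remaining standbys may still lag:

    def commit_returns_at(ack_times, k):
        # The commit on the master completes when the k-th acknowledgement arrives.
        return sorted(ack_times)[k - 1]

    acks = {"s1": 1.0, "s2": 1.2, "s3": 5.8}   # seconds after commit was issued
    done = commit_returns_at(list(acks.values()), k=2)       # -> 1.2
    stale = [s for s, t in acks.items() if t > done]         # -> ['s3'] still stale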


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Issues with Quorum Commit

2010-10-08 Thread Simon Riggs
On Fri, 2010-10-08 at 10:56 +0300, Heikki Linnakangas wrote:
 
  Or what kind of customers do you think really need a no-lag solution for
  read-only queries? In the LAN case, the lag of async rep is negligible
  and in the WAN case the latencies of sync rep are prohibitive.
 
 There is a very good use case for that particular set up, actually. If 
 your hot standby is guaranteed to be up-to-date with any transaction 
 that has been committed in the master, you can use the standby 
 interchangeably with the master for read-only queries. 

This is an important point. It is desirable, but there is no such thing.
We must not take any project decisions based upon that false premise.

Hot Standby is never guaranteed to be up-to-date with master. There is
no such thing as certainty that you have the same data as the master.

All sync rep gives you is a better durability guarantee that the changes
are safe. It doesn't guarantee those changes are transferred to all
nodes prior to making the data changes on any one standby.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services




Re: [HACKERS] Issues with Quorum Commit

2010-10-08 Thread Heikki Linnakangas

On 08.10.2010 11:25, Simon Riggs wrote:

On Fri, 2010-10-08 at 10:56 +0300, Heikki Linnakangas wrote:


Or what kind of customers do you think really need a no-lag solution for
read-only queries? In the LAN case, the lag of async rep is negligible
and in the WAN case the latencies of sync rep are prohibitive.


There is a very good use case for that particular set up, actually. If
your hot standby is guaranteed to be up-to-date with any transaction
that has been committed in the master, you can use the standby
interchangeably with the master for read-only queries.


This is an important point. It is desirable, but there is no such thing.
We must not take any project decisions based upon that false premise.

Hot Standby is never guaranteed to be up-to-date with master. There is
no such thing as certainty that you have the same data as the master.


Synchronous replication in the 'replay' mode is supposed to guarantee 
exactly that, no?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Issues with Quorum Commit

2010-10-08 Thread Markus Wanner
On 10/08/2010 10:27 AM, Heikki Linnakangas wrote:
 Synchronous replication in the 'replay' mode is supposed to guarantee
 exactly that, no?

The master may lag behind, so it's not strictly speaking the same data.

Regards

Markus Wanner




Re: [HACKERS] Issues with Quorum Commit

2010-10-08 Thread Markus Wanner
On 10/08/2010 09:56 AM, Heikki Linnakangas wrote:
 Imagine a web application that's mostly read-only, but a
 user can modify his own personal details like name and address, for
 example. Imagine that the user changes his street address and clicks
 'save', causing an UPDATE, and the next query fetches that information
 again to display to the user.

I don't think that use case justifies sync replication and the
additional network overhead that brings. Latency is low in that case,
okay, but so is the lag for async replication.

Why not tell the load balancer to read from the master for n seconds
after the last write? After that, it should be safe to query standbies
again.
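
A minimal sketch of that routing rule (illustrative; the threshold and helper
names are made up):

    import time

    STICKY_SECONDS = 5   # the "n seconds" above; value is arbitrary

    def note_write(session):
        session["last_write"] = time.time()

    def route_read(session, master, standbys):
        # A recent writer keeps reading from the master to see its own changes.
        if time.time() - session.get("last_write", 0) < STICKY_SECONDS:
            return master
        return standbys[hash(session["id"]) % len(standbys)]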

If the load on the master is the problem, and you want to reduce that by
moving the read-only transactions to the slave, sync replication pretty
certainly won't help you, either, because it actually *increases*
concurrency (by increased commit latency).

Regards

Markus Wanner



Re: [HACKERS] Issues with Quorum Commit

2010-10-08 Thread Simon Riggs
On Fri, 2010-10-08 at 11:27 +0300, Heikki Linnakangas wrote:
 On 08.10.2010 11:25, Simon Riggs wrote:
  On Fri, 2010-10-08 at 10:56 +0300, Heikki Linnakangas wrote:
 
  Or what kind of customers do you think really need a no-lag solution for
  read-only queries? In the LAN case, the lag of async rep is negligible
  and in the WAN case the latencies of sync rep are prohibitive.
 
  There is a very good use case for that particular set up, actually. If
  your hot standby is guaranteed to be up-to-date with any transaction
  that has been committed in the master, you can use the standby
  interchangeably with the master for read-only queries.
 
  This is an important point. It is desirable, but there is no such thing.
  We must not take any project decisions based upon that false premise.
 
  Hot Standby is never guaranteed to be up-to-date with master. There is
  no such thing as certainty that you have the same data as the master.
 
 Synchronous replication in the 'replay' mode is supposed to guarantee 
 exactly that, no?

From the perspective of the person making the change on the master: yes.
If they make the change, wait for commit, then check the value on a
standby, yes it will be there (or a later version).

From the perspective of an observer, randomly selecting a standby for
load balancing purposes: No, they are not guaranteed to see the latest
answer, nor even can they find out whether what they are seeing is the
latest answer.

What sync rep does guarantee is that if the person making the change is
told it succeeded (commit) then that change is safe on at least k other
servers. Sync rep is about guarantees of safety, not observability.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services




Re: [HACKERS] Issues with Quorum Commit

2010-10-08 Thread Markus Wanner
On 10/08/2010 01:44 AM, Greg Smith wrote:
 They'll use Sync Rep to maximize
 the odds a system failure doesn't cause any transaction loss.  They'll
 use good quality hardware on the master so it's unlikely to fail.

..unlikely to fail?

Ehm.. is that you speaking, Greg? ;-)

 But
 when the database finds the standby unreachable, and it's left with the
 choice between either degrading into async rep or coming to a complete
 halt, you must give people the option of choosing to degrade instead
 after a timeout.  Let them set off the red flashing lights, sound the
 alarms, and pray the master doesn't go down until you can fix the
 problem.

Okay, okay, fair enough - if there had been red flashing lights. And
alarms. And bells and whistles. But that's what I'm afraid the timeout
is removing.

 I don't see this as needing any implementation any more complicated than
 the usual way such timeouts are handled.  Note how long you've been
 trying to reach the standby.  Default to -1 for forever.  And if you hit
 the timeout, mark the standby as degraded

..and how do you make sure you are not marking your second standby as
degraded just because it's currently lagging? Effectively degrading the
utterly needed one, because your first standby has just bitten the dust?

And how do you prevent the split brain situation in case the master dies
shortly after these events, but fails to come up again immediately?

Your list of data recovery projects will get larger and the projects
more complicated. Because there's a lot more to it than just the
implementation of a timeout.

Regards

Markus Wanner



Re: [HACKERS] Issues with Quorum Commit

2010-10-08 Thread Markus Wanner
On 10/08/2010 11:00 AM, Simon Riggs wrote:
 From the perspective of an observer, randomly selecting a standby for
 load balancing purposes: No, they are not guaranteed to see the latest
 answer, nor even can they find out whether what they are seeing is the
 latest answer.

I completely agree. The application (or at least the load balancer)
needs to be aware of that fact.

Regards

Markus



Re: [HACKERS] Issues with Quorum Commit

2010-10-08 Thread Dimitri Fontaine
Markus Wanner mar...@bluegap.ch writes:
 ..and how do you make sure you are not marking your second standby as
 degraded just because it's currently lagging? 

Well, in sync rep, a standby that's not able to stay under the timeout
is degraded. Full stop. The presence of the timeout (or its value not
being -1) means that the admin has chosen this definition.

 Effectively degrading the
 utterly needed one, because your first standby has just bitten the
 dust?

Well, now you have a worst case scenario: first standby is dead and the
remaining one was not able to keep up. You have lost all your master's
failover replacements.

 And how do you prevent the split brain situation in case the master dies
 shortly after these events, but fails to come up again immediately?

Same old story. Either you're able to try and fix the master so that you
don't lose any data and don't even have to check for that, or you take a
risk and start from a non synced standby. It's all availability against
durability again.

What I really want us to be able to provide is the clear facts so that
whoever has to take the decision is able to. Meaning, here, that it
should be easy to see that neither standby is in sync at this
point.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support



Re: [HACKERS] Issues with Quorum Commit

2010-10-08 Thread Markus Wanner
On 10/08/2010 11:41 AM, Dimitri Fontaine wrote:
 Same old story. Either you're able to try and fix the master so that you
 don't lose any data and don't even have to check for that, or you take a
 risk and start from a non synced standby. It's all availability against
 durability again.

..and a whole lot of manual work, that's prone to error for something
that could easily be automated, at certainly less than 2000 EUR initial,
additional cost (if any at all, in case you already have three servers).
Sorry, I still fail to understand that use case.

It reminds me of the customer that wanted to save the cost of the BBU
and ran with fsync=off. Until his server got down due to a power outage.

But yeah, we provide that option as well, yes. Point taken.

Regards

Markus Wanner



Re: [HACKERS] Issues with Quorum Commit

2010-10-08 Thread Dimitri Fontaine
Markus Wanner mar...@bluegap.ch writes:
 ..and a whole lot of manual work, that's prone to error for something
 that could easily be automated

So, the master just crashed, first standby is dead and second ain't in
sync. What's the easy and automated way out? Sorry, I need a hand here.

-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support



Re: [HACKERS] Issues with Quorum Commit

2010-10-08 Thread Fujii Masao
On Fri, Oct 8, 2010 at 5:07 PM, Markus Wanner mar...@bluegap.ch wrote:
 On 10/08/2010 04:01 AM, Fujii Masao wrote:
 Really? I don't think that ko-count=0 means wait-forever.

 Telling from the documentation, I'd also say it doesn't wait forever by
 default. However, please note that there are different parameters for
 the initial wait for connection during boot up (wfc-timeout and
 degr-wfc-timeout). So you might to test what happens on a node failure,
 not just absence of a standby.

Unfortunately I've already taken down my DRBD environment. As far as
I heard from my colleague who is familiar with DRBD, standby node
failure doesn't prevent the master from writing data to the DRBD disk
by default. If there is a DRBD environment available around me, I'll try
the test.

And, I'd like to know whether the master waits forever because of the
standby failure in other solutions such as Oracle DataGuard, MySQL
semi-synchronous replication.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Issues with Quorum Commit

2010-10-08 Thread Tom Lane
Greg Smith g...@2ndquadrant.com writes:
 I don't see this as needing any implementation any more complicated than 
 the usual way such timeouts are handled.  Note how long you've been 
 trying to reach the standby.  Default to -1 for forever.  And if you hit 
 the timeout, mark the standby as degraded and force them to do a proper 
 resync when they disconnect.  Once that's done, then they can re-enter 
 sync rep mode again, via the same process a new node would have done so.

Well, actually, that's *considerably* more complicated than just a
timeout.  How are you going to mark the standby as degraded?  The
standby can't keep that information, because it's not even connected
when the master makes the decision.  ISTM that this requires

1. a unique identifier for each standby (not just role names that
multiple standbys might share);

2. state on the master associated with each possible standby -- not just
the ones currently connected.

Both of those are perhaps possible, but the sense I have of the
discussion is that people want to avoid them.

Actually, #2 seems rather difficult even if you want it.  Presumably
you'd like to keep that state in reliable storage, so it survives master
crashes.  But how you gonna commit a change to that state, if you just
lost every standby (suppose master's ethernet cable got unplugged)?
Looks to me like it has to be reliable non-replicated storage.  Leaving
aside the question of how reliable it can really be if not replicated,
it's still the case that we have noplace to put such information given
the WAL-is-across-the-whole-cluster design.

regards, tom lane



Re: [HACKERS] Issues with Quorum Commit

2010-10-08 Thread Fujii Masao
On Fri, Oct 8, 2010 at 5:10 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Do we really need that?

Yes. But if there is no unsent WAL when the master goes down,
we can start new standby without new backup by copying the
timeline history file from new master to new standby and
setting recovery_target_timeline to 'latest'. In this case,
new standby advances the recovery to the latest timeline ID
which new master uses before connecting to the master.

This seems to have been successful in my test environment, though I
might be missing something.

 I don't think that's acceptable, we'll need to fix
 that if that's the case.

Agreed.

 You can cross timelines with the archive, though. But IIRC there was some
 issue with that too, you needed to restart the standbys because the standby
 scans what timelines exist at the beginning of recovery, and won't notice
 new timelines that appear after that?

Yes.

 We need to address that, apart from any of the other things discussed wrt.
 synchronous replication. It will benefit asynchronous replication too. IMHO
 *that* is the next thing we should do, the next patch we commit.

You mean to commit that capability before synchronous replication? If so,
I disagree with you. I think that it's not easy to address that problem.
So I'm worried that implementing that capability first means missing
sync rep in 9.1.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Issues with Quorum Commit

2010-10-08 Thread Markus Wanner
On 10/08/2010 04:11 PM, Tom Lane wrote:
 Actually, #2 seems rather difficult even if you want it.  Presumably
 you'd like to keep that state in reliable storage, so it survives master
 crashes.  But how you gonna commit a change to that state, if you just
 lost every standby (suppose master's ethernet cable got unplugged)?

IIUC you seem to assume that the master node keeps its master role. But
users who value availability a lot certainly want automatic fail-over,
so any node can potentially be the new master.

After recovery from a full-cluster outage, the first question is which
node was the most recent master (or which former standby is up to date
and could take over).

Regards

Markus Wanner



Re: [HACKERS] Issues with Quorum Commit

2010-10-08 Thread Dimitri Fontaine
Tom Lane t...@sss.pgh.pa.us writes:
 Well, actually, that's *considerably* more complicated than just a
 timeout.  How are you going to mark the standby as degraded?  The
 standby can't keep that information, because it's not even connected
 when the master makes the decision.  ISTM that this requires

 1. a unique identifier for each standby (not just role names that
 multiple standbys might share);

 2. state on the master associated with each possible standby -- not just
 the ones currently connected.

 Both of those are perhaps possible, but the sense I have of the
 discussion is that people want to avoid them.

What we'd like to avoid is for the users to have to cope with such
needs. Now, if that's internal to the code and automatic, that's not the
same thing at all.

What I'd have in mind is a Database standby system identifier that
would be part of the initial handshake in the replication protocol. And
a system function to be able to unregister the standby.

 Actually, #2 seems rather difficult even if you want it.  Presumably
 you'd like to keep that state in reliable storage, so it survives master
 crashes.  But how you gonna commit a change to that state, if you just
 lost every standby (suppose master's ethernet cable got unplugged)?

I don't see that as a huge problem myself, because I'm already well sold
to the per-transaction replication-synchronous behaviour. So any change
done there by the master would be hard-coded as async. What am I missing?

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support



Re: [HACKERS] Issues with Quorum Commit

2010-10-08 Thread Markus Wanner
On 10/08/2010 12:05 PM, Dimitri Fontaine wrote:
 Markus Wanner mar...@bluegap.ch writes:
 ..and a whole lot of manual work, that's prone to error for something
 that could easily be automated
 
 So, the master just crashed, first standby is dead and second ain't in
 sync. What's the easy and automated way out? Sorry, I need a hand here.

Thinking this through, I'm realizing that this can potentially work
automatically with three nodes in both cases. Each node needs to keep
track of whether or not it is (or became) the master - and when (a Lamport
timestamp, maybe, not necessarily wall clock). A new master might
continue to commit new transactions after a fail-over, without the old
master being able to record that fact (because it's down).
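
As a purely illustrative sketch of that idea (the table and the query are
hypothetical, nothing like this exists today), each node could keep a tiny
local log of its own promotions and compare counters after a full-cluster
restart:

  -- hypothetical local state, kept independently on every node
  CREATE TABLE local_promotion_log (
      promotion_seq bigint PRIMARY KEY,  -- Lamport-style counter, not wall clock
      promoted_at   timestamptz NOT NULL DEFAULT now()
  );

  -- after a full-cluster crash, ask each recovered node for its highest
  -- counter; the node reporting the largest value was the most recent master
  SELECT max(promotion_seq) FROM local_promotion_log;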

This means there's a different requirement after a full-cluster crash
(i.e. master failure and no up-to-date standby is available). With the
timeout, you absolutely need the former master to come back up again for
zero data loss, no matter what your quorum_commit setting was. To be
able to automatically tell who was the most recent master, you need to
query the state of all other nodes, because they could be a more recent
master. If that's not possible (or not feasible, because the replacement
part isn't currently available), you are at risk of data loss.

With the given three node scenario, the zero data loss guarantee only
holds true as long as either at least one node (that is in sync) is
running or if you can recover the former master after a full cluster crash.

When waiting forever, you only need one of the k nodes to come back up
again. You also need to query other nodes to find out which of the N nodes
the k in-sync ones are, but being able to recover (N - k + 1) nodes is
sufficient to figure that out: at most (N - k) nodes can lack the latest
transactions, so any (N - k + 1) recovered nodes must include one that has
them. So any (k - 1) nodes may fail, even permanently, at any point in
time, and you are still not at risk of losing data. (Nor at risk of losing
availability, BTW.) I'm still of the opinion that that's the far easier
and clearer guarantee.

Also note that with higher values of N, this gets more and more
important, because the chance of being able to recover all N nodes after a
full crash shrinks with increasing N (while the time required to do so
increases). But maybe the current sync rep feature doesn't need to
target setups with that many nodes.

I certainly agree that either way is complicated to implement. With
Postgres-R, I'm clearly going the way that's able to satisfy large
numbers of nodes.

Thanks for an interesting discussion. And for respectful disagreement.

Regards

Markus Wanner



Re: [HACKERS] Issues with Quorum Commit

2010-10-08 Thread Tom Lane
Markus Wanner mar...@bluegap.ch writes:
 On 10/08/2010 04:11 PM, Tom Lane wrote:
 Actually, #2 seems rather difficult even if you want it.  Presumably
 you'd like to keep that state in reliable storage, so it survives master
 crashes.  But how you gonna commit a change to that state, if you just
 lost every standby (suppose master's ethernet cable got unplugged)?

 IIUC you seem to assume that the master node keeps its master role. But
 users who value availability a lot certainly want automatic fail-over,

Huh?  Surely loss of the slaves shouldn't force a failover.  Maybe the
slaves really are all dead.

regards, tom lane



Re: [HACKERS] Issues with Quorum Commit

2010-10-08 Thread Markus Wanner
On 10/08/2010 04:38 PM, Tom Lane wrote:
 Markus Wanner mar...@bluegap.ch writes:
 IIUC you seem to assume that the master node keeps its master role. But
 users who value availability a lot certainly want automatic fail-over,
 
 Huh?  Surely loss of the slaves shouldn't force a failover.  Maybe the
 slaves really are all dead.

I think we are talking across each other. I'm speaking about the need to
be able to fail-over to a standby in case the master fails.

In case of a full-cluster crash after such a fail-over, you need to take
care you don't enter split brain. Some kind of STONITH, lamport clock,
or what not. Figuring out which node has been the most recent (and thus
most up to date) master is far from trivial.

(See also my mail in answer to Dimitri a few minutes ago).

Regards

Markus Wanner



Re: [HACKERS] Issues with Quorum Commit

2010-10-08 Thread Simon Riggs
On Fri, 2010-10-08 at 10:11 -0400, Tom Lane wrote:

 1. a unique identifier for each standby (not just role names that
 multiple standbys might share);

That is difficult because each standby is identical. If a standby goes
down, people can regenerate a new standby by taking a copy from another
standby. What number do we give this new standby?...

 2. state on the master associated with each possible standby -- not just
 the ones currently connected.
 
 Both of those are perhaps possible, but the sense I have of the
 discussion is that people want to avoid them.

Yes, I really want to avoid such issues and the complexities we would
likely get into trying to solve them. In reality they should not be common,
because they only arise if the sysadmin has not configured a sufficient
number of redundant standbys.

My proposed design is that the timeout does not cause the standby to be
marked as degraded. It is up to the user to decide whether they wait,
or whether they progress without sync rep. Or the sysadmin can release the
waiters via a function call.

If the cluster does become degraded, the sysadmin just generates a new
standby and plugs it back into the cluster and away we go. Simple, no
state to be recorded and no state to get screwed up either. I don't
think we should be spending too much time trying to help people who say
they want additional durability guarantees but do not match that with
sufficient hardware resources to make it happen smoothly.

If we do try to tackle those problems who will be able to validate our
code actually works?

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services




Re: [HACKERS] Issues with Quorum Commit

2010-10-08 Thread Fujii Masao
On Fri, Oct 8, 2010 at 5:16 PM, Markus Wanner mar...@bluegap.ch wrote:
 On 10/08/2010 05:41 AM, Fujii Masao wrote:
 But, even with quorum commit, if you choose wait-forever option,
 failover would decrease availability. Right after the failover,
 no standby has connected to new master, so if quorum = 1, all
 the transactions must wait for a while.

 That's a point, yes. But again, this is just write-availability, you can
 happily read from all active standbies.

I believe many systems require write-availability.

 Basically we need to take a base backup from new master to start
 the standbys and make them connect to new master. This might take
 a long time. Since transaction commits cannot advance for that time,
 availability would goes down.

 Just don't increase your quorum_commit to unreasonable values which your
 hardware cannot possible satisfy. It doesn't make sense to set a
 quorum_commit of 1 or even bigger, if you don't already have a standby
 attached.

 Start with 0 (i.e. replication off), then add standbies, then increase
 quorum_commit to your new requirements.

No. This only makes the procedure of failover more complex.

 Or you think that wait-forever option is applied only when the
 standby goes down?

 That wouldn't work in case of a full-cluster crash, where the
 wait-forever option is required again. Otherwise you risk a split-brain
 situation.

What is a full-cluster crash? Why does it cause a split-brain?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Issues with Quorum Commit

2010-10-08 Thread Fujii Masao
On Fri, Oct 8, 2010 at 6:00 PM, Simon Riggs si...@2ndquadrant.com wrote:
 From the perspective of an observer, randomly selecting a standby for
 load balancing purposes: No, they are not guaranteed to see the latest
 answer, nor even can they find out whether what they are seeing is the
 latest answer.

To guarantee that each standby returns the same result, we would need to
use the cluster-wide snapshot to run queries. IIRC, Postgres-XC provides
that feature. Though I'm not sure if it can be applied in HS/SR.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Issues with Quorum Commit

2010-10-08 Thread Simon Riggs
On Fri, 2010-10-08 at 23:55 +0900, Fujii Masao wrote:
 On Fri, Oct 8, 2010 at 6:00 PM, Simon Riggs si...@2ndquadrant.com wrote:
  From the perspective of an observer, randomly selecting a standby for
  load balancing purposes: No, they are not guaranteed to see the latest
  answer, nor even can they find out whether what they are seeing is the
  latest answer.
 
 To guarantee that each standby returns the same result, we would need to
 use the cluster-wide snapshot to run queries. IIRC, Postgres-XC provides
 that feature. Though I'm not sure if it can be applied in HS/SR.

That is my understanding.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services




Re: [HACKERS] Issues with Quorum Commit

2010-10-08 Thread Markus Wanner
On 10/08/2010 04:47 PM, Simon Riggs wrote:
 Yes, I really want to avoid such issues and likely complexities we get
 into trying to solve them. In reality they should not be common because
 it only happens if the sysadmin has not configured sufficient number of
 redundant standbys.

Well, full cluster outages are infrequent, but sadly cannot be avoided
entirely. (Murphy's laughing). IMO we should be prepared to deal with
those. Or am I understanding you wrongly here?

 I don't
 think we should be spending too much time trying to help people that say
 they want additional durability guarantees but do not match that with
 sufficient hardware resources to make it happen smoothly.

I fully agree to that statement.

Regards

Markus Wanner



Re: [HACKERS] Issues with Quorum Commit

2010-10-08 Thread Markus Wanner
On 10/08/2010 04:48 PM, Fujii Masao wrote:
 I believe many systems require write-availability.

Sure. Make sure you have enough standbys to fail over to.

(I think there are even more situations where read-availability is much
more important, though).

 Start with 0 (i.e. replication off), then add standbies, then increase
 quorum_commit to your new requirements.
 
 No. This only makes the procedure of failover more complex.

Huh? This doesn't affect fail-over at all. Quite the opposite, the
guarantees and requirements remain the same even after a fail-over.

 What is a full-cluster crash?

The event that all of your cluster nodes are down (most probably due to
power failure, but fires or other catastrophic events can be other
causes). Chances for that to happen can certainly be reduced by
distributing to distant locations, but that equally certainly increases
latency, which isn't always an option.

 Why does it cause a split-brain?

First, master node A fails, a standby B takes over, but then fails as
well. Let node C take over. Then the power aggregate catches fire, the
infamous full-cluster crash (where lights out management gets a
completely new meaning ;-) ).

Split brain would be the situation that arises if all three nodes (A, B
and C) start up again and think they have been the former master, so
they can now continue to apply new transactions. Their data diverges,
leading to what could be seen as a split-brain from the outside.

Obviously, you must disallow A and B from taking the role of the master
after recovery. Ideally, C would continue as the master. However, if the
fire destroyed node C, let's hope you had another (sync!) standby that
can act as the new master. Otherwise you've lost data.

Hope that explains it. Wikipedia certainly provides a better (and less
Postgres colored) explanation.

Regards

Markus Wanner



Re: [HACKERS] Issues with Quorum Commit

2010-10-08 Thread Josh Berkus



And, I'd like to know whether the master waits forever because of the
standby failure in other solutions such as Oracle DataGuard, MySQL
semi-synchronous replication.


MySQL used to be fond of simply failing silently.  Not sure what 5.4 
does, or Oracle.  In any case MySQL's replication has always really been 
async (except Cluster, which is a very different database), so it's not 
really a comparison.


Here are the comparables:
Oracle DataGuard
DRBD
SQL Server
DB2

If anyone knows what the above do by default, please speak up!

--
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com



Re: [HACKERS] Issues with Quorum Commit

2010-10-08 Thread Rob Wultsch

On 10/8/10, Fujii Masao masao.fu...@gmail.com wrote:
 On Fri, Oct 8, 2010 at 5:10 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 Do we really need that?

 Yes. But if there is no unsent WAL when the master goes down,
 we can start new standby without new backup by copying the
 timeline history file from new master to new standby and
 setting recovery_target_timeline to 'latest'. In this case,
 new standby advances the recovery to the latest timeline ID
 which new master uses before connecting to the master.

 This seems to have been successful in my test environment.
 Though maybe I'm missing something.

 I don't think that's acceptable, we'll need to fix
 that if that's the case.

 Agreed.

 You can cross timelines with the archive, though. But IIRC there was some
 issue with that too, you needed to restart the standbys because the
 standby
 scans what timelines exist at the beginning of recovery, and won't notice
 new timelines that appear after that?

 Yes.

 We need to address that, apart from any of the other things discussed wrt.
 synchronous replication. It will benefit asynchronous replication too.
 IMHO
 *that* is the next thing we should do, the next patch we commit.

 You mean to commit that capability before synchronous replication? If so,
 I disagree with you. I think that it's not easy to address that problem.
 So I'm worried that implementing that capability first means missing
 sync rep in 9.1.

 Regards,

 --
 Fujii Masao
 NIPPON TELEGRAPH AND TELEPHONE CORPORATION
 NTT Open Source Software Center




-- 
Rob Wultsch
wult...@gmail.com



Re: [HACKERS] Issues with Quorum Commit

2010-10-08 Thread Heikki Linnakangas

On 08.10.2010 17:26, Fujii Masao wrote:

On Fri, Oct 8, 2010 at 5:10 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

Do we really need that?


Yes. But if there is no unsent WAL when the master goes down,
we can start new standby without new backup by copying the
timeline history file from new master to new standby and
setting recovery_target_timeline to 'latest'.


.. and restart the standby.


In this case,
new standby advances the recovery to the latest timeline ID
which new master uses before connecting to the master.

This seems to have been successful in my test environment.
Though maybe I'm missing something.


Yeah, that should work, but it's awfully complicated.
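
Roughly, the procedure being described comes down to something like this on
the standby being re-pointed at the promoted master (just a sketch; the host
name is made up, the parameters are the existing recovery.conf ones):

  # recovery.conf on the re-pointed standby
  standby_mode = 'on'
  primary_conninfo = 'host=new-master port=5432'
  # follow the timeline switch created by the promotion; this needs the new
  # master's timeline history file (e.g. 00000002.history) copied into this
  # standby's pg_xlog, or reachable through restore_command
  recovery_target_timeline = 'latest'

...plus the restart mentioned above.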


I don't think that's acceptable, we'll need to fix
that if that's the case.


Agreed.


You can cross timelines with the archive, though. But IIRC there was some
issue with that too, you needed to restart the standbys because the standby
scans what timelines exist at the beginning of recovery, and won't notice
new timelines that appear after that?


Yes.


We need to address that, apart from any of the other things discussed wrt.
synchronous replication. It will benefit asynchronous replication too. IMHO
*that* is the next thing we should do, the next patch we commit.


You mean to commit that capability before synchronous replication? If so,
I disagree with you. I think that it's not easy to address that problem.
So I'm worried that implementing that capability first means missing
sync rep in 9.1.


It's a pretty severe shortcoming at the moment. For starters, it means 
that you need a shared archive, even if you set wal_keep_segments to a 
high number. Secondly, it's a lot of scripting to get it working; I 
don't like the thought of testing failovers in synchronous replication 
if I have to do all that. Frankly, this seems more important to me than 
synchronous replication.


It shouldn't be too hard to fix. Walsender needs to be able to read WAL 
from preceding timelines, like recovery does, and walreceiver needs to 
write the incoming WAL to the right file.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Issues with Quorum Commit

2010-10-08 Thread Greg Smith

Markus Wanner wrote:

..and how do you make sure you are not marking your second standby as
degraded just because it's currently lagging? Effectively degrading the
utterly needed one, because your first standby has just bitten the dust?
  


People are going to monitor the standby lag.  If it grows toward the 
known timeout, the flashing yellow lights should go off at that point, 
before things get that bad.  And if you've set a reasonable, 
business-oriented timeout on how long you can stand for the master to be 
held up waiting for a lagging standby, the right thing to do may very 
well be to cut it off.  At some point people will want to stop waiting 
for a standby if it's taking so long to commit that it's interfering 
with the ability of the master to operate normally.  Such a master is 
already degraded, if your performance metrics for availability include 
processing transactions in a timely manner.
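
One way to do that kind of monitoring with what already shipped in 9.0 (just 
a sketch; the alerting threshold is whatever the business decides):

  -- on the master: the current WAL write position
  SELECT pg_current_xlog_location();

  -- on each standby: how far WAL has been received and replayed
  SELECT pg_last_xlog_receive_location(),
         pg_last_xlog_replay_location();

  -- the gap between the master's position and a standby's receive position
  -- is the lag to graph and alert on, well before any sync rep timeout fires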



And how do you prevent the split brain situation in case the master dies
shortly after these events, but fails to come up again immediately?
  


How is that a new problem?  It's already possible to end up with a 
standby pair that has suffered through some bizarre failure chain such 
that it's not necessarily obvious which of the two systems has the most 
recent set of data on it.  And that's not this project's problem to 
solve.  Useful answers to the split-brain problem involve fencing 
implementations that normally drop to the hardware level, and clustering 
solutions that include those features are already available for 
PostgreSQL to integrate into.  Assuming you have to solve this in order 
to deliver a useful database replication component is excessively 
ambitious.

You seem to be under the assumption that a more complicated replication 
implementation here will make reaching a bad state impossible.  I think 
that's optimistic, both in theory and in regards to how successful code 
gets built.  Here's the thing:  the difficulty of testing to prove your 
code actually works is also proportional to that complexity.  This 
project can choose to commit and potentially ship a simple solution that 
has known limitations, and expect that people will fill in the gap with 
existing add-on software to handle the clustering parts it doesn't:  
fencing, virtual IP address assignment, etc.  All while getting useful 
testing feedback on the simple bottom layer, whose main purpose in life 
is to transport WAL data synchronously.  Or, we can argue in favor of 
adding additional complexity on top first instead, so we end up with 
layers and layers of untested code.  That path leads to situations where 
you're lucky to ship at all, and when you do, the result is difficult to 
support.


--
Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us





Re: [HACKERS] Issues with Quorum Commit

2010-10-08 Thread Greg Smith

Tom Lane wrote:

How are you going to mark the standby as degraded?  The
standby can't keep that information, because it's not even connected
when the master makes the decision.


From a high level, I'm assuming only that the master has a list in 
memory of the standby system(s) it believes are up to date, and that it 
is supposed to commit to synchronously.  When I say mark as degraded, I 
mean that the master merely closes whatever communications channel it 
had open with that system and removes the standby from that list.


If that standby now reconnects again, I don't see how resolving what 
happens at that point is any different from when a standby is first 
started after both systems were turned off.  If the standby is current 
with the data available on the master when it has an initial 
conversation, great; it's now available for synchronous commit too 
then.  If it's not, it goes into a catchup mode first instead.  When the 
master sees you're back to current again, if you're on the list of sync 
servers too you go back onto the list of active sync systems.


There shouldn't be any state information to save here.  If the master 
and standby can't figure out if they are in or out of sync with one 
another based on the conversation they have when they first connect to 
one another, that suggests to me there needs to be improvements made in 
the communications protocol they use to exchange messages.


--
Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us





Re: [HACKERS] Issues with Quorum Commit

2010-10-08 Thread Simon Riggs
On Fri, 2010-10-08 at 17:06 +0200, Markus Wanner wrote:
 Well, full cluster outages are infrequent, but sadly cannot be avoided
 entirely. (Murphy's laughing). IMO we should be prepared to deal with
 those. 

I've described how I propose to deal with those. I'm not waving away
these issues, just proposing that we consciously choose simplicity and
therefore robustness.

Let me say it again for clarity. (This is written for the general case,
though my patch uses only k=1 i.e. one acknowledgement):

If we want robustness, we have multiple standbys. So if you lose one,
you continue as normal without interruption. That is the first and most
important line of defence - not software.

When we start to wait, if there aren't sufficient active standbys to
acknowledge a commit, then the commit won't wait. This behaviour helps
us avoid situations where we are hours or days away from having a
working standby to acknowledge the commit. We've had a long debate about
servers that ought to be there but aren't; I suggest we treat standbys
that aren't there as having a strong possibility they won't come back,
and hence not worth waiting for. Heikki disagrees; I have no problem
with adding server registration so that we can add additional waits, but
I doubt that the majority of users prefer waiting over availability. It
can be an option.

Once we are waiting, if insufficient standbys acknowledge the commit we
will wait until the timeout expires, after which we commit and continue
working. If you don't like timeouts, set the timeout to 0 to wait
forever. This behaviour is designed to emphasise availability. (I
acknowledge that some people are so worried by data loss that they would
choose to stop changes altogether, and accept unavailability; I regard
that as a minority use case, but one which I would not argue against
including as an option at some point in the future.)

To cover Dimitri's observation that when a streaming standby first
connects it might take some time before it can sensibly acknowledge, we
don't activate the standby until it has caught up. Once caught up, it
will advertise its capability to offer a sync rep service. Standbys
that don't wish to be failover targets can set
synchronous_replication_service = off.

The paths between servers aren't defined explicitly, so the parameters
all still work even after failover.
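
Putting that together, the configuration under discussion would look roughly
like this (parameter names follow the proposals in this thread and the patch
being discussed; none of this is committed syntax):

  # on the master: how long a committing backend waits for an acknowledgement;
  # 0 means wait forever
  synchronous_replication_timeout = 30s

  # on each standby: advertise a sync rep service once caught up, or not
  synchronous_replication_service = on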
 
-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services




Re: [HACKERS] Issues with Quorum Commit

2010-10-08 Thread Simon Riggs
On Fri, 2010-10-08 at 16:34 -0400, Greg Smith wrote:
 Tom Lane wrote:
  How are you going to mark the standby as degraded?  The
  standby can't keep that information, because it's not even connected
  when the master makes the decision.
 
  From a high level, I'm assuming only that the master has a list in 
 memory of the standby system(s) it believes are up to date, and that it 
 is supposed to commit to synchronously.  When I say mark as degraded, I 
 mean that the master merely closes whatever communications channel it 
 had open with that system and removes the standby from that list.

My current coding works with two sets of parameters: 

The master marks a standby as degraded via the tcp keepalives.
When it notices no response, it kicks out the standby. We already had
this, so I never mentioned it before as being part of the solution.

The second part is synchronous_replication_timeout, which is a
user-settable parameter defining how long the app is prepared to wait, which
could be more or less time than the keepalives.
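
For illustration, the keepalive side of that is just the existing GUCs on the
master (values here are arbitrary examples); the user-settable
synchronous_replication_timeout proposed above is a separate knob on top of
them:

  tcp_keepalives_idle = 30      # seconds of idle before keepalive probes start
  tcp_keepalives_interval = 10  # seconds between probes
  tcp_keepalives_count = 3      # unanswered probes before the connection is
                                # considered dead and the standby kicked out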

 If that standby now reconnects again, I don't see how resolving what 
 happens at that point is any different from when a standby is first 
 started after both systems were turned off.  If the standby is current 
 with the data available on the master when it has an initial 
 conversation, great; it's now available for synchronous commit too 
 then.  If it's not, it goes into a catchup mode first instead.  When the 
 master sees you're back to current again, if you're on the list of sync 
 servers too you go back onto the list of active sync systems.
 
 There shouldn't be any state information to save here.  If the master 
 and standby can't figure out if they are in or out of sync with one 
 another based on the conversation they have when they first connect to 
 one another, that suggests to me there needs to be improvements made in 
 the communications protocol they use to exchange messages. 

Agreed.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services




Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Simon Riggs
On Wed, 2010-10-06 at 10:57 -0700, Josh Berkus wrote:

 I also strongly believe that we should get single-standby
 functionality committed and tested *first*, before working further on
 multi-standby.

Yes, let's get k = 1 first.

With k = 1 the number of standbys is not limited, so we can still have
very robust and highly available architectures. So we mean
first-acknowledgement-releases-waiters.

 (1) Consistency: this is another DBA-false-confidence issue.  DBAs who
 implement (1) are liable to do so thinking that they are not only
 guaranteeing the consistency of every standby with the master, but the
 consistency of every standby with every other standby -- a kind of
 dummy multi-master.  They are not, so it will take multiple reminders
 and workarounds in the docs to explain this.  And we'll get complaints
 anyway.

This puts the matter very clearly. Setting k = N is not as good an idea
as it sounds when first described.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services




Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Simon Riggs
On Wed, 2010-10-06 at 10:57 -0700, Josh Berkus wrote:
 (2), (3) Degradation: (Jeff) these two cases make sense only if we
 give
 DBAs the tools they need to monitor which standbys are falling behind,
 and to drop and replace those standbys.  Otherwise we risk giving DBAs
 false confidence that they have better-than-1-standby reliability when
 actually they don't.  Current tools are not really adequate for this.

Current tools work just fine for identifying if a server is falling
behind. This improved in 9.0 to give fine-grained information. Nothing
more is needed here within the server.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services




Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Markus Wanner
On 10/06/2010 10:01 PM, Simon Riggs wrote:
 The code to implement your desired option is
 more complex and really should come later.

I'm sorry, but I think of that exactly the opposite way. The timeout for
automatic continuation after waiting for a standby is the addition. The
wait state of the master is there anyway, whether or not it's bound by a
timeout. The timeout option should thus come later.

Regards

Markus Wanner



Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Dimitri Fontaine
Markus Wanner mar...@bluegap.ch writes:
 I'm just saying that this should be an option, not the only choice.

 I'm sorry, I just don't see the use case for a mode that drops
 guarantees when they are most needed. People who don't need those
 guarantees should definitely go for async replication instead.

We're still talking about freezing the master and all the applications
when the first standby still has to do a base backup and catch-up to
where the master currently is, right?

 What does a synchronous replication mode that falls back to async upon
 failure give you, except for a severe degradation in performance during
 normal operation? Why not use async right away in such a case?

It's all about the standard case you're building, sync rep, and how to
manage errors. In most cases I want flexibility. An alert says the standby is
down, you've lost your durability requirements, so now I'm building a new
standby. Does it mean my applications are all off and the master is
refusing to work? I sure hope I get a choice about that, if possible per
application.

Next step, the old standby has been able to boot again, thanks to the
sysadmins who repaired it, so it's online again, and my replacement
machine is doing a base backup. Are all the applications still
unavailable? I sure hope I have a say in this decision.

 so opening a
 superuser connection to act on the currently waiting transaction is
 still possible (pass/fail, but fail is what at this point? shutdown to
 wait some more offline?).

 Not sure I'm following here. The admin will be busy re-establishing
 (connections to) standbies, killing transactions on the master doesn't
 help anything - whether or not the master waits forever.

The idea here would be to be able to manually ACK a transaction that's
waiting forever, because you know it won't have an answer and you'd
prefer the application to just continue. But I see that's not a valid
use case for you.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support



Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Heikki Linnakangas

On 07.10.2010 12:52, Dimitri Fontaine wrote:

Markus Wannermar...@bluegap.ch  writes:

I'm just saying that this should be an option, not the only choice.


I'm sorry, I just don't see the use case for a mode that drops
guarantees when they are most needed. People who don't need those
guarantees should definitely go for async replication instead.


We're still talking about freezing the master and all the applications
when the first standby still has to do a base backup and catch-up to
where the master currently is, right?


Either that, or you configure your system for asynchronous replication 
first, and flip the switch to synchronous only after the standby has 
caught up. Setting up the first standby happens only once when you 
initially set up the system, or if you're recovering from a catastrophic 
loss of the standby.



What does a synchronous replication mode that falls back to async upon
failure give you, except for a severe degradation in performance during
normal operation? Why not use async right away in such a case?


It's all about the standard case you're building, sync rep, and how to
manage errors. In most cases I want flexibility. Alert says standby is
down, you lost your durability requirements, so now I'm building a new
standby. Does it mean my applications are all off and the master
refusing to work?


Yes. That's why you want to have at least two standbys if you care about 
availability. Or if durability isn't that important to you after all, 
use asynchronous replication.


Of course, if in the heat of the moment the admin is willing to forge 
ahead without the standby, he can temporarily change the configuration 
in the master. If you want the standby to be rebuilt automatically, you 
can even incorporate that configuration change in the scripts too. The 
important point is that you or your scripts are in control, and you know 
at all times whether you can trust the standby or not. If the master 
makes such decisions automatically, you don't know if the standby is 
trustworthy (ie. guaranteed up-to-date) or not.
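
(As a sketch of what "temporarily change the configuration" amounts to in 
practice -- the parameter being relaxed is whatever the sync rep patch ends 
up calling it, so the comment is deliberately vague:)

  -- after editing postgresql.conf on the master to relax or drop the
  -- synchronous requirement, make the running server pick up the change:
  SELECT pg_reload_conf();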



so opening a
superuser connection to act on the currently waiting transaction is
still possible (pass/fail, but fail is what at this point? shutdown to
wait some more offline?).


Not sure I'm following here. The admin will be busy re-establishing
(connections to) standbies, killing transactions on the master doesn't
help anything - whether or not the master waits forever.


The idea here would be to be able to manually ACK a transaction that's
waiting forever, because you know it won't have an answer and you'd
prefer the application to just continue. But I see that's not a valid
use case for you.


I don't see anything wrong with having tools for admins to deal with the 
unexpected. I'm not sure overriding individual transactions is very 
useful, though; more likely you'll want to take the whole server offline, 
or you want to change the config to allow all transactions to continue 
without the synchronous standby.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Dimitri Fontaine
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 Either that, or you configure your system for asynchronous replication
 first, and flip the switch to synchronous only after the standby has caught
 up. Setting up the first standby happens only once when you initially set up
 the system, or if you're recovering from a catastrophic loss of the
 standby.

Or if the standby is lagging and the master wal_keep_segments is not
sized big enough. Is that a catastrophic loss of the standby too?

 It's all about the standard case you're building, sync rep, and how to
 manage errors. In most cases I want flexibility. Alert says standby is
 down, you lost your durability requirements, so now I'm building a new
 standby. Does it mean my applications are all off and the master
 refusing to work?

 Yes. That's why you want to have at least two standbys if you care about
 availability. Or if durability isn't that important to you after all, use
 asynchronous replication.

Agreed, that's a nice simple use case.

Another one is to say that I want sync rep when the standby is
available, but I don't have the budget for more. So I prefer a good
alerting system and low-budget-no-guarantee when the standby is down,
that's my risk evaluation.

 Of course, if in the heat of the moment the admin is willing to forge ahead
 without the standby, he can temporarily change the configuration in the
 master. If you want the standby to be rebuilt automatically, you can even
 incorporate that configuration change in the scripts too. The important
 point is that you or your scripts are in control, and you know at all times
 whether you can trust the standby or not. If the master makes such decisions
 automatically, you don't know if the standby is trustworthy (ie. guaranteed
 up-to-date) or not.

My proposal is that the master has the information to make the decision,
and the behavior is something you set up. Default to safety, so wait
forever and block the applications, but it could be set to ignore standbys
that have not at least reached this state.

I don't see that you can make everybody happy without a knob here, and I
don't see how we can deliver one without a clear state diagram of the
standby's possible states and transitions.

The other alternative is to just not care and accept the timeout as
being an option with the quorum, so that you just don't wait for the
quorum if you so want. It's much more dynamic and dangerous, but with a
good alerting system it'll be very popular, I guess.

 I don't see anything wrong with having tools for admins to deal with the
 unexpected. I'm not sure overriding individual transactions is very useful
 though, more likely you'll want to take the whole server offline, or you
 want to change the config to allow all transactions to continue without the
 synchronous standby.

The question then is, should the new configuration alter running
transactions? My implicit assumption was that it shouldn't, and then I need
another facility, such as

  SELECT pg_cancel_quorum_wait(procpid)
    FROM pg_stat_activity
   WHERE waiting_quorum;

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support



Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Simon Riggs
On Thu, 2010-10-07 at 11:46 +0200, Markus Wanner wrote:
 On 10/06/2010 10:01 PM, Simon Riggs wrote:
  The code to implement your desired option is
  more complex and really should come later.
 
 I'm sorry, but I think of that exactly the opposite way. 

I see why you say that. Dimitri's suggestion is an enhancement of the
basic feature, just as Heikki's is. My reply was directed at Heikki, but
should apply to Dimitri's idea also.

 The timeout for
 automatic continuation after waiting for a standby is the addition. The
 wait state of the master is there anyway, whether or not it's bound by a
 timeout. The timeout option should thus come later.

Adding timeout is very little code. We can take that out of the patch if
that's an objection.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services




Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Markus Wanner
On 10/07/2010 01:08 PM, Simon Riggs wrote:
 Adding timeout is very little code. We can take that out of the patch if
 that's an objection.

Okay. If you take it out, we are at the wait-forever option, right?

If not, I definitely don't understand how you envision things to happen.
I've been asking [1] about that distinction before, but didn't get a
direct answer.

Regards

Markus Wanner


[1]: Re: Configuring synchronous replication, Markus Wanner:
http://archives.postgresql.org/message-id/4c9c5887.4040...@bluegap.ch



Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Robert Haas
On Thu, Oct 7, 2010 at 3:30 AM, Simon Riggs si...@2ndquadrant.com wrote:
 Yes, lets get k = 1 first.

 With k = 1 the number of standbys is not limited, so we can still have
 very robust and highly available architectures. So we mean
 first-acknowledgement-releases-waiters.

+1.  I like the design Greg Smith proposed yesterday (though there are
details to be worked out).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Markus Wanner
Salut Dimitri,

On 10/07/2010 12:32 PM, Dimitri Fontaine wrote:
 Another one is to say that I want sync rep when the standby is
 available, but I don't have the budget for more. So I prefer a good
 alerting system and low-budget-no-guarantee when the standby is down,
 that's my risk evaluation.

I think that's a pretty special case, because the good alerting system
is at least as expensive as another server that just persistently stores
and ACKs incoming WAL.

Why does one ever want the guarantee that sync replication gives to only
hold true up to one failure, if a better guarantee doesn't cost anything
extra? (Note that a good alerting system is impossible to achieve with
only two servers. You need a third device anyway).

Or put another way: a good alerting system is one that understands
Postgres to some extent. It protects you from data loss in *every* case.
If you attach at least two database servers to it, you get availability
as long as any one of the two is up and running. No matter what happened
before, even a full-cluster power outage is guaranteed to be recovered
from automatically, without any data loss.

[ Okay, the standby mode that only stores and ACKs WAL without having a
full database behind still needs to be written. However, pg_streamrecv
certainly goes that direction already, see [1]. ]

Sync replication between really just two servers is asking for trouble
and certainly not worth the savings in hardware cost. Better invest in a
good UPS and redundant power supplies for a single server.

 The question then is, should the new configuration alter running
 transactions?

It should definitely affect all currently running and waiting
transactions. For anything beyond three servers, where quorum_commit
could be bigger than one, it absolutely makes sense to be able to just
lower the requirements temporarily, instead of having to cancel the
guarantee completely.

Regards

Markus Wanner


[1]: Using streaming replication as log archiving, Magnus Hagander
http://archives.postgresql.org/message-id/aanlkti=_bzsyt8a1kjtpwzxnwyygqnvp1nbjwrnsd...@mail.gmail.com



Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Dimitri Fontaine
Markus Wanner mar...@bluegap.ch writes:
 Why does one ever want the guarantee that sync replication gives to only
 hold true up to one failure, if a better guarantee doesn't cost anything
 extra? (Note that a good alerting system is impossible to achieve with
 only two servers. You need a third device anyway).

I think you're all into durability, and that's good. The extra cost is
service downtime if that's not what you're after: there's also
availability, and load-balancing read queries on a system with no lag (no
serving of stale data) when all is working right.

I still think your use case is a solid one, but that we need to be ready
to answer some other ones, which you call relaxed and wrong because of
data-loss risks. My proposal is to make the risk window obvious and the
behavior when you enter it configurable.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support



Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Aidan Van Dyk
On Thu, Oct 7, 2010 at 6:32 AM, Dimitri Fontaine dimi...@2ndquadrant.fr wrote:

 Or if the standby is lagging and the master wal_keep_segments is not
 sized big enough. Is that a catastrophic loss of the standby too?

Sure, but that lagged standby is already asynchronous, not
synchronous.  If it was synchronous, it would have slowed the master
down enough that it would not be lagged.

I'm really confused with all these k < N scenarios I see bandied
about, because all they really amount to is I only want *one*
synchronous replica, and a bunch of asynchronous replicas.
And a bit of chance thrown in the mix to hope the synchronous one is
pretty stable, and the asynchronous ones aren't *too* far behind (define
too and far at your leisure).

And then I see a lot of posturing about how to recover when the
asynchronous standbys aren't synchronous enough at some point...


 Agreed, that's a nice simple use case.

 Another one is to say that I want sync rep when the standby is
 available, but I don't have the budget for more. So I prefer a good
 alerting system and low-budget-no-guarantee when the standby is down,
 that's my risk evaluation.

That screams wrong in my books:

OK, I want durability, so I always want to have 2 copies of the data,
but if we lose one copy, I want to keep on trucking, because I don't
*really* want durability.

If you want most-of-the-time, mostly-2-copy durability, then really
good asynchronous replication is a really good solution.

Yes, I believe you need to have a way for an admin (or
process/control/config) to be able to demote a synchronous
replication scenario into async (or standalone, which is just an
extension of really async).  But it's no longer synchronous replication
at that point.  And if the choice is made to keep trucking while a
new standby is being brought online and available and caught up,
that's fine too.  But during that period, until the slave is caught
up and synchronously replicating, it's *not* synchronous replication.

So I'm not arguing that there shouldn't be a way to turn off
synchronous replication once it's on.  Hopefully without having to
take down the cluster (the pg instance type of cluster).  But I am pleading
that there is a way to set up PG such that synchronous replication *is*
synchronously replicating, or things stop and back up until such a time
as it is.

a.

-- 
Aidan Van Dyk                                             Create like a god,
ai...@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.



Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Dimitri Fontaine
Aidan Van Dyk ai...@highrise.ca writes:
 Sure, but that lagged standby is already asynchronous, not
 synchronous.  If it was synchronous, it would have slowed the master
 down enough that it would not be lagged.

Agreed, except in the case of a joining standby. But you're saying it
better than I do:

 Yes, I believe you need to have a way for an admin (or
 process/control/config) to be able to demote a synchronous
 replication scenario into async (or standalone, which is just an
 extension of really async).  But it's no longer synchronous replication
 at that point.  And if the choice is made to keep trucking while a
 new standby is being brought online and available and caught up,
 that's fine too.  But during that period, until the slave is caught
 up and synchronously replicating, it's *not* synchronous replication.

That's exactly my point. I think we need to handle the case and make it
obvious that this window is a data-loss window where there's no sync rep
ongoing, then offer users a choice of behaviour.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support



Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Aidan Van Dyk
On Thu, Oct 7, 2010 at 10:08 AM, Dimitri Fontaine
dimi...@2ndquadrant.fr wrote:
 Aidan Van Dyk ai...@highrise.ca writes:
  Sure, but that lagged standby is already asynchronous, not
  synchronous.  If it was synchronous, it would have slowed the master
  down enough that it would not be lagged.

 Agreed, except in the case of a joining standby.

*shrug*  The joining standby is still asynchronous at this point.
It's not synchronous replication.  It's just another ^k of the N
slaves serving stale data ;-)

   But you're saying it
 better than I do:

 Yes, I believe you need to have a way for an admin (or
 process/control/config) to be able to demote a synchronous
 replication scenario into async (or standalone, which is just an
 extension of really async).  But it's no longer synchronous replication
 at that point.  And if the choice is made to keep trucking while a
 new standby is being brought online and available and caught up,
 that's fine too.  But during that period, until the slave is caught
 up and synchronously replicating, it's *not* synchronous replication.

 That's exactly my point. I think we need to handle the case and make it
 obvious that this window is a data-loss window where there's no sync rep
 ongoing, then offer users a choice of behaviour.

Again, I'm stating there is *no* choice in synchronous replication.
It's *got* to block, otherwise it's not synchronous replication.  The
choice is if you want synchronous replication or not at that point.

And turning it off might be a good (best) choice for most people.
I just want to make sure that:
1) There's no way to *sensibly* think it's still synchronously replicating
2) There is a way to enforce that the commits happening *are*
synchronously replicating.

a.

-- 
Aidan Van Dyk                                             Create like a god,
ai...@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.



Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Dimitri Fontaine
Aidan Van Dyk ai...@highrise.ca writes:
 *shrug*  The joining standby is still asynchronous at this point.
 It's not synchronous replication.  It's just another ^k of the N
 slaves serving stale data ;-)

Agreed *here*, but if you read the threads again, you'll see that's not
at all what's been talked about before my proposal.

In particular, the questions about how to unlock a master's setup while
its synced standby is doing a base backup should not be allowed to
exist, and you seem to agree with my point.

 That's exactly my point. I think we need to handle the case and make it
 obvious that this window is a data-loss window where there's no sync rep
 ongoing, then offer users a choice of behaviour.

 Again, I'm stating there is *no* choice in synchronous replication.
 It's *got* to block, otherwise it's not synchronous replication.  The
 choice is if you want synchronous replication or not at that point.

Exactly, even if I didn't dare spell it this way.

What I want to propose is for the user to be able to configure things so
that he loses the sync aspect of the replication if it so happens that
the setup is not able to provide for it.

It may sound strange, but it's needed when all you want is, e.g., a
no-stale-data reporting standby. And it so happens that it's already in
Simon's code, AFAIUI (I have yet to read it).

 And turning it off might be a good (best) choice for most people.
 I just want to make sure that:
 1) There's no way to *sensibly* think it's still synchronously replicating
 2) There is a way to enforce that the commits happening *are*
 synchronously replicating.

We're on the same track. I don't know how to offer your options without
a clear listing of standby states and transitions, which must include
the synchronicity and whether you just lost it or whatnot.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support



Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Greg Smith

Markus Wanner wrote:

I think that's a pretty special case, because the good alerting system
is at least as expensive as another server that just persistently stores
and ACKs incoming WAL.
  


The cost of hardware capable of running a database server is a large 
multiple of what you can build an alerting machine for.  I have two 
systems that are approaching the trash heap just at my house, relative 
to the main work I do, but that are fully capable of running an alerting 
system.  Building a production quality database server requires a more 
significant investment:  high quality disks, ECC RAM, battery-backed 
RAID controller, etc.  Relative to what the hardware in a database 
server costs, what you need to build an alerting system is almost free.  
Oh:  and most businesses that are complicated enough to need a serious 
database server already have them, so they actually cost nothing beyond 
the software setup time to point them toward the databases, too.



Why does one ever want the guarantee that sync replication gives to only
hold true up to one failure, if a better guarantee doesn't cost anything
extra? (Note that a good alerting system is impossible to achieve with
only two servers. You need a third device anyway).
  


I do not disagree with your theory or reasoning.  But as a practical 
matter, I'm afraid the true cost of the better guarantee you're 
suggesting here is additional code complexity that will likely cause 
this feature to miss 9.1 altogether.  As far as I'm concerned, this 
whole diversion into the topic of quorum commit is only consuming 
resources away from targeting something achievable in the time frame of 
a single release.



Sync replication between really just two servers is asking for trouble
and certainly not worth the savings in hardware cost. Better invest in a
good UPS and redundant power supplies for a single server.
  


I wish I could give you the long list of data recovery projects I've 
worked on over the last few years, so you could really appreciate how 
much what you're saying here is exactly the opposite of reality.  You 
cannot make a single server reliable enough to survive all of the 
things that Murphy's Law will inflict upon it, at any price.  For 
most of the businesses I work with who want sync rep, data is not 
considered safe until the second copy is on storage miles away from the 
original, because they know this too.


Personal anecdote I can share:  I used to have an important project 
related to stock trading where I kept my backup system about 50 miles 
away from me.  I was aiming for constant availability, while still being 
able to drive to the other server if needed for disaster recovery.  
Guess what?  Even those two turned out not to be nearly independent 
enough; see http://en.wikipedia.org/wiki/Northeast_Blackout_of_2003 for 
details of how I lost both of those at the same time for days.  Silly 
me, I'd only spread them across two adjacent states with different power 
providers!  Not nearly good enough to avoid a correlated failure.


--
Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Josh Berkus
On 10/7/10 6:41 AM, Aidan Van Dyk wrote:
 I'm really confused with all these k < N scenarios I see bandied
 about, because all it really amounts to is I only want *one*
 synchronous replication, and a bunch of asynchronous replications.
 And a bit of chance thrown in the mix to hope the synchronous one is
 pretty stable and the asynchronous ones aren't *too* far behind (define
 too and far at your leisure).

Effectively, yes.  The difference between k of N synch rep and 1
synch standby + several async standbys is that in k of N, you have a
pool and aren't dependent on having a specific standby be very reliable,
just that any one of them is.

So if you have k = 3 and N = 10, then you can have 10 standbys and only
3 of them need to ack any specific commit for the master to proceed. As
long as (a) you retain at least one of the 3 which ack'd, and (b) you
have some way of determining which standby is the most caught up, data
loss is fairly unlikely; you'd need to lose 4 of the 10, and the wrong
4, to lose data.

The advantage of this for availability over just having k = N = 3 comes
when one of the standbys is responding slowly (due to traffic) or goes
offline unexpectedly due to a hardware failure.  In the k = N = 3 case,
the system halts.  In the k = 3, N = 10 case, you can lose up to 7
standbys without the system going down.
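
A minimal sketch of that k-of-N rule, in Python with invented standby names
(nothing here comes from any proposed patch): the master releases the commit
once the first k acknowledgments arrive, and the acknowledged data survives
as long as at least one of those ackers survives.

    # Sketch of k-of-N quorum commit with invented names; not from any patch.
    def wait_for_quorum(ack_arrival_order, k):
        """Return the standbys whose acknowledgments released the commit."""
        acked = []
        for standby in ack_arrival_order:
            acked.append(standby)
            if len(acked) >= k:
                return set(acked)      # master can report COMMIT now
        raise RuntimeError("quorum never reached; the master keeps waiting")

    def acked_data_lost(failed_servers, ackers):
        """Acknowledged data is lost only if the master and *every* acker failed."""
        return "master" in failed_servers and ackers <= failed_servers

    ackers = wait_for_quorum(["s4", "s9", "s1", "s7", "s2"], k=3)  # N = 10 overall
    print(acked_data_lost({"master", "s4", "s9"}, ackers))         # False: s1 survives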

It's notable that the massively scalable transactional databases
(Dynamo, Cassandra, various telecom databases, etc.) all operate this way.

However, I do consider this advanced functionality and not worth
pursuing until we have the k = 1 case implemented and well-tested.  For
comparison, Cassandra, Hypertable and Riak have been working on their
k < N functionality for a couple of years now and none of them has it stable
*and* fast.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Aidan Van Dyk
On Thu, Oct 7, 2010 at 1:22 PM, Josh Berkus j...@agliodbs.com wrote:

 So if you have k = 3 and N = 10, then you can have 10 standbys and only
 3 of them need to ack any specific commit for the master to proceed. As
 long as (a) you retain at least one of the 3 which ack'd, and (b) you
 have some way of determining which standby is the most caught up, data
 loss is fairly unlikely; you'd need to lose 4 of the 10, and the wrong
 4, to lose data.

 The advantage of this for availability over just having k = N = 3 comes
 when one of the standbys is responding slowly (due to traffic) or goes
 offline unexpectedly due to a hardware failure.  In the k = N = 3 case,
 the system halts.  In the k = 3, N = 10 case, you can lose up to 7
 standbys without the system going down.

Sure, but here is where I might not be following.

If you want synchronous replication because you want query
availability while making sure you're not getting stale queries from
all your slaves, then using your k < N (k = 3 and N = 10) situation is
screwing yourself.

To get non-stale responses, you can only query those k=3 servers.
But you've shot yourself in the foot because you don't know which
3/10 those will be.  The other 7 *are* stale (by definition).  They
talk about picking the caught-up slave when the master fails, but
you actually need to do that for *every query*.

If you say they are pretty close, so by the time you get the query to
them they will be caught up, well then, all you really want is good
async replication; you don't really *need* the synchronous part.

The only case I see a race-to-quorum type of k < N being useful is
if you're just trying to duplicate data everywhere, but not actually
querying any of the replicas.  I can see "all queries go to the
master, but the chances are pretty high that multiple machines are
going to fail, so I want multiple replicas" being useful, but I
*don't* think that's what most people are wanting in their "I want 3
of 10 servers to ack the commit".

The difference between good async and sync is only the *guarantee*.
If you don't need the guarantee, you don't need the synchronous part.

a.


-- 
Aidan Van Dyk                                             Create like a god,
ai...@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Josh Berkus

 If you want synchronous replication because you want query
 availability while making sure you're not getting stale queries from
 all your slaves, then using your k < N (k = 3 and N = 10) situation is
 screwing yourself.

Correct. If that is your reason for synch standby, then you should be
using a k = N configuration.

However, some people are willing to sacrifice consistency for durability
and availability.  We should give them that option (eventually), since
among that triad you can never have more than two.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Markus Wanner
On 10/07/2010 06:41 PM, Greg Smith wrote:
 The cost of hardware capable of running a database server is a large
 multiple of what you can build an alerting machine for.

You realize you don't need lots of disks or RAM for a box that only
ACKs? A box with two SAS disks and a BBU isn't that expensive anymore.

 I do not disagree with your theory or reasoning.  But as a practical
 matter, I'm afraid the true cost of the better guarantee you're
 suggesting here is additional code complexity that will likely cause
 this feature to miss 9.1 altogether.  As far as I'm concerned, this
 whole diversion into the topic of quorum commit is only consuming
 resources away from targeting something achievable in the time frame of
 a single release.

So far I've been under the impression that Simon already has the code
for quorum_commit k = 1.

What I'm opposed to is the timeout feature, which I consider to be
additional code, unneeded complexity and a foot-gun.

 You cannot make a single server reliable enough to survive all of
 the things that Murphy's Law will inflict upon it, at any price.

That's exactly what I'm saying applies to two servers as well. And it's
why a timeout is a bad thing here: the chance that the second node fails
as well is real (and is higher than you think, according to Murphy).

 For
 most of the businesses I work with who want sync rep, data is not
 considered safe until the second copy is on storage miles away from the
 original, because they know this too.

Now, those are the people who really need sync rep, yes. How happy do you
think those businesses would be to find out that Postgres is
cheating on them in case of a network outage, for example? Do they
really value (write!) availability more than data safety?

 Silly
 me, I'd only spread them across two adjacent states with different power
 providers!  Not nearly good enough to avoid a correlated failure.

Thanks for sharing this. I hope you didn't lose data.

Regards

Markus Wanner

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Josh Berkus

 But as a practical matter, I'm afraid the true cost of the better
 guarantee you're suggesting here is additional code complexity that will
 likely cause this feature to miss 9.1 altogether.  As far as I'm
 concerned, this whole diversion into the topic of quorum commit is only
 consuming resources away from targeting something achievable in the time
 frame of a single release.

Yes. My purpose in starting this thread was to show that k > 1 quorum
commit is considerably more complex than the people who have been
bringing it up in other threads seem to think it is.  It is not
achievable for 9.1, and maybe not even for 9.2.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Kevin Grittner
Aidan Van Dyk ai...@highrise.ca wrote:
 
 To get non-stale responses, you can only query those k=3
 servers.  But you've shot yourself in the foot because you don't
 know which 3/10 those will be.  The other 7 *are* stale (by
 definition). They talk about picking the caught up slave when
 the master fails, but you actually need to do that for *every
 query*.
 
With web applications, at least, you often don't care that the data
read is absolutely up-to-date, as long as the point in time doesn't
jump around from one request to the next.  When we have used load
balancing between multiple database servers (which has actually
become unnecessary for us lately because PostgreSQL has gotten so
darned fast!), we have established affinity between a session and
one of the database servers, so that if they became slightly out of
sync, data would not pop in and out of existence arbitrarily.  I
think a reasonable person could combine this technique with a 3 of
10 synchronous replication quorum to get both safe persistence of
data and reasonable performance.
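
A bare-bones Python version of that affinity trick, with invented server
names: pick a standby deterministically from the session id, so consecutive
requests from one session always see the same, internally consistent point
in time.

    import hashlib

    # Invented pool of read-only standbys; the session affinity is the point.
    STANDBYS = ["standby-a", "standby-b", "standby-c"]

    def standby_for_session(session_id):
        """Deterministically pin a web session to one standby."""
        digest = hashlib.sha1(session_id.encode()).hexdigest()
        return STANDBYS[int(digest, 16) % len(STANDBYS)]

    # Every request from the same session lands on the same standby.
    print(standby_for_session("session-42"))
    print(standby_for_session("session-42"))  # same server as the previous call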
 
I can also envision use cases where this would not be desirable.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Robert Haas
On Thu, Oct 7, 2010 at 2:10 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Aidan Van Dyk ai...@highrise.ca wrote:

 To get non-stale responses, you can only query those k=3
 servers.  But you've shot yourself in the foot because you don't
 know which 3/10 those will be.  The other 7 *are* stale (by
 definition). They talk about picking the caught up slave when
 the master fails, but you actually need to do that for *every
 query*.

 With web applications, at least, you often don't care that the data
 read is absolutely up-to-date, as long as the point in time doesn't
 jump around from one request to the next.  When we have used load
 balancing between multiple database servers (which has actually
 become unnecessary for us lately because PostgreSQL has gotten so
 darned fast!), we have established affinity between a session and
 one of the database servers, so that if they became slightly out of
 sync, data would not pop in and out of existence arbitrarily.  I
 think a reasonable person could combine this technique with a 3 of
 10 synchronous replication quorum to get both safe persistence of
 data and reasonable performance.

 I can also envision use cases where this would not be desirable.

Well, keep in mind all updates have to be done on the single master.
That works pretty well for fine-grained replication, but I don't think
it's very good for full-cluster replication.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
 Kevin Grittner kevin.gritt...@wicourts.gov wrote:
 
 With web applications, at least, you often don't care that the
 data read is absolutely up-to-date, as long as the point in time
 doesn't jump around from one request to the next.  When we have
 used load balancing between multiple database servers (which has
 actually become unnecessary for us lately because PostgreSQL has
 gotten so darned fast!), we have established affinity between a
 session and one of the database servers, so that if they became
 slightly out of sync, data would not pop in and out of existence
 arbitrarily.  I think a reasonable person could combine this
 technique with a 3 of 10 synchronous replication quorum to get
 both safe persistence of data and reasonable performance.
 
 I can also envision use cases where this would not be desirable.
 
 Well, keep in mind all updates have to be done on the single
 master.  That works pretty well for fine-grained replication, but
 I don't think it's very good for full-cluster replication.
 
I'm completely failing to understand your point here.  Could you
restate another way?
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Markus Wanner
On 10/07/2010 03:19 PM, Dimitri Fontaine wrote:
 I think you're all into durability, and that's good. The extra cost is
 service downtime

It's just *reduced* availability. That doesn't necessarily mean
downtime, if you combine it cleverly with async replication.

 if that's not what you're after: there's also
 availability and load balancing read queries on a system with no lag (no
 stale data servicing) when all is working right.

All I'm saying is that those use cases are much better served with async
replication. Maybe together with something that warns and takes action
in case the standby's lag gets too big.

Or what kind of customers do you think really need a no-lag solution for
read-only queries? In the LAN case, the lag of async rep is negligible
and in the WAN case the latencies of sync rep are prohibitive.

 My proposal is to make the risk window obvious and the
 behavior when you enter it configurable.

I don't buy that. The risk calculation gets a lot simpler and more
obvious with strict guarantees.

Regards

Markus Wanner

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Markus Wanner
On 10/07/2010 07:44 PM, Aidan Van Dyk wrote:
 The only case I see a race-to-quorum type of k < N being useful is
 if you're just trying to duplicate data everywhere, but not actually
 querying any of the replicas.  I can see "all queries go to the
 master, but the chances are pretty high that multiple machines are
 going to fail, so I want multiple replicas" being useful, but I
 *don't* think that's what most people are wanting in their "I want 3
 of 10 servers to ack the commit".

What else do you think they want it for, if not for protection against
data loss?

(Note that the queries don't need to go to the master exclusively if you
can live with some lag - and I think the vast majority of people can.
The zero data loss guarantee holds true in any case, though).

 The difference between good async and sync is only the *guarantee*.
 If you don't need the guarantee, you don't need the synchronous part.

Here we are exactly on the same page again.

Regards

Markus Wanner

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Robert Haas
On Thu, Oct 7, 2010 at 2:31 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Robert Haas robertmh...@gmail.com wrote:
 Kevin Grittner kevin.gritt...@wicourts.gov wrote:

 With web applications, at least, you often don't care that the
 data read is absolutely up-to-date, as long as the point in time
 doesn't jump around from one request to the next.  When we have
 used load balancing between multiple database servers (which has
 actually become unnecessary for us lately because PostgreSQL has
 gotten so darned fast!), we have established affinity between a
 session and one of the database servers, so that if they became
 slightly out of sync, data would not pop in and out of existence
 arbitrarily.  I think a reasonable person could combine this
 technique with a 3 of 10 synchronous replication quorum to get
 both safe persistence of data and reasonable performance.

 I can also envision use cases where this would not be desirable.

 Well, keep in mind all updates have to be done on the single
 master.  That works pretty well for fine-grained replication, but
 I don't think it's very good for full-cluster replication.

 I'm completely failing to understand your point here.  Could you
 restate another way?

Establishing an affinity between a session and one of the database
servers will only help if the traffic is strictly read-only.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Dimitri Fontaine
Markus Wanner mar...@bluegap.ch writes:
 I don't buy that. The risk calculation gets a lot simpler and obvious
 with strict guarantees.

Ok, I'm lost in the use cases and analysis.


I still don't understand why you want to consider the system already
synchronous when it's not, whatever the guarantee you're asking for.

All I'm saying is that we should be able to know and show what the
current system is up to, and we should be able to offer sane reactions
in case of errors.

You're calling it a sane reaction to block the master entirely when the
standby ain't ready yet (it's still at the base backup state), and I can
live with that. As an option.

I say that either we go the lax quorum route, or we have to care about the
details and summarize the failure cases with precision, and the possible
responses with care. I don't see how that is possible without a clear state
for each element in the system, its transitions, and a way to derive the
global state of the distributed system from that.

It might be that the simpler way to go here is what Greg Smith has been
proposing for a long time already, and again quite recently on this
thread: have all the information you need in a system table and offer to
run a user defined function to determine the state of the system.


I think we managed to show what Josh Berkus wanted to know now: it's a
quagmire here. Now, the problem I have is not quorum commit but the very
definition of synchronous replication and the system we're trying to
build. I'm not sure any two of us want the same thing here.


Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
 
 Establishing an affinity between a session and one of the database
 servers will only help if the traffic is strictly read-only.
 
Thanks; I now see your point.
 
In our environment, that's pretty common.  Our most heavily used web
app (the one for which we have, at times, needed load balancing)
connects to the database with a read-only login.  Many of our web
apps do their writing by posting to queues which are handled at the
appropriate source database later.  (I had the opportunity to use
one of these for real last night, to fill in a juror questionnaire
after receiving a summons from the jury clerk in the county where I
live.)
 
Like I said, there are sane cases for this usage, but it won't fit
everybody.  I have no idea on percentages.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Simon Riggs
On Thu, 2010-10-07 at 13:44 -0400, Aidan Van Dyk wrote:

 To get non-stale responses, you can only query those k=3 servers.
 But you've shot yourself in the foot because you don't know which
 3/10 those will be.  The other 7 *are* stale (by definition).  They
 talk about picking the caught up slave when the master fails, but
 you actually need to do that for *every query*.

There is a big confusion around that point and I need to point out that
statement isn't accurate. It's taken me a long while to understand this.

Asking for k > 1 does *not* mean those servers are time-synchronised.
All it means is that the master will stop waiting after 3
acknowledgements. There is no connection between the master receiving
acknowledgements and the standby applying changes received from master;
the standbys are all independent of one another.

In a bad case, those 3 acknowledgements might happen say 5 seconds apart
on the worst and best of the 3 servers. So the first standby to receive
the data could have applied the changes ~4.8 seconds prior to the 3rd
standby. There is still a chance of reading stale data on one standby,
but reading fresh data on another server. In most cases the time window
is small, but still exists.

The other 7 are stale with respect to the first 3. But then so are the
last 9 compared with the first one. The value of k has nothing
whatsoever to do with the time difference between the master and the
last standby to receive/apply the changes. The gap between first and
last standby (i.e. N, not k) is the time window during which a query
might/might not see a particular committed result.

So standbys are eventually consistent whether or not the master relies
on them to provide an acknowledgement. The only place where you can
guarantee non-stale data is on the master.

High values of k reduce the possibility of data loss, whereas expected
cluster availability is reduced as N - k gets smaller.
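
A small numeric illustration of that point, with invented timings: the commit
wait is set by the k-th acknowledgment to arrive, while the window during
which standbys can disagree is set by the spread across all N of them,
independently of k.

    # Invented timings: when the master receives each standby's ack, and when
    # that standby actually applies the change; the two are independent.
    ack_times   = [0.2, 0.5, 5.0, 0.9, 3.1, 0.3, 2.2, 4.8, 1.1, 0.7]  # seconds
    apply_times = [0.6, 0.9, 5.4, 1.3, 3.5, 0.8, 2.6, 5.2, 1.5, 1.1]
    k = 3

    commit_wait = sorted(ack_times)[k - 1]                     # master stops waiting here
    disagreement_window = max(apply_times) - min(apply_times)  # governed by N, not by k

    print("master reports COMMIT after %.1fs" % commit_wait)
    print("standbys can return different answers for %.1fs" % disagreement_window)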

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Simon Riggs
On Thu, 2010-10-07 at 19:50 +0200, Markus Wanner wrote:

 So far I've been under the impression that Simon already has the code
 for quorum_commit k = 1.

I do, but it's not a parameter. The k = 1 behaviour is hardcoded and
considerably simplifies the design. Moving to k > 1 is additional work,
slows things down and seems likely to be fragile.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Josh Berkus
All,

 Establishing an affinity between a session and one of the database
 servers will only help if the traffic is strictly read-only.

I think this thread has drifted very far away from anything we're going
to do for 9.1.  And seems to have little to do with synchronous replication.

Synch rep ensures durability.  It is not, by itself, a method of
ensuring consistency, nor does it pretend to be one.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Greg Smith

Markus Wanner wrote:

So far I've been under the impression that Simon already has the code
for quorum_commit k = 1.

What I'm opposing to is the timeout feature, which I consider to be
additional code, unneeded complexity and foot-gun.
  


Additional code?  Yes.  Foot-gun?  Yes.  Timeout should be disabled by 
default so that you get wait forever unless you ask for something 
different?  Probably.  Unneeded?  This is where we don't agree anymore.  
The example that Josh Berkus just sent to the list is a typical example 
of what I expect people to do here.  They'll use Sync Rep to maximize 
the odds a system failure doesn't cause any transaction loss.  They'll 
use good quality hardware on the master so it's unlikely to fail.  But 
when the database finds the standby unreachable, and it's left with the 
choice between either degrading into async rep or coming to a complete 
halt, you must give people the option of choosing to degrade instead 
after a timeout.  Let them set off the red flashing lights, sound the 
alarms, and pray the master doesn't go down until you can fix the 
problem.  But the choice to allow uptime concerns to win over the normal 
sync rep preferences, that's a completely valid business decision people 
will absolutely want to make in a way opposite of your personal 
preference here.


I don't see this as needing any implementation any more complicated than 
the usual way such timeouts are handled.  Note how long you've been 
trying to reach the standby.  Default to -1 for forever.  And if you hit 
the timeout, mark the standby as degraded and force them to do a proper 
resync when they disconnect.  Once that's done, then they can re-enter 
sync rep mode again, via the same process a new node would have done so.
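
A rough sketch of that bookkeeping, in Python with invented names (nothing
here is actual server code): -1 means wait forever, hitting the timeout marks
the standby degraded, and a degraded standby only counts as synchronous again
after a full resync.

    import time

    class StandbyStatus:
        """Invented illustration of the timeout-then-degrade rule."""

        def __init__(self, timeout_s=-1):      # -1 => wait forever
            self.timeout_s = timeout_s
            self.waiting_since = None
            self.degraded = False

        def ack_received(self):
            self.waiting_since = None          # contact re-established

        def counts_as_sync(self, now=None):
            now = time.monotonic() if now is None else now
            if self.waiting_since is None:
                self.waiting_since = now       # start the clock
            timed_out = (self.timeout_s >= 0 and
                         now - self.waiting_since >= self.timeout_s)
            if timed_out:
                self.degraded = True           # stop waiting on this standby
            return not self.degraded

        def resync_completed(self):
            self.degraded = False              # re-enters sync rep like a new node

    s = StandbyStatus(timeout_s=30)
    print(s.counts_as_sync(now=0.0))    # True: still within the timeout
    print(s.counts_as_sync(now=31.0))   # False: degraded until it resyncs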


--
Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
Author, PostgreSQL 9.0 High Performance    Pre-ordering at:
https://www.packtpub.com/postgresql-9-0-high-performance/book


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Fujii Masao
On Wed, Oct 6, 2010 at 6:11 PM, Markus Wanner mar...@bluegap.ch wrote:
 Yeah, sounds more likely. Then I'm surprised that I didn't find any
 warning that the Protocol C definitely reduces availability (with the
 ko-count=0 default, that is).

Really? I don't think that ko-count=0 means wait-forever. IIRC,
when I tried DRBD, I could write data to the master's DRBD disk without
a connected standby. So I think that by default the master waits for
a timeout and then works alone when the standby goes down.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Fujii Masao
On Wed, Oct 6, 2010 at 6:00 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 In general, salvaging the WAL that was not sent to the standby yet is
 outright impossible. You can't achieve zero data loss with asynchronous
 replication at all.

No. That depends on the type of failure. Unless the disk in the master has
been corrupted, we might be able to salvage WAL.

 If we want only no data loss, we have only to implement the wait-forever
 option. But if we make consideration for the above-mentioned availability,
 the return-immediately option also would be required.

 In some (many, I think) cases, I think that we need to consider
 availability
 and no data loss together, and consider the balance of them.

 If you need both, you need three servers as Simon pointed out earlier. There
 is no way around that.

No. That depends on how far you'd like to go to ensure no data loss.

People who use a shared-disk failover solution with one master and one standby
don't need such high durability. They can avoid data loss to a certain extent
by using something like RAID. So it's not a problem for them to run the master
alone after a failover happens or the standby goes down. But something like RAID
cannot increase availability. Synchronous replication is the solution for that
purpose.

Of course, if we are worried about running the master alone, we can increase
the number of standbys. Furthermore, if we'd like to avoid data loss from a
disaster which can destroy all the servers at the same time, we might need to
add more standbys and locate some of them at a remote site.

Please keep in mind that return-immediately (i.e., a small timeout) is useful
for some use cases.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Robert Haas
On Thu, Oct 7, 2010 at 10:24 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Wed, Oct 6, 2010 at 6:00 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 In general, salvaging the WAL that was not sent to the standby yet is
 outright impossible. You can't achieve zero data loss with asynchronous
 replication at all.

 No. That depends on the type of failure. Unless the disk in the master has
 been corrupted, we might be able to salvage WAL.

So I guess another way to say this is that zero data loss is
unachievable, period.  Greg Smith made a flip comment about having
been so silly as to only put his redundant servers in adjacent states
on different power grids, and yet still having an outage due to the
Northeast blackouts.  So what would he have had to do to completely
rule out a correlated failure?

Answer: It can't be done.  If a massive asteroid comes zooming into
the inner solar system tomorrow and hits the earth, obliterating all
life, you're toast.  Or likewise if nuclear war ensues.  You could put
your redundant server on the moon or, better yet, on a moon of one of
the outer planets, but the hosting costs are pretty high and the ping
times suck.

So the point is that the question is not whether or not a correlated
failure can happen, but whether you can imagine a scenario where a
correlated failure has occurred yet you still wish you had your data.
Different people will, obviously, draw that line in different places.
Let's start by doing something simple that covers SOME of the cases
people want, get it committed, and then move on from there.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Fujii Masao
On Wed, Oct 6, 2010 at 9:22 PM, Dimitri Fontaine dimi...@2ndquadrant.fr wrote:
 From my experience operating londiste, those states would be:

  1. base-backup  — self explaining
  2. catch-up     — getting the WAL to catch up after base backup
  3. wanna-sync   — don't yet have all the WAL to get in sync
  4. do-sync      — all WALs are there, coming soon
  5. ok (async | recv | fsync | reply — feedback loop engaged)

I agree with managing these standby states, though from a different standpoint.

To avoid data loss, at failover we must not promote a standby which is
still halfway through catching up with the master. If the clusterware
can get the current standby state via SQL, it can check whether the
failover would cause data loss and give up on the failover before
creating the trigger file.
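
A toy guard along those lines, reusing the standby states from the list
quoted above; how the clusterware fetches the state (some SQL-visible view
or function) is assumed here, not an existing interface.

    # Toy failover guard over the standby states listed above.
    SAFE_TO_PROMOTE   = {"ok"}     # feedback loop engaged, standby fully in sync
    STILL_CATCHING_UP = {"base-backup", "catch-up", "wanna-sync", "do-sync"}

    def allow_failover(candidate_state):
        """Refuse to create the trigger file if promotion could lose data."""
        if candidate_state in SAFE_TO_PROMOTE:
            return True
        if candidate_state in STILL_CATCHING_UP:
            return False   # candidate has not yet received all acknowledged WAL
        raise ValueError("unknown standby state: %s" % candidate_state)

    print(allow_failover("ok"))        # True: go ahead and promote
    print(allow_failover("catch-up"))  # False: give up before the trigger file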

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Fujii Masao
On Thu, Oct 7, 2010 at 5:01 AM, Simon Riggs si...@2ndquadrant.com wrote:
 You seem willing to trade anything for that guarantee. I seek a more
 pragmatic approach that balances availability and risk.

 Those views are different, but not inconsistent. Oracle manages to offer
 multiple options and so can we.

+1

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Fujii Masao
On Thu, Oct 7, 2010 at 3:01 AM, Markus Wanner mar...@bluegap.ch wrote:
 Of course, it doesn't make sense to wait-forever on *every* standby that
 ever gets added. Quorum commit is required, yes (and that's what this
 thread is about, IIRC). But with quorum commit, adding a standby only
 improves availability, but certainly doesn't block the master in any
 way.

But, even with quorum commit, if you choose the wait-forever option,
failover would decrease availability. Right after the failover,
no standby has connected to the new master yet, so if quorum = 1, all
the transactions must wait for a while.

Basically we need to take a base backup from the new master to start
the standbys and make them connect to the new master. This might take
a long time. Since transaction commits cannot advance during that time,
availability would go down.

Or do you think that the wait-forever option should be applied only when
the standby goes down?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Joshua D. Drake
On Thu, 2010-10-07 at 19:44 -0400, Greg Smith wrote:

 I don't see this as needing any implementation any more complicated than 
 the usual way such timeouts are handled.  Note how long you've been 
 trying to reach the standby.  Default to -1 for forever.  And if you hit 
 the timeout, mark the standby as degraded and force them to do a proper 
 resync when they disconnect.  Once that's done, then they can re-enter 
 sync rep mode again, via the same process a new node would have done so.

What I don't understand is why this isn't obvious to everyone. Greg, this
is very well put, and the -hackers need to start thinking like people
who actually use the database.

JD
-- 
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579
Consulting, Training, Support, Custom Development, Engineering
http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Issues with Quorum Commit

2010-10-07 Thread Fujii Masao
On Fri, Oct 8, 2010 at 8:44 AM, Greg Smith g...@2ndquadrant.com wrote:
 Additional code?  Yes.  Foot-gun?  Yes.  Timeout should be disabled by
 default so that you get wait forever unless you ask for something different?
  Probably.  Unneeded?  This is where we don't agree anymore.  The example
 that Josh Berkus just sent to the list is a typical example of what I expect
 people to do here.  They'll use Sync Rep to maximize the odds a system
 failure doesn't cause any transaction loss.  They'll use good quality
 hardware on the master so it's unlikely to fail.  But when the database
 finds the standby unreachable, and it's left with the choice between either
 degrading into async rep or coming to a complete halt, you must give people
 the option of choosing to degrade instead after a timeout.  Let them set off
 the red flashing lights, sound the alarms, and pray the master doesn't go
 down until you can fix the problem.  But the choice to allow uptime concerns
 to win over the normal sync rep preferences, that's a completely valid
 business decision people will absolutely want to make in a way opposite of
 your personal preference here.

Definitely agreed.

 I don't see this as needing any implementation any more complicated than the
 usual way such timeouts are handled.  Note how long you've been trying to
 reach the standby.  Default to -1 for forever.  And if you hit the timeout,
 mark the standby as degraded and force them to do a proper resync when they
 disconnect.  Once that's done, then they can re-enter sync rep mode again,
 via the same process a new node would have done so.

Fair enough.

One question is when this timeout is applied. Obviously it should be applied
when the standby goes down. But should the timeout also be applied when we
initially start the master, and when no standby has connected to the new
master yet after a failover?

I guess that people who want wait-forever would want to use timeout = -1
for all those cases. Otherwise they cannot ensure zero data loss.

OTOH, people who don't want wait-forever would not want to wait for the
timeout in the latter two cases. So ISTM that a parameter like
enable_wait_forever or reaction_after_timeout is required separately from
the timeout.
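
A tiny sketch of that decision table; the parameter names below are exactly
as hypothetical as they are in the paragraph above.

    # Hypothetical decision table; the knob names are illustrative only.
    def master_blocks(case, timeout, wait_at_startup_and_failover=False):
        """case: 'standby_down', 'initial_startup', or 'after_failover'."""
        if timeout == -1:
            return "waits forever"                    # the strict no-data-loss choice
        if case == "standby_down":
            return "waits %ds, then degrades to async" % timeout
        if wait_at_startup_and_failover:
            return "waits %ds for a standby to connect" % timeout
        return "proceeds immediately"

    for case in ("standby_down", "initial_startup", "after_failover"):
        print(case, "->", master_blocks(case, timeout=30))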

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Issues with Quorum Commit

2010-10-06 Thread Heikki Linnakangas

On 06.10.2010 01:14, Josh Berkus wrote:

Last I checked, our goal with synch standby was to increase availablity,
not decrease it.


No. Synchronous replication does not help with availability. It allows 
you to achieve zero data loss, ie. if the master dies, you are 
guaranteed that any transaction that was acknowledged as committed, is 
still committed.


The other use case is keeping a hot standby server (or servers) 
up-to-date, so that you can run queries against it and you are 
guaranteed to get the same results you would if you ran the query in the 
master.


Those are the two reasonable use cases I've seen. Anything else that has 
been discussed is some sort of a combination of those two, or something 
that doesn't make much sense when you scratch the surface and start 
looking at the failure modes.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Issues with Quorum Commit

2010-10-06 Thread Heikki Linnakangas

On 06.10.2010 01:14, Josh Berkus wrote:

You start a new one from the latest base backup and let it catch up?
Possibly modifying the config file in the master to let it know about
the new standby, if we go down that path. This part doesn't seem
particularly hard to me.


Agreed, not sure of the issue there.


See previous post.  The critical phrase is *without restarting the
master*.  AFAICT, no patch has addressed the need to change the master's
synch configuration without restarting it.  It's possible that I'm not
following something, in which case I'd love to have it pointed out.


Fair enough. I agree it's important that the configuration can be 
changed on the fly. It's orthogonal to the other things discussed, so 
let's just assume for now that we'll have that. If not in the first 
version, it can be added afterwards. pg_ctl reload is probably how it 
will be done.


There is some interesting behavioral questions there on what happens 
when the configuration is changed. Like if you first define that 3 out 
of 5 servers must acknowledge, and you have an in-progress commit that 
has received 2 acks already. If you then change the config to 2 out of 
4 servers must acknowledge, is the in-progress commit now satisfied? 
From the admin point of view, the server that was removed from the 
system might've been one that had acknowledged already, and logically in 
the new configuration the transaction has only received 1 acknowledgment 
from those servers that are still part of the system. Explicitly naming 
the standbys in the config file would solve that particular corner case, 
but it would no doubt introduce other similar ones.
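
A short illustration of that corner case, with made-up standby names: whether
the in-progress commit now counts as satisfied depends on whether the acks
already received are re-counted against the new standby set.

    # Made-up names illustrating the reconfiguration corner case above.
    acked = {"s1", "s2"}                                  # acks received mid-commit
    new_required, new_pool = 2, {"s1", "s3", "s4", "s5"}  # s2 was dropped from the config

    satisfied_by_raw_count  = len(acked) >= new_required             # True: 2 >= 2
    satisfied_after_recount = len(acked & new_pool) >= new_required  # False: only s1 counts

    print(satisfied_by_raw_count, satisfied_after_recount)  # the two readings disagree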


But it's an orthogonal issue, we'll figure it out when we get there.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Issues with Quorum Commit

2010-10-06 Thread Markus Wanner
On 10/06/2010 04:31 AM, Simon Riggs wrote:
 That situation would require two things
 * First, you have set up async replication and you're not monitoring it
 properly. Shame on you.

The way I read it, Jeff is complaining about the timeout you propose,
which effectively turns sync into async replication in case of a failure.

With a master that waits forever, the standby that's newly required for
quorum certainly still needs its time to catch up. But it wouldn't live
in danger of being optimized away for availability in case it cannot
catch up within the given timeout. It's a tradeoff between availability
and durability.

 So it can occur in both cases, though it now looks to me that it's less
 important an issue in either case. So I think this doesn't rate the term
 dangerous to describe it any longer.

The proposed timeout certainly still sounds dangerous to me. I'd rather
recommend setting it to an incredibly huge value to minimize its dangers
and get sync replication when that is what has been asked for. Use async
replication for increased availability.

Or do you envision any use case that requires a quorum of X standbys
for normal operation but is just fine with only zero to (X-1) standbys
in case of failures? IMO that's when sync replication is most needed and
when it absolutely should hold to its promises - even if that means
stopping the system.

There's no point in continuing operation if you cannot guarantee the
minimum requirements for durability. If you happen to want such a thing,
you had better rethink your minimum requirement (as performance for
normal operations might benefit from a lower minimum as well).

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Issues with Quorum Commit

2010-10-06 Thread Fujii Masao
On Wed, Oct 6, 2010 at 10:52 AM, Jeff Davis pg...@j-davis.com wrote:
 I'm not sure I entirely understand. I was concerned about the case of a
 standby server being allowed to lag behind the rest by a large number of
 WAL records. That can't happen in the wait for all servers to apply
 case, because the system would become unavailable rather than allow a
 significant difference in the amount of WAL applied.

 I'm not saying that an unavailable system is good, but I don't see how
 my particular complaint applies to the wait for all servers to apply
 case.

 The case I was worried about is:
  * 1 master and 2 standby
  * The rule is wait for at least one standby to apply the WAL

 In your notation, I believe that's M - { S1, S2 }

 In that case, if one S1 is just a little faster than S2, then S2 might
 build up a significant queue of unapplied WAL. Then, when S1 goes down,
 there's no way for the slower one to acknowledge a new transaction
 without playing through all of the unapplied WAL.

 Intuitively, the administrator would think that he was getting both HA
 and redundancy, but in reality the availability is no better than if
 there were only two servers (M - S1), except that it might be faster to
 replay the WAL then to set up a new standby (but that's not guaranteed).

Agreed. This is similar to my previous complaint.
http://archives.postgresql.org/pgsql-hackers/2010-09/msg00946.php

This problem would happen even if we fix the quorum to 1 as Josh proposes.
To avoid this, the master must wait for an ACK from all the connected
synchronous standbys.

I think that this is especially likely to happen when we choose the 'apply'
replication level, because that level can easily make a synchronous
standby lag due to conflicts between recovery and read-only queries.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Issues with Quorum Commit

2010-10-06 Thread Markus Wanner
On 10/06/2010 08:31 AM, Heikki Linnakangas wrote:
 On 06.10.2010 01:14, Josh Berkus wrote:
 Last I checked, our goal with synch standby was to increase availablity,
 not decrease it.
 
 No. Synchronous replication does not help with availability. It allows
 you to achieve zero data loss, ie. if the master dies, you are
 guaranteed that any transaction that was acknowledged as committed, is
 still committed.

Strictly speaking, it even reduces availability. Which is why nobody
actually wants *only* synchronous replication. Instead they use quorum
commit or semi-synchronous (shudder) replication, which only requires
*some* nodes to be in sync, but effectively replicates asynchronously to
the others.

From that point of view, the requirement of having one synch and two
async standbys is pretty much the same as having three synch standbys
with a quorum commit of 1. (Except for the additional availability of the
latter variant, because in case of a failure of the one synch standby, any
of the others can take over without admin intervention.)

Regards

Markus Wanner

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


  1   2   >