Re: [HACKERS] Synchronization levels in SR

2010-09-08 Thread Fujii Masao
On Tue, Sep 7, 2010 at 6:02 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Mon, 2010-09-06 at 22:32 +0200, Boszormenyi Zoltan wrote:
 (in commit)
 write wal record
 release locks/etc   xact2 can proceed from here
 wait for sync ack

 In the first case, the contention is obviously increased.
 With this, we are creating more idle time in the server
 instead of letting other transactions do their jobs as soon
 as possible. The second method was implemented in my
 patch. Are there any drawbacks with this?

 Then I respectfully suggest that you're releasing locks too early.

 Your proposal would allow a 2nd user to see the results of the 1st
 user's transaction before the 1st user knew about whether it had
 committed or not.

 I know why you want that, but I don't think its right.

Agreed. That's why I put the wait before ProcArrayEndTransaction()
is called.
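
For readers following the ordering at issue, here is a rough, self-contained
sketch of the two variants being compared. It is not PostgreSQL source; every
function name below is made up, and the real commit path does far more than
print messages:

    #include <stdio.h>

    static void write_and_flush_commit_record(void) { puts("commit record flushed to local WAL"); }
    static void wait_for_standby_ack(void)          { puts("standby acknowledged the commit"); }
    static void proc_array_end_transaction(void)    { puts("xact no longer counted as in progress"); }
    static void release_locks(void)                 { puts("locks released"); }

    int main(void)
    {
        /* Variant described above: the wait sits before the xact stops being
           "in progress", so nobody can see its effects, or take over its
           locks, until the standby has acked. */
        write_and_flush_commit_record();
        wait_for_standby_ack();
        proc_array_end_transaction();
        release_locks();

        /* The alternative (release first, wait last) would move the wait to
           the very end, letting waiters proceed before the ack arrives. */
        return 0;
    }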

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Synchronization levels in SR

2010-09-08 Thread Boszormenyi Zoltan
Fujii Masao wrote:
 On Tue, Sep 7, 2010 at 6:02 AM, Simon Riggs si...@2ndquadrant.com wrote:
   
 On Mon, 2010-09-06 at 22:32 +0200, Boszormenyi Zoltan wrote:
 
 (in commit)
 write wal record
 release locks/etc   xact2 can proceed from here
 wait for sync ack

 In the first case, the contention is obviously increased.
 With this, we are creating more idle time in the server
 instead of letting other transactions do their jobs as soon
 as possible. The second method was implemented in my
 patch. Are there any drawbacks with this?
   
 Then I respectfully suggest that you're releasing locks too early.

 Your proposal would allow a 2nd user to see the results of the 1st
 user's transaction before the 1st user knew about whether it had
 committed or not.

 I know why you want that, but I don't think its right.
 

 Agreed. That's why I put the wait before ProcArrayEndTransaction()
 is called.
   

Then there is no point in implementing individual sync/async
replicated transactions, period. An async replicated transaction
that waits for a sync replicated transaction because of locks
will become implicitly sync. It just waits for another transaction's
sync ack.

Best regards,
Zoltán Böszörményi

 Regards,

   


-- 
--
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
 http://www.postgresql.at/




Re: [HACKERS] Synchronization levels in SR

2010-09-08 Thread Fujii Masao
On Wed, Sep 8, 2010 at 7:04 PM, Boszormenyi Zoltan z...@cybertec.at wrote:
 Then there is no use to implement individual sync/async
 replicated transactions, period. An async replicated transaction
 that waits for a sync replicated transaction because of locks
 will become implicitely sync. It just waits for another transactions'
 sync ack.

Hmm... it's the same as with an async transaction (i.e., synchronous_commit = false)
and a sync one (synchronous_commit = true). An async transaction cannot take a
lock held by a sync one until the sync transaction has flushed the WAL.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Synchronization levels in SR

2010-09-08 Thread Boszormenyi Zoltan
Fujii Masao wrote:
 On Wed, Sep 8, 2010 at 7:04 PM, Boszormenyi Zoltan z...@cybertec.at wrote:
   
 Then there is no use to implement individual sync/async
 replicated transactions, period. An async replicated transaction
 that waits for a sync replicated transaction because of locks
 will become implicitely sync. It just waits for another transactions'
 sync ack.
 

 Hmm.. it's the same with async transaction (i.e., synchronous_commit = false)
 and sync one (synchronous_commit = true). Async transaction cannot take the
 lock held by sync one until the sync has flushed the WAL.
   

You are right.

-- 
--
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
 http://www.postgresql.at/




Re: [HACKERS] Synchronization levels in SR

2010-09-08 Thread Robert Haas
On Wed, Sep 8, 2010 at 6:52 AM, Boszormenyi Zoltan z...@cybertec.at wrote:
 Fujii Masao írta:
 On Wed, Sep 8, 2010 at 7:04 PM, Boszormenyi Zoltan z...@cybertec.at wrote:

 Then there is no use to implement individual sync/async
 replicated transactions, period. An async replicated transaction
 that waits for a sync replicated transaction because of locks
 will become implicitely sync. It just waits for another transactions'
 sync ack.


 Hmm.. it's the same with async transaction (i.e., synchronous_commit = false)
 and sync one (synchronous_commit = true). Async transaction cannot take the
 lock held by sync one until the sync has flushed the WAL.


 You are right.

I still don't see why it matters whether you wait before or after
releasing locks.  As soon as the transaction is marked committed in
CLOG, other transactions can potentially see its effects.  Holding on
to all the locks might mitigate that somewhat, but it's not going to
eliminate the problem.  And in any event, there is ALWAYS a window of
time during which the client doesn't know the transaction has
committed but other transactions can potentially see its effects.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] Synchronization levels in SR

2010-09-08 Thread Fujii Masao
On Wed, Sep 8, 2010 at 8:43 PM, Robert Haas robertmh...@gmail.com wrote:
 I still don't see why it matters whether you wait before or after
 releasing locks.  As soon as the transaction is marked committed in
 CLOG, other transactions can potentially see its effects.

AFAIR, even if CLOG has been updated, other transactions probably cannot
see its effects until the transaction is marked as no longer running in
PGPROC. But if that's not true, I'd make the transaction wait for
replication before the CLOG update.
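
A toy model of the visibility rule being appealed to here (this is not how
PostgreSQL implements snapshots; the struct and function are invented purely
to state the ordering argument):

    #include <stdbool.h>
    #include <stdio.h>

    struct xact_state {
        bool clog_says_committed;   /* commit record written, CLOG updated       */
        bool still_in_proc_array;   /* ProcArrayEndTransaction() not yet reached */
    };

    static bool effects_visible_to_others(const struct xact_state *x)
    {
        if (x->still_in_proc_array)
            return false;           /* readers still treat it as in progress */
        return x->clog_says_committed;
    }

    int main(void)
    {
        struct xact_state waiting_for_ack = { true, true };
        printf("visible while waiting for the sync ack? %s\n",
               effects_visible_to_others(&waiting_for_ack) ? "yes" : "no");
        return 0;
    }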

 And in any event, there is ALWAYS a window of
 time during which the client doesn't know the transaction has
 committed but other transactions can potentially see its effects.

Yep. The problem here is that synchronous replication is likely to
make the window very big.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Synchronization levels in SR

2010-09-08 Thread Markus Wanner

On 09/08/2010 12:04 PM, Boszormenyi Zoltan wrote:

Then there is no use to implement individual sync/async
replicated transactions, period.


I disagree. Different transactions have different priorities for latency 
vs. failure-resistance.



An async replicated transaction
that waits for a sync replicated transaction because of locks
will become implicitely sync.


Sure. But how often do your transactions wait for another one because of 
locks? What do we have MVCC for?


Regards

Markus Wanner



Re: [HACKERS] Synchronization levels in SR

2010-09-08 Thread Simon Riggs
On Wed, 2010-09-08 at 12:04 +0200, Boszormenyi Zoltan wrote:
 
  I know why you want that, but I don't think its right.
  
 
  Agreed. That's why I put the wait before ProcArrayEndTransaction()
  is called.

 
 Then there is no use to implement individual sync/async
 replicated transactions, period. An async replicated transaction
 that waits for a sync replicated transaction because of locks
 will become implicitely sync. It just waits for another transactions'
 sync ack.

You aren't making any sense. You have made a general observation and
deduced something specific about replication from it. Most transactions
are not blocked by locks, especially in well designed applications, so
the argument is not relevant to replication.

If *any* two transactions wait upon each other then t2 will always wait
until t1 has completed. 

If t1 is slow then any tuning you do on t2 will likely be wasted. If you
are concerned about performance you should first remove the dependency
between t1 and t2. The above observation isn't sufficient to conclude
that tuning of t2 should not happen via the tuning feature Simon has
suggested. It's not sufficient to conclude much, if anything.

As it turns out, in the scenario you outline t2 *would* return faster
because you had marked it as async. But it would wait behind t1, as
you say. So the performance gain will be clear and measurable. Even so,
it would be best to tune the problem (lock contention), not moan that the
tool you're choosing to use (tuning replication) is at fault for
being inappropriate to the problem.

Mixing sync and async transactions is useful and it's a simple matter to
come up with real examples where it would benefit, as well as easily
testable workloads using pgbench. For example, customer table updates
(sync) alongside chat messages (async).

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services




Re: [HACKERS] Synchronization levels in SR

2010-09-08 Thread Robert Haas
On Wed, Sep 8, 2010 at 8:30 AM, Fujii Masao masao.fu...@gmail.com wrote:
 And in any event, there is ALWAYS a window of
 time during which the client doesn't know the transaction has
 committed but other transactions can potentially see its effects.

 Yep. The problem here is that synchronous replication is likely to
 make the window very big.

So what?  If the correctness of your application depends on the
*amount of time* this window lasts, it's already broken.  It seems
like you're arguing that we should artificially increase lock
contention to guard against possible race conditions in user
applications.  That doesn't make any sense to me, so one of us is
confused.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] Synchronization levels in SR

2010-09-08 Thread Fujii Masao
On Wed, Sep 8, 2010 at 10:07 PM, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Sep 8, 2010 at 8:30 AM, Fujii Masao masao.fu...@gmail.com wrote:
 And in any event, there is ALWAYS a window of
 time during which the client doesn't know the transaction has
 committed but other transactions can potentially see its effects.

 Yep. The problem here is that synchronous replication is likely to
 make the window very big.

 So what?  If the correctness of your application depends on the
 *amount of time* this window lasts, it's already broken.  It seems
 like you're arguing that we should artificially increase lock
 contention to guard against possible race conditions in user
 applications.  That doesn't make any sense to me, so one of us is
 confused.

Yep ;) On second thought, the problem here is that the effects of
the transaction marked as committed but still waiting for replication
can disappear after failover.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Synchronization levels in SR

2010-09-08 Thread Robert Haas
On Wed, Sep 8, 2010 at 9:32 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Wed, Sep 8, 2010 at 10:07 PM, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Sep 8, 2010 at 8:30 AM, Fujii Masao masao.fu...@gmail.com wrote:
 And in any event, there is ALWAYS a window of
 time during which the client doesn't know the transaction has
 committed but other transactions can potentially see its effects.

 Yep. The problem here is that synchronous replication is likely to
 make the window very big.

 So what?  If the correctness of your application depends on the
 *amount of time* this window lasts, it's already broken.  It seems
 like you're arguing that we should artificially increase lock
 contention to guard against possible race conditions in user
 applications.  That doesn't make any sense to me, so one of us is
 confused.

 Yep ;) On second thought, the problem here is that the effects of
 the transaction marked as committed but still waiting for replication
 can disappear after failover.

Ah!  I think that's right.  So the scenario we're trying to guard
against is something like this.  A customer makes a withdrawal of money
from an ATM; their bank balance is debited.  The transaction tries to
commit.  After the transaction becomes visible to other backends but
before the WAL reaches the standby, another transaction begins and
reads the customer's balance.  Naturally, they get the new, lower
balance.  Crash, master dead.  Failover.  If another transaction
begins and reads the customer's balance again, it's back to the old
value.  So we have a phantom transaction: it appeared as committed and
then vanished again.

So that means we have to make sure that none of the effects of a
transaction can be seen until WAL is fsync'd on the master AND the
slave has acked.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] Synchronization levels in SR

2010-09-08 Thread Simon Riggs
On Wed, 2010-09-08 at 09:50 -0400, Robert Haas wrote:

 So that means we have to make sure that none of the effects of a
 transaction can be seen until WAL is fsync'd on the master AND the
 slave has acked.

Yes, that's right. And I like your example; one for the docs.

There is a slight complexity there: An application might connect to the
standby and see the changes made by the transaction, even though the
master has not yet been notified, but will be in a moment. I don't see
that as an issue, but it's worth mentioning because it's just the
Byzantine Generals problem.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services




Re: [HACKERS] Synchronization levels in SR

2010-09-08 Thread David Fetter
On Wed, Sep 08, 2010 at 03:22:46PM +0100, Simon Riggs wrote:
 On Wed, 2010-09-08 at 09:50 -0400, Robert Haas wrote:
 
  So that means we have to make sure that none of the effects of a
  transaction can be seen until WAL is fsync'd on the master AND the
  slave has acked.
 
 Yes, that's right. And I like your example; one for the docs.
 
 There is a slight complexity there: An application might connect to
 the standby and see the changes made by the transaction, even though
 the master has not yet been notified, but will be in a moment. I
 don't see that as an issue though, but worth mentioning cos its just
 the Byzantine Generals problem.

For completeness, a reference to the aforementioned Byzantine
Generals:  http://en.wikipedia.org/wiki/Byzantine_fault_tolerance

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: [HACKERS] Synchronization levels in SR

2010-09-08 Thread Robert Haas
On Wed, Sep 8, 2010 at 10:22 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Wed, 2010-09-08 at 09:50 -0400, Robert Haas wrote:

 So that means we have to make sure that none of the effects of a
 transaction can be seen until WAL is fsync'd on the master AND the
 slave has acked.

 Yes, that's right. And I like your example; one for the docs.

 There is a slight complexity there: An application might connect to the
 standby and see the changes made by the transaction, even though the
 master has not yet been notified, but will be in a moment. I don't see
 that as an issue though, but worth mentioning cos its just the
 Byzantine Generals problem.

I think that's OK too, because there's no way we can guarantee that
the transaction becomes visible exactly simultaneously on both nodes.
What we do need to guarantee is that it is known committed on both
nodes before it becomes visible on either, so that even if there is a
crash or failover it can't uncommit itself.  So the order of events
must be:

- fsync WAL on master
- send WAL to slave
- wait for ack from slave
- allow transaction's effects to become visible on master

If the slave is only guaranteeing *receipt* of the WAL rather than
fsync or replay of the WAL, then there is still a possibility of a
disappearing transaction if the master and standby fail simultaneously
AND a failover then occurs.  So don't pick that mode if a disappearing
transaction will result in someone dying or your $20B company going
bankrupt or ...
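
A small sketch of the acknowledgement levels being contrasted in this thread,
with invented names; the strings summarize the guarantee each level would give
under the reasoning above:

    #include <stdio.h>

    enum ack_level {
        ACK_RECEIVED,   /* standby has the WAL in memory */
        ACK_FSYNCED,    /* standby has the WAL on disk   */
        ACK_REPLAYED    /* standby has applied the WAL   */
    };

    static const char *guarantee(enum ack_level lvl)
    {
        switch (lvl) {
        case ACK_RECEIVED:
            return "can still vanish if master and standby fail together and a failover then occurs";
        case ACK_FSYNCED:
            return "survives even a simultaneous crash of both nodes";
        case ACK_REPLAYED:
            return "already queryable on the standby at the moment of failover";
        }
        return "";
    }

    int main(void)
    {
        printf("receipt-only ack: %s\n", guarantee(ACK_RECEIVED));
        printf("fsync ack:        %s\n", guarantee(ACK_FSYNCED));
        printf("replay ack:       %s\n", guarantee(ACK_REPLAYED));
        return 0;
    }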

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] Synchronization levels in SR

2010-09-07 Thread Markus Wanner

Hi,

On 05/27/2010 01:28 PM, Robert Haas wrote:

How do you propose to guarantee that?  ISTM that you have to either
commit locally first, or send the commit to the remote first.  Either
way, the two events won't occur exactly simultaneously.


I'm not getting the point of this discussion. As long as the database 
confirms the commit to the client only *after* having an ack from the 
standby and *after* committing locally, there's no problem.


In any case, a server failure in between the commit request of the 
client and the commit confirmation for the client results in a client 
that can't tell if its transaction committed or not.


So why do we care what to do first internally? Ideally, these two tasks 
should happen concurrently, IMO.


Regards

Markus Wanner



Re: [HACKERS] Synchronization levels in SR

2010-09-07 Thread Robert Haas
On Tue, Sep 7, 2010 at 4:01 AM, Markus Wanner mar...@bluegap.ch wrote:
 In any case, a server failure in between the commit request of the client
 and the commit confirmation for the client results in a client that can't
 tell if its transaction committed or not.

 So why do we care what to do first internally? Ideally, these two tasks
 should happen concurrently, IMO.

Right, definitely.  The trouble is that if they happen concurrently,
and there's a crash, you have to be prepared for the possibility that
either one of the two has completed and the other has not.  In
practice, this means that the master and standby need to compare notes
on the ending WAL location and whichever one is further advanced needs
to stream the intervening records to the other.  This would be an
awesome feature, but it's hard, so for a first version, it makes sense
to commit on the master first and then on the standby after the master
is known done.
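
A sketch of that compare-notes step, using hypothetical names and plain byte
positions; the point is only that whichever node holds the higher end-of-WAL
position would have to ship the missing range to the other before they can
proceed together:

    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t wal_pos;   /* a byte position in the WAL stream */

    static void reconcile(wal_pos master_end, wal_pos standby_end)
    {
        if (master_end > standby_end)
            printf("master streams WAL %llu..%llu to the standby\n",
                   (unsigned long long) standby_end,
                   (unsigned long long) master_end);
        else if (standby_end > master_end)
            printf("standby ships WAL %llu..%llu back to the master\n",
                   (unsigned long long) master_end,
                   (unsigned long long) standby_end);
        else
            printf("already consistent\n");
    }

    int main(void)
    {
        reconcile(5000, 4096);   /* master ahead: the ordinary, already-handled case */
        reconcile(4096, 5000);   /* standby ahead: needs the ship-WAL-back code      */
        return 0;
    }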

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] Synchronization levels in SR

2010-09-07 Thread Markus Wanner

On 09/07/2010 02:16 PM, Robert Haas wrote:

Right, definitely.  The trouble is that if they happen concurrently,
and there's a crash, you have to be prepared for the possibility that
either one of the two has completed and the other is not.


Understood.


In
practice, this means that the master and standby need to compare notes
on the ending WAL location and whichever one is further advanced needs
to stream the intervening records to the other.


Not necessarily, no. Remember that the client didn't get a commit 
confirmation. So reverting might also be a correct solution (i.e. not 
violating the durability constraint).



This would be an
awesome feature, but it's hard, so for a first version, it makes sense
to commit on the master first and then on the standby after the master
is known done.


The obvious downside of that is that latency adds up, instead of just
being the max of the two operations. And that is during normal operation,
while at best it saves an un-confirmed transaction in the failure case.


It might be harder to implement, yes.

Regards

Markus Wanner



Re: [HACKERS] Synchronization levels in SR

2010-09-07 Thread Robert Haas
On Tue, Sep 7, 2010 at 9:45 AM, Markus Wanner mar...@bluegap.ch wrote:
 On 09/07/2010 02:16 PM, Robert Haas wrote:

 Right, definitely.  The trouble is that if they happen concurrently,
 and there's a crash, you have to be prepared for the possibility that
 either one of the two has completed and the other is not.

 Understood.

 In
 practice, this means that the master and standby need to compare notes
 on the ending WAL location and whichever one is further advanced needs
 to stream the intervening records to the other.

 Not necessarily, no. Remember that the client didn't get a commit
 confirmation. So reverting might also be a correct solution (i.e. not
 violating the durability constraint).

In theory, that's true, but if we do that, then there's an even bigger
problem: the slave might have replayed WAL ahead of the master
location; therefore the slave is now corrupt and a new base backup
must be taken.

 This would be an
 awesome feature, but it's hard, so for a first version, it makes sense
 to commit on the master first and then on the standby after the master
 is known done.

 The obvious downside of that is that latency adds up, instead of just being
 the max of the two operations. And that for normal operation. While at best
 it saves an un-confirmed transaction in the failure case.

 It might be harder to implement, yes.

Yeah, I hope we'll get there eventually.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] Synchronization levels in SR

2010-09-07 Thread Markus Wanner

On 09/07/2010 04:15 PM, Robert Haas wrote:

In theory, that's true, but if we do that, then there's an even bigger
problem: the slave might have replayed WAL ahead of the master
location; therefore the slave is now corrupt and a new base backup
must be taken.


The slave isn't corrupt. It would suffice to late abort committed 
transactions the master doesn't know about.


However, I realize that undoing of WAL isn't something that's 
implemented (nor planned). So it's probably easier to forward the master 
in such a case.



Yeah, I hope we'll get there eventually.


Understood. Thanks.

Markus Wanner




Re: [HACKERS] Synchronization levels in SR

2010-09-07 Thread Ron Mayer
Markus Wanner wrote:
 On 09/07/2010 02:16 PM, Robert Haas wrote:
 practice, this means that the master and standby need to compare notes
 on the ending WAL location and whichever one is further advanced needs
 to stream the intervening records to the other.
 
 Not necessarily, no. Remember that the client didn't get a commit
 confirmation. So reverting might also be a correct solution (i.e. not
 violating the durability constraint).

In that situation, wouldn't it be possible that a different client
queried the slave and already saw the result of that transaction
which would later be rolled back?




Re: [HACKERS] Synchronization levels in SR

2010-09-07 Thread Simon Riggs
On Tue, 2010-09-07 at 16:31 +0200, Markus Wanner wrote:
 On 09/07/2010 04:15 PM, Robert Haas wrote:
  In theory, that's true, but if we do that, then there's an even bigger
  problem: the slave might have replayed WAL ahead of the master
  location; therefore the slave is now corrupt and a new base backup
  must be taken.
 
 The slave isn't corrupt. It would suffice to late abort committed 
 transactions the master doesn't know about.

The slave *might* be ahead of the master. And if it is, the case we're
discussing is where the master just crashed and *might* not even be
coming back at all, at least for a while. The standby does differ from
master, but with the master down I don't regard that as a useful
statement.

If we wait for fsync on master and then transfer to standby the times
are additive. If we do them concurrently the response times will be the
maximum response time of fsync/transfer, as Markus observes.

ISTM that most people would be more interested in reducing response
times by ~50% rather than in being exactly correct in an edge case. So
we should be planning that as a robustness option, not saying "it cannot be
done", which seems to be echoing around too much for my liking.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services




Re: [HACKERS] Synchronization levels in SR

2010-09-07 Thread Tom Lane
Markus Wanner mar...@bluegap.ch writes:
 On 09/07/2010 04:15 PM, Robert Haas wrote:
 In theory, that's true, but if we do that, then there's an even bigger
 problem: the slave might have replayed WAL ahead of the master
 location; therefore the slave is now corrupt and a new base backup
 must be taken.

 The slave isn't corrupt. It would suffice to late abort committed 
 transactions the master doesn't know about.

Oh yes it is.  If the slave replays WAL that didn't happen on the
master, it might for instance have heap tuples in TID slots that are
empty on the master, or index pages laid out differently from the
master.  Trying to apply additional WAL from the master will fail badly.

We can *not* allow the slave to replay WAL ahead of what is known
committed to disk on the master.  The only way to make that safe
is the compare-notes-and-ship-WAL-back approach that Robert mentioned.

If you feel that decoupling WAL application is absolutely essential
to have a credible feature, then you'd better bite the bullet and
start working on the ship-WAL-back code.

regards, tom lane



Re: [HACKERS] Synchronization levels in SR

2010-09-07 Thread Simon Riggs
On Tue, 2010-09-07 at 11:17 -0400, Tom Lane wrote:
 Markus Wanner mar...@bluegap.ch writes:
  On 09/07/2010 04:15 PM, Robert Haas wrote:
  In theory, that's true, but if we do that, then there's an even bigger
  problem: the slave might have replayed WAL ahead of the master
  location; therefore the slave is now corrupt and a new base backup
  must be taken.
 
  The slave isn't corrupt. It would suffice to late abort committed 
  transactions the master doesn't know about.
 
 Oh yes it is.  If the slave replays WAL that didn't happen on the
 master, it might for instance have heap tuples in TID slots that are
 empty on the master, or index pages laid out differently from the
 master.  Trying to apply additional WAL from the master will fail badly.
 
 We can *not* allow the slave to replay WAL ahead of what is known
 committed to disk on the master.  The only way to make that safe
 is the compare-notes-and-ship-WAL-back approach that Robert mentioned.
 
 If you feel that decoupling WAL application is absolutely essential
 to have a credible feature, then you'd better bite the bullet and
 start working on the ship-WAL-back code.

Why not just failover? 

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services




Re: [HACKERS] Synchronization levels in SR

2010-09-07 Thread Markus Wanner

On 09/07/2010 04:47 PM, Ron Mayer wrote:

In that situation, wouldn't it be possible that a different client
queried the slave and already saw the result of that transaction
which would later be rolled back?


Good point, yes.

Regards

Markus Wanner



Re: [HACKERS] Synchronization levels in SR

2010-09-07 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On Tue, 2010-09-07 at 11:17 -0400, Tom Lane wrote:
 We can *not* allow the slave to replay WAL ahead of what is known
 committed to disk on the master.  The only way to make that safe
 is the compare-notes-and-ship-WAL-back approach that Robert mentioned.
 
 If you feel that decoupling WAL application is absolutely essential
 to have a credible feature, then you'd better bite the bullet and
 start working on the ship-WAL-back code.

 Why not just failover? 

Guaranteed failover is another large piece we don't have.

regards, tom lane



Re: [HACKERS] Synchronization levels in SR

2010-09-07 Thread Markus Wanner

Hi,

On 09/07/2010 05:17 PM, Tom Lane wrote:

Oh yes it is.  If the slave replays WAL that didn't happen on the
master, it might for instance have heap tuples in TID slots that are
empty on the master, or index pages laid out differently from the
master.  Trying to apply additional WAL from the master will fail badly.


Sure. Reverting to the master's state would be required to be able to 
safely proceed. Granted, that's far from simple.


Robert's argument about read queries on the standby convinced me that
you always need to recover to the node with the newest transactions
applied (i.e. better to advance rather than revert). Making sure the
standby can't ever be ahead of the master node certainly is the simplest
way to guarantee that. At a cost for normal operation, though.


How about a master failure which leads to a fail-over, immediately
followed by a failure of that former standby (and now a master)? The old
master might then be in the very same situation: having WAL applied that
the new master doesn't have. Do we require former masters to fetch a base
backup? How would they know the difference, once they get back up?



We can *not* allow the slave to replay WAL ahead of what is known
committed to disk on the master.  The only way to make that safe
is the compare-notes-and-ship-WAL-back approach that Robert mentioned.


Agreed.

(And it's worth pointing out that this approach has a pretty nasty 
requirement for a full-cluster crash: all nodes that were synchronously 
replicated to need to come back up after such a crash, so as to be able 
to reliably determine which has the newest transaction).



If you feel that decoupling WAL application is absolutely essential
to have a credible feature, then you'd better bite the bullet and
start working on the ship-WAL-back code.


My feeling is that WAL is the wrong format to do replication. But that's
another story.


Regards

Markus Wanner



Re: [HACKERS] Synchronization levels in SR

2010-09-07 Thread Robert Haas
On Tue, Sep 7, 2010 at 11:06 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Tue, 2010-09-07 at 16:31 +0200, Markus Wanner wrote:
 On 09/07/2010 04:15 PM, Robert Haas wrote:
  In theory, that's true, but if we do that, then there's an even bigger
  problem: the slave might have replayed WAL ahead of the master
  location; therefore the slave is now corrupt and a new base backup
  must be taken.

 The slave isn't corrupt. It would suffice to late abort committed
 transactions the master doesn't know about.

 The slave *might* be ahead of the master. And if it is, the case we're
 discussing is where the master just crashed and *might* not even be
 coming back at all, at least for a while. The standby does differ from
 master, but with the master down I don't regard that as a useful
 statement.

 If we wait for fsync on master and then transfer to standby the times
 are additive. If we do them concurrently the response times will be the
 maximum response time of fsync/transfer, as Markus observes.

 ISTM that most people would be more interested in reducing response
 times by ~50% rather than in being exactly correct in an edge case. So
 we should be planning that as a robustness option, not it cannot be
 done, which seems to be echoing around to much for my liking.

People who are more concerned about performance than robustness aren't
going to use sync rep in the first place.  They're going to run it in
async, which will improve performance by FAR more than you'll ever be
able to manage by deciding that you don't care about handling some of
the failure cases correctly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] Synchronization levels in SR

2010-09-07 Thread Markus Wanner

Hi,

On 09/07/2010 06:00 PM, Robert Haas wrote:

People who are more concerned about performance than robustness aren't
going to use sync rep in the first place.


I've been advocating sync (or eager, FWIW) replication for years now. One of
the hardest preconceptions I'm always confronted with is: this must
perform poorly!


Whether or not that's true depends, but my point is: people who need 
that level of robustness certainly care about performance as well. 
Telling them to use async replication instead is not an option. (The 
ability to mix sync and async replication per transaction is one, BTW).



They're going to run it in
async, which will improve performance by FAR more than you'll ever be
able to manage by deciding that you don't care about handling some of
the failure cases correctly.


Running in async and then trying to achieve the required level of 
robustness in the application layer pretty certainly performs worse than 
a good sync replication implementation. Async only wins if you really 
don't care about the loss of transactions in the case of a failure. In 
every other case, robustness is better taken care of by the database 
system itself, IMO.


That being said, I certainly agree to do things step by step. And the 
ability to write to WAL and wait for ack from a standby concurrently can 
(and probably should) be considered an optimization, yes.


Regards

Markus Wanner



Re: [HACKERS] Synchronization levels in SR

2010-09-07 Thread Markus Wanner

On 09/07/2010 05:55 PM, Markus Wanner wrote:

Robert's argument


Sorry, I meant Ron.

Regards

Markus Wanner



Re: [HACKERS] Synchronization levels in SR

2010-09-07 Thread marcin mank
On Tue, Sep 7, 2010 at 5:17 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 We can *not* allow the slave to replay WAL ahead of what is known
 committed to disk on the master.  The only way to make that safe
 is the compare-notes-and-ship-WAL-back approach that Robert mentioned.

 If you feel that decoupling WAL application is absolutely essential
 to have a credible feature, then you'd better bite the bullet and
 start working on the ship-WAL-back code.


In the mode where it is not required that the WAL is applied (only
sent to the slave / synced to slave disk) one alternative is to have a
separate pointer to the last WAL record that can be safely applied on
the slave. Then you can send the un-synced WAL to the slave (while
concurrently syncing it on the master). When both the slave and the
master sync complete, one can give the client a commit notification,
increase the pointer, and send it to the slave (it would be a separate
WAL record type I guess).

In case of master failure, the slave can discard the un-applied WAL
after the pointer.
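
A self-contained sketch of the scheme described above, with all names invented:
the standby syncs WAL eagerly but only ever applies up to a separately advanced
pointer, which the master bumps once both fsyncs have completed:

    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t wal_pos;

    struct standby {
        wal_pos synced_to;      /* WAL durably on the standby's disk     */
        wal_pos safe_to_apply;  /* apply pointer, advanced by the master */
    };

    static void on_both_syncs_complete(struct standby *s, wal_pos commit_end)
    {
        /* client gets its commit notification here, then the pointer
           (a separate WAL record, per the proposal) is sent to the standby */
        s->safe_to_apply = commit_end;
    }

    static void on_master_failure(struct standby *s)
    {
        /* the standby discards WAL it received but was never told to apply */
        s->synced_to = s->safe_to_apply;
        printf("standby keeps WAL up to %llu and discards the rest\n",
               (unsigned long long) s->safe_to_apply);
    }

    int main(void)
    {
        struct standby s = { 4096, 4096 };

        s.synced_to = 5000;               /* a commit's WAL arrives and is synced   */
        on_both_syncs_complete(&s, 5000); /* master's fsync also done: notify, bump */

        s.synced_to = 6000;               /* next commit's WAL is synced ...        */
        on_master_failure(&s);            /* ... but the master dies before the ack */
        return 0;
    }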

Greetings
marcin



Re: [HACKERS] Synchronization levels in SR

2010-09-07 Thread Robert Haas
On Tue, Sep 7, 2010 at 4:06 PM, marcin mank marcin.m...@gmail.com wrote:
 On Tue, Sep 7, 2010 at 5:17 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 We can *not* allow the slave to replay WAL ahead of what is known
 committed to disk on the master.  The only way to make that safe
 is the compare-notes-and-ship-WAL-back approach that Robert mentioned.

 If you feel that decoupling WAL application is absolutely essential
 to have a credible feature, then you'd better bite the bullet and
 start working on the ship-WAL-back code.


 In the mode where it is not required that the WAL is applied (only
 sent to the slave / synced to slave disk) one alternative is to have a
 separate pointer to the last WAL record that can be safely applied on
 the slave. Then You can send the un-synced WAL to the slave (while
 concurrently syncing it on the master). When both the slave an the
 master sync complete, one can give the client a commit notification,
 increase the pointer, and send it to the slave (it would be a separate
 WAL record type I guess).

 In case of master failure, the slave can discard the un-applied WAL
 after the pointer.

But the pointer on the slave has to be fsync'd to make it persistent,
which likely takes roughly the same amount of time as fsync-ing the
WAL itself.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] Synchronization levels in SR

2010-09-06 Thread Dimitri Fontaine
Boszormenyi Zoltan z...@cybertec.at writes:
 Sorry for answering such an old mail, but what is the purpose of
 a transaction level synchronous behaviour if async transactions
 can be held back by a sync transaction?

I don't understand why it would be the case (sync holding back async
transactions) — it's been proposed that walsender could periodically
feed back to the master the current WAL position received, synced and
applied. 

So you can register your sync transaction to wait (and block) until
walsender sees a synced WAL position after your own (including it) and
another transaction can wait until walsender sees a received WAL
position after its own, for example. Of course, meanwhile, any async
transaction would just commit without caring about slaves.

Not implementing it nor thinking about how to implement it, it seems
simple enough :)
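
A minimal sketch of that registration idea, with invented types and names: the
walsender periodically reports positions back, and a committing backend simply
waits until the position matching its chosen level has passed its own commit
record:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t wal_pos;

    struct standby_report {     /* fed back periodically by the walsender */
        wal_pos received;
        wal_pos synced;
        wal_pos applied;
    };

    enum wait_level { WAIT_NONE, WAIT_RECEIVED, WAIT_SYNCED, WAIT_APPLIED };

    static bool commit_may_return(wal_pos my_commit_end, enum wait_level lvl,
                                  const struct standby_report *r)
    {
        switch (lvl) {
        case WAIT_NONE:     return true;                        /* async: never waits */
        case WAIT_RECEIVED: return r->received >= my_commit_end;
        case WAIT_SYNCED:   return r->synced   >= my_commit_end;
        case WAIT_APPLIED:  return r->applied  >= my_commit_end;
        }
        return false;
    }

    int main(void)
    {
        struct standby_report r = { 5000, 4500, 4000 };
        /* a sync transaction whose commit record ends at 4200 may return once
           the synced position has passed it; an async one never has to wait */
        return commit_may_return(4200, WAIT_SYNCED, &r) ? 0 : 1;
    }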

Regards,
-- 
dim



Re: [HACKERS] Synchronization levels in SR

2010-09-06 Thread Boszormenyi Zoltan
Dimitri Fontaine wrote:
 Boszormenyi Zoltan z...@cybertec.at writes:
   
 Sorry for answering such an old mail, but what is the purpose of
 a transaction level synchronous behaviour if async transactions
 can be held back by a sync transaction?
 

 I don't understand why it would be the case (sync holding back async
 transactions) — it's been proposed that walsender could periodically
 feed back to the master the current WAL position received, synced and
 applied. 

 So you can register your sync transaction to wait (and block) until
 walsender sees a synced WAL position after your own (including it) and
 another transaction can wait until walsender sees a received WAL
 position after its own, for example. Of course, meanwhile, any async
 transaction would just commit without caring about slaves.
   

The locks held by a transaction are released after
RecordTransactionCommit(), and waiting for the sync ack
happens in this function. Now what happens when a sync
transaction holds a lock that an async one is waiting for?

 Not implementing it nor thinking about how to implement it, it seems
 simple enough :)

 Regards,
   


-- 
--
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
 http://www.postgresql.at/




Re: [HACKERS] Synchronization levels in SR

2010-09-06 Thread Simon Riggs
On Mon, 2010-09-06 at 21:45 +0200, Boszormenyi Zoltan wrote:
 Dimitri Fontaine írta:
  Boszormenyi Zoltan z...@cybertec.at writes:

  Sorry for answering such an old mail, but what is the purpose of
  a transaction level synchronous behaviour if async transactions
  can be held back by a sync transaction?
  
 
  I don't understand why it would be the case (sync holding back async
  transactions) — it's been proposed that walsender could periodically
  feed back to the master the current WAL position received, synced and
  applied. 
 
  So you can register your sync transaction to wait (and block) until
  walsender sees a synced WAL position after your own (including it) and
  another transaction can wait until walsender sees a received WAL
  position after its own, for example. Of course, meanwhile, any async
  transaction would just commit without caring about slaves.

 
 The locks held by a transaction are released after
 RecordTransactionCommit(), and waiting for the sync ack
 happens in this function. Now what happens when a sync
 transaction hold a lock that an async one is waiting for?

It seems your glass is half-empty. Mine is half-full. My perspective
would be that if there is contention between async and sync transactions
then we will get better throughput than if all transactions were sync.
Though perhaps the main issue in that case would be application lock
contention, not the speed of synchronous replication.

The highest-level issue is that the system has only so many physical
resources. If we are unable to focus our resources onto the things that
matter most then we end up wasting resources. Mixing async and sync
transactions on the same server allows a single application to carefully
balance performance and durability. Exactly as we do with
synchronous_commit.
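
For concreteness, this is the per-transaction control that analogy points at,
using the synchronous_commit GUC that already exists; the table names are
placeholders echoing the customer-updates/chat-messages example elsewhere in
this thread, and a per-transaction replication setting would presumably be
used in the same way:

    #include <libpq-fe.h>

    int main(void)
    {
        /* connection parameters taken from the PG* environment variables */
        PGconn *conn = PQconnectdb("");
        if (PQstatus(conn) != CONNECTION_OK)
            return 1;

        /* important write: keep the default and wait for the local WAL flush */
        PQclear(PQexec(conn,
            "BEGIN;"
            "UPDATE customer SET balance = balance - 100 WHERE id = 1;"
            "COMMIT;"));

        /* low-value write: skip the flush wait for this transaction only */
        PQclear(PQexec(conn,
            "BEGIN;"
            "SET LOCAL synchronous_commit TO off;"
            "INSERT INTO chat_message(body) VALUES ('hi');"
            "COMMIT;"));

        PQfinish(conn);
        return 0;
    }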

By now, people are beginning to see that synchronous replication is
important but has poor performance. Fine-grained control is essential to
using it effectively in areas that matter most.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services




Re: [HACKERS] Synchronization levels in SR

2010-09-06 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On Mon, 2010-09-06 at 21:45 +0200, Boszormenyi Zoltan wrote:
 The locks held by a transaction are released after
 RecordTransactionCommit(), and waiting for the sync ack
 happens in this function. Now what happens when a sync
 transaction hold a lock that an async one is waiting for?

 It seems your glass in half-empty. Mine is half-full.

Simon, you really are failing to advance the conversation.  You claim
that we can have sync plus async transactions without a performance hit,
but you have failed to explain how, at least in any way that anyone
else understands.  Pontificating about how that will be so much better
than not having it doesn't address the problem that others are having
with seeing how to implement it.

regards, tom lane



Re: [HACKERS] Synchronization levels in SR

2010-09-06 Thread Boszormenyi Zoltan
Simon Riggs wrote:
 On Mon, 2010-09-06 at 21:45 +0200, Boszormenyi Zoltan wrote:
   
 Dimitri Fontaine írta:
 
 Boszormenyi Zoltan z...@cybertec.at writes:
   
   
 Sorry for answering such an old mail, but what is the purpose of
 a transaction level synchronous behaviour if async transactions
 can be held back by a sync transaction?
 
 
 I don't understand why it would be the case (sync holding back async
 transactions) — it's been proposed that walsender could periodically
 feed back to the master the current WAL position received, synced and
 applied. 

 So you can register your sync transaction to wait (and block) until
 walsender sees a synced WAL position after your own (including it) and
 another transaction can wait until walsender sees a received WAL
 position after its own, for example. Of course, meanwhile, any async
 transaction would just commit without caring about slaves.
   
   
 The locks held by a transaction are released after
 RecordTransactionCommit(), and waiting for the sync ack
 happens in this function. Now what happens when a sync
 transaction hold a lock that an async one is waiting for?
 

 It seems your glass in half-empty. Mine is half-full.

This is good, we can meet halfway. :-)

  My perspective
 would be that if there is contention between async and sync transactions
 then we will get better throughout than if all transactions were sync.
 Though perhaps the main issue in that case would be application lock
 contention, not the speed of synchronous replication.
   

The difference we are talking about is:

xact1                    xact2
begin
                         begin
lock something
                         lock same

(in commit)
write wal record
wait for sync ack
release locks/etc        xact2 can proceed from here

vs.

xact1                    xact2
begin
                         begin
lock something
                         lock same

(in commit)
write wal record
release locks/etc        xact2 can proceed from here
wait for sync ack

In the first case, the contention is obviously increased.
With this, we are creating more idle time in the server
instead of letting other transactions do their jobs as soon
as possible. The second method was implemented in my
patch. Are there any drawbacks with this?

 The highest level issue is that the system only has so much physical
 resources. If we are unable to focus our resources onto the things that
 matter most then we end up wasting resources. Mixing async and sync
 transactions on the same server allows a single application to carefully
 balance performance and durability. Exactly as we do with
 synchronous_commit.
   

I don't think this is the same situation. With synchronous_commit,
you have an auxiliary process that's handed the job of doing
the syncing. But there's nowhere to hand off the waiting for the
sync ack from the standby.

 By now, people are beginning to see that synchronous replication is
 important but has poor performance. Fine grained control is essential to
 using it effectively in areas that matter most.
   


-- 
--
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
 http://www.postgresql.at/




Re: [HACKERS] Synchronization levels in SR

2010-09-06 Thread Simon Riggs
On Mon, 2010-09-06 at 22:32 +0200, Boszormenyi Zoltan wrote:
 (in commit)
 write wal record
 release locks/etc   xact2 can proceed from here
 wait for sync ack
 
 In the first case, the contention is obviously increased.
 With this, we are creating more idle time in the server
 instead of letting other transactions do their jobs as soon
 as possible. The second method was implemented in my
 patch. Are there any drawbacks with this? 

Then I respectfully suggest that you're releasing locks too early.

Your proposal would allow a 2nd user to see the results of the 1st
user's transaction before the 1st user knew about whether it had
committed or not.

I know why you want that, but I don't think its right.

This has very little, if anything, to do with mixing async/sync
connections. You make it sound like all transactions always wait for
other transactions, which they definitely don't, especially in
reasonably well designed applications.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services




Re: [HACKERS] Synchronization levels in SR

2010-09-06 Thread Greg Stark
On Mon, Sep 6, 2010 at 10:02 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Then I respectfully suggest that you're releasing locks too early.

 Your proposal would allow a 2nd user to see the results of the 1st
 user's transaction before the 1st user knew about whether it had
 committed or not.

 I know why you want that, but I don't think its right.

Well that's always possible. The 1st user might just not wake up
before the 2nd user gets the response back.

The question is what happens if the server crashes and is failed over
to the slave. The 2nd user with the async transaction might have seen
data committed by the 1st user with his sync transaction which was
subsequently lost. Is the user expecting that making his transaction
synchronously replicated guarantees that *nobody* can see this data
unless the transaction is guaranteed to have been replicated, or is he
only expecting it to guarantee that *he* can't see the commit until it
can be trusted to be replicated?

For that matter I'm not entirely sure I understand how the timing
here works at all. If transactions can't be considered to be committed
before they're acknowledged by the replica, what happens if the master
crashes after the WAL is written and then comes back without a
failover? Then the transaction would be immediately visible even though
it still hasn't been replicated yet.

I think there's no way with our current infrastructure to guarantee
that other transactions can't see your data before it's been
replicated, so making any promise otherwise for some cases is only
going to be a lie.

To guarantee synchronous replication doesn't show data until it's been
replicated, we would need some kind of 2-phase commit where we send
the commit record to the slave and wait until the slave has received
it and confirmed it has written it (but it doesn't replay it unless
there's a failover), then write the master's commit record and send the
message to the slave that it's safe to replay those records.
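
A sketch of that two-step commit, with invented function names; the commit
record is confirmed as written (not replayed) on the slave before the master
writes its own commit record, and only afterwards is the slave allowed to
replay it:

    #include <stdbool.h>
    #include <stdio.h>

    static bool standby_confirms_written(const char *commit_record)
    {
        /* the standby writes the record but does not replay it yet */
        printf("standby wrote (not replayed): %s\n", commit_record);
        return true;
    }

    static void standby_may_replay(const char *commit_record)
    {
        printf("standby told it is now safe to replay: %s\n", commit_record);
    }

    int main(void)
    {
        const char *rec = "commit record of xid 1234";

        if (!standby_confirms_written(rec))   /* step 1: ship and wait for the ack */
            return 1;                          /* no ack: don't commit locally      */
        printf("master writes its own commit record\n");   /* step 2 */
        standby_may_replay(rec);               /* step 3: replay and visibility allowed */
        return 0;
    }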

-- 
greg



Re: [HACKERS] Synchronization levels in SR

2010-09-06 Thread Simon Riggs
On Mon, 2010-09-06 at 23:07 +0100, Greg Stark wrote:
 On Mon, Sep 6, 2010 at 10:02 PM, Simon Riggs si...@2ndquadrant.com wrote:
  Then I respectfully suggest that you're releasing locks too early.
 
  Your proposal would allow a 2nd user to see the results of the 1st
  user's transaction before the 1st user knew about whether it had
  committed or not.
 
  I know why you want that, but I don't think its right.
 
 Well that's always possible. The 1st user might just not wake up
 before the 2nd user gets the response back.
 
 The question is what happens if the server crashes and is failed over
 to the slave. The 2nd user with the async transaction might have seen
 data commited by the 1st user with his sync transaction but was
 subsequently lost. Is the user expecting that making his transaction
 synchronously replicated guarantees that *nobody* can see this data
 unless the transaction is guaranteed to have been replicated or is he
 only expecting it to guarantee that *he* can't see the commit until it
 can be trusted to be replicated?
 
 For that matter I'm not entirely clear I understand how the timing
 here works at all. If transactions can't be considered to be committed
 before they're acknowledged by the replica what happens if the master
 crashes after the WAL is written and then comes back without a
 failover. Then the transaction would be immediately visible even if it
 still hasn't been replicated yet.
 
 I think there's no way with our current infrastructure to guarantee
 that other transactions can't see your data before it's been
 replicated, So making any promise otherwise for some cases is only
 going to be a lie.
 
 To guarantee synchronous replication doesn't show data until it's been
 replicated we would have to some kind of 2-phase commit where we send
 the commit record to the slave and wait until the slave has received
 it and confirmed it has written it (but it doesn't replay it unless
 there's a failover) then write the master's commit record and send the
 message to the slave that it's safe to replay those records.

Just to add that this part of the discussion has nothing at all to do
with my proposal for master controlled replication. Zoltan is simply
discussing when the wait should occur with sync replication. I have no
proposal to vary that myself, wherever we eventually decide the wait
should occur.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services




Re: [HACKERS] Synchronization levels in SR

2010-09-06 Thread Simon Riggs
On Mon, 2010-09-06 at 16:11 -0400, Tom Lane wrote:
 Simon Riggs si...@2ndquadrant.com writes:
  On Mon, 2010-09-06 at 21:45 +0200, Boszormenyi Zoltan wrote:
  The locks held by a transaction are released after
  RecordTransactionCommit(), and waiting for the sync ack
  happens in this function. Now what happens when a sync
  transaction hold a lock that an async one is waiting for?
 
  It seems your glass in half-empty. Mine is half-full.
 
 Simon, you really are failing to advance the conversation.  You claim
 that we can have sync plus async transactions without a performance hit,
 but you have failed to explain how, at least in any way that anyone
 else understands.  Pontificating about how that will be so much better
 than not having it doesn't address the problem that others are having
 with seeing how to implement it.

A performance hit from mixing sync and async is unlikely. The overhead
of deciding whether to wait after commit is trivial. At worst, the async
transactions would go at the same speed as the sync transactions,
especially if the application contends with itself, which is by no means
a certainty. If acting independently, the async transactions would
clearly go much faster. So the right question for discussion is "how
much will we gain by mixing async/sync?". Since we had exactly this
situation for synchronous_commit and a similar discussion, I expect a
similar eventual outcome.

The discussion would go better if we had clear performance results
published from existing work and we did not dissuade people from
objective testing. Then you'd probably understand why I think this is so
important to me.

I've explained more than once how my proposal can work and Dimitri at
least appears to have understood with zero off-list conversation. So far
the discussion has been mostly negative and the reasons given haven't
scored high on logic, I'm sorry to say. I will present a code-based
proposal rather than just huge piles of words, to make this a more
concrete discussion.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Synchronization levels in SR

2010-09-06 Thread Robert Haas
On Mon, Sep 6, 2010 at 5:02 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Then I respectfully suggest that you're releasing locks too early.

 Your proposal would allow a 2nd user to see the results of the 1st
 user's transaction before the 1st user knew about whether it had
 committed or not.

Marking the transaction committed in CLOG will have that effect
anyway.  We are not doing strict two-phase locking.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] Synchronization levels in SR

2010-09-03 Thread Boszormenyi Zoltan
Hi,

Dimitri Fontaine wrote:
 Simon Riggs si...@2ndquadrant.com writes:
   
 Seems strange. If you have 2 standbys and you say you would like node1
 to be the preferred candidate, then you load it so heavily that a remote
 server with by-definition much larger network delay responds first, then
 I say your preference was wrong.
 

 There's a communication mismatch here I think. The problem with the
 dynamic aspect of the system is that the admin wants to plan ahead and
 choose in advance the failover server.

 Other than that I much prefer the automatic and dynamic quorum idea.

   
 If you, Jan and Yeb wish to completely exclude standbys from being part
 of any quorum, then I guess we need to have per-standby settings to
 allow that to be defined. I'm in favour of giving people options. That
 needn't be a mandatory per-standby setting, just a non-default option,
 so that we can reduce the complexity of configuration for common
 cases.
 

 +1

   
 Maximum Performance = quorum = 0
 Maximum Availability = quorum = 1, timeout_action = commit
 Maximum Protection = quorum = 1, timeout_action = shutdown
 

 +1

 Being able to say that a given server has not been granted to
 participate into the vote allowing to reach the global durability quorum
 will allow for choosing the failover candidates.

 Now you're able to have this reporting server and know for sure that
 your sync replicated transactions are not waiting for it.

 To summarize, the current per-transaction approach would be :

  - transaction level replication synchronous behaviour
   

Sorry for answering such an old mail, but what is the purpose of
a transaction level synchronous behaviour if async transactions
can be held back by a sync transaction?

In my patch, when the transactions were waiting for ack from
the standby, they had already released all their locks; the wait
happened at the latest possible point in CommitTransaction().

In Fujii's patch (I am looking at synch_rep_0722.patch, is there
a newer one?) the wait happens in RecordTransactionCommit()
so other transactions still see the sync transaction and most
importantly, the locks held by the sync transaction will make
the async  transactions waiting for the same lock wait too.

  - proxy/cascading in core
  - quorum setup for deciding any commit is safe
  - any server can be excluded from the sync quorum
  - timeout can still raises exception or ignore (commit)?

 This last point seems to need some more discussion, or I didn't
 understand well the current positions and proposals.

 Regards,
   

Best regards,
Zoltán Böszörményi




Re: [HACKERS] Synchronization levels in SR

2010-09-03 Thread Fujii Masao
On Fri, Sep 3, 2010 at 6:43 PM, Boszormenyi Zoltan z...@cybertec.at wrote:
 In my patch, when the transactions were waiting for ack from
 the standby, they have already released all their locks, the wait
 happened at the latest possible point in CommitTransaction().

 In Fujii's patch (I am looking at synch_rep_0722.patch, is there
 a newer one?)

No ;)

We'll have to create the patch based on the result of the recent
discussion held on other thread.

 the wait happens in RecordTransactionCommit()
 so other transactions still see the sync transaction and most
 importantly, the locks held by the sync transaction will make
 the async  transactions waiting for the same lock wait too.

The transaction should be invisible to other transactions until
its replication has been completed. So I put the wait before
CommitTransaction() calls ProcArrayEndTransaction(). Is this unsafe?
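
In outline, the two placements being compared look roughly like the following
standalone C sketch. The stub names echo the backend routines discussed above,
but the bodies are just print statements; only the ordering is meant to be
illustrative.

#include <stdio.h>
#include <stdbool.h>

static bool transaction_is_sync = true;     /* hypothetical per-xact setting */

/* Stubs standing in for the real backend steps; they only model ordering. */
static void WriteAndFlushCommitRecord(void) { puts("XLogInsert + XLogFlush of the commit record"); }
static void WaitForStandbyAck(void)         { puts("sleep until the standby's ack arrives"); }
static void MakeXactVisible(void)           { puts("ProcArrayEndTransaction: xact visible to others"); }
static void ReleaseXactLocks(void)          { puts("locks released, waiters may proceed"); }

/* Placement as described for Fujii's patch: wait before the xact is visible. */
static void commit_wait_before_visibility(void)
{
    WriteAndFlushCommitRecord();
    if (transaction_is_sync)
        WaitForStandbyAck();         /* blocks while still holding locks */
    MakeXactVisible();
    ReleaseXactLocks();
}

/* Placement as described for Zoltan's patch: wait after everything is released. */
static void commit_wait_after_release(void)
{
    WriteAndFlushCommitRecord();
    MakeXactVisible();
    ReleaseXactLocks();              /* async xacts are not held back ... */
    if (transaction_is_sync)
        WaitForStandbyAck();         /* ... but others can see an unacked commit */
}

int main(void)
{
    puts("-- wait before visibility --");
    commit_wait_before_visibility();
    puts("-- wait after lock release --");
    commit_wait_after_release();
    return 0;
}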

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Synchronization levels in SR

2010-09-03 Thread Boszormenyi Zoltan
Fujii Masao wrote:
 On Fri, Sep 3, 2010 at 6:43 PM, Boszormenyi Zoltan z...@cybertec.at wrote:
   
 In my patch, when the transactions were waiting for ack from
 the standby, they have already released all their locks, the wait
 happened at the latest possible point in CommitTransaction().

 In Fujii's patch (I am looking at synch_rep_0722.patch, is there
 a newer one?)
 

 No ;)

 We'll have to create the patch based on the result of the recent
 discussion held on other thread.

   
 the wait happens in RecordTransactionCommit()
 so other transactions still see the sync transaction and most
 importantly, the locks held by the sync transaction will make
 the async  transactions waiting for the same lock wait too.
 

 The transaction should be invisible to other transactions until
 its replication has been completed.

Invisible? How can it be invisible? You are in RecordTransactionCommit(),
even before calling ProcArrayEndTransaction(MyProc, latestXid) and
releasing the locks the transaction holds.

  So I put the wait before
 CommitTransaction() calls ProcArrayEndTransaction(). Is this unsafe?
   

I don't know whether it's unsafe. In my patch, I only registered the Xid
at the point where you do WaitXLogSend(); this was the safe point
to set up the waiting for the sync ack. Otherwise, when the Xid registration
for the sync ack was done in CommitTransaction() later than
RecordTransactionCommit(), there was a race between the primary and
the standby. The scenario was that the standby received and processed
the COMMIT of certain Xids even before the backend on the primary
had properly registered its Xid, so the backend set up the wait for the
sync ack only after this Xid had already been acked by the standby. The
result was stuck backends.

My idea to split up the registration for the wait and the waiting itself
would allow for a transaction-level synchronous setting, i.e. in my
patch the transaction released the locks and did all the post-commit
cleanups and *then* waited for the sync ack if needed. In the meantime,
because the locks were already released, other transactions could
progress with their job, allowing e.g. async transactions to
proceed and theoretically finish faster than the sync transaction
that was waiting for the ack.

The solution in my patch was not racy: registration of the Xid
was done before XLogInsert() in RecordTransactionCommit().
If the standby acked the Xid to the primary before reaching the
end of CommitTransaction(), then this backend didn't even need
to wait, because the Xid was found in its PGPROC structure
and the waiting for the sync ack was torn down.

But with the LSNs, you are waiting for XactLastRecEnd,
which is set by XLogInsert(). I don't know if it's safe to call
WaitXLogSend() after XLogFlush() in RecordTransactionCommit().
I remember that in previous versions of my patch, even if I
put the waiting for the sync ack directly after
latestXid = RecordTransactionCommit();
in CommitTransaction(), there were cases when I got stuck
backends after a pgbench run. I had the primary and standbys
on the same machine on different ports, so the ack was almost
instant, which wouldn't be the case with a real network. But the
race condition was still there; it just doesn't show up when networks
are slower than memory.  In your patch, the waiting happens
almost at the end of RecordTransactionCommit(), so theoretically
it has the same race condition. Am I missing something?
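
As a standalone C sketch of the register-then-wait split described above (all
names are stand-ins; the flag simulates an ack that arrives before the backend
reaches its wait):

#include <stdio.h>
#include <stdbool.h>

/*
 * "ack_already_received" plays the role of the flag that the ack handling
 * would set in this backend's PGPROC; everything here is a stub.
 */
static bool ack_already_received = false;

static void RegisterXidForSyncAck(void)
{
    puts("register xid for the sync ack (before XLogInsert)");
}

static void WriteAndFlushCommitRecord(void)
{
    puts("XLogInsert + XLogFlush");
    ack_already_received = true;    /* simulate a very fast standby ack */
}

static void SleepUntilAck(void)
{
    puts("sleep until the ack arrives");
}

static void commit_with_early_registration(void)
{
    RegisterXidForSyncAck();        /* done first, so an ack can never be missed */
    WriteAndFlushCommitRecord();
    /* ... visibility, lock release and other cleanup would happen here ... */
    if (ack_already_received)
        puts("ack arrived before the wait: nothing to do");
    else
        SleepUntilAck();
}

int main(void)
{
    commit_with_early_registration();
    return 0;
}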

Best regards,
Zoltán Böszörményi

 Regards,

   




Re: [HACKERS] Synchronization levels in SR

2010-06-04 Thread David Fetter
On Thu, Jun 03, 2010 at 10:57:05PM -0400, Robert Haas wrote:
 On Thu, Jun 3, 2010 at 8:47 PM, Jan Wieck janwi...@yahoo.com wrote:
  What would be the use case for such a query?
 
 Monitoring?

s/\?/!/;

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: [HACKERS] Synchronization levels in SR

2010-06-04 Thread Jan Wieck

On 6/3/2010 10:57 PM, Robert Haas wrote:

On Thu, Jun 3, 2010 at 8:47 PM, Jan Wieck janwi...@yahoo.com wrote:

On 5/27/2010 4:31 PM, Bruce Momjian wrote:

Also, what would be cool would be if you could run a query on the master
to view the SR commit mode of each slave.


What would be the use case for such a query?


Monitoring?


So that justifies adding code, that the community needs to maintain and 
document, to the core system. If only I could find some monitoring case 
for transaction commit orders ... sigh!



Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin



Re: [HACKERS] Synchronization levels in SR

2010-06-04 Thread Robert Haas
On Fri, Jun 4, 2010 at 3:35 PM, Jan Wieck janwi...@yahoo.com wrote:
 On 6/3/2010 10:57 PM, Robert Haas wrote:

 On Thu, Jun 3, 2010 at 8:47 PM, Jan Wieck janwi...@yahoo.com wrote:

 On 5/27/2010 4:31 PM, Bruce Momjian wrote:

 Also, what would be cool would be if you could run a query on the master
 to view the SR commit mode of each slave.

 What would be the use case for such a query?

 Monitoring?

 So that justifies adding code, that the community needs to maintain and
 document, to the core system. If only I could find some monitoring case for
 transaction commit orders ... sigh!

Dude, I'm not the one arguing with you... actually I don't think
anyone really is, any more, except about details.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] Synchronization levels in SR

2010-06-04 Thread Jan Wieck

On 6/4/2010 4:22 PM, Robert Haas wrote:

On Fri, Jun 4, 2010 at 3:35 PM, Jan Wieck janwi...@yahoo.com wrote:

So that justifies adding code, that the community needs to maintain and
document, to the core system. If only I could find some monitoring case for
transaction commit orders ... sigh!


Dude, I'm not the one arguing with you... actually I don't think
anyone really is, any more, except about details.


I know. You actually pretty much defend my case. Sorry for lacking smileys.

This is an old habit I have. A good friend from Germany once suspected 
one of my emails to be a spoof because I actually used a smiley.



Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin



Re: [HACKERS] Synchronization levels in SR

2010-06-03 Thread Jan Wieck

On 5/27/2010 4:31 PM, Bruce Momjian wrote:

Heikki Linnakangas wrote:
BTW, I think we're going to need a separate config file for listing the 
standbys anyway. There you can write per-server rules and options, but 
explicitly knowing about all the standbys also allows the master to 
recycle WAL as soon as it has been streamed to all the registered 
standbys. Currently we just keep wal_keep_segments files around, just in 
case there's a standby out there that needs them.


Ideally we could set 'slave_sync_count' and 'slave_commit_continue_mode'
on the master, and allow the sync/async mode to be set on each slave,
e.g. if slave_sync_count = 2 and slave_commit_continue_mode = #2, then
two slaves with sync mode of #2 or stricter have to complete before the
master can continue.

Naming the slaves on the master seems very confusing because I am
unclear how we would identify named slaves, and the names have to match,
etc.  


Also, what would be cool would be if you could run a query on the master
to view the SR commit mode of each slave.


What would be the use case for such a query?


Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin



Re: [HACKERS] Synchronization levels in SR

2010-06-03 Thread Robert Haas
On Thu, Jun 3, 2010 at 8:47 PM, Jan Wieck janwi...@yahoo.com wrote:
 On 5/27/2010 4:31 PM, Bruce Momjian wrote:

 Heikki Linnakangas wrote:

 BTW, I think we're going to need a separate config file for listing the
 standbys anyway. There you can write per-server rules and options, but
 explicitly knowing about all the standbys also allows the master to recycle
 WAL as soon as it has been streamed to all the registered standbys.
 Currently we just keep wal_keep_segments files around, just in case there's
 a standby out there that needs them.

 Ideally we could set 'slave_sync_count' and 'slave_commit_continue_mode'
 on the master, and allow the sync/async mode to be set on each slave,
 e.g. if slave_sync_count = 2 and slave_commit_continue_mode = #2, then
 two slaves with sync mode of #2 or stricter have to complete before the
 master can continue.

 Naming the slaves on the master seems very confusing because I am
 unclear how we would identify named slaves, and the names have to match,
 etc.
 Also, what would be cool would be if you could run a query on the master
 to view the SR commit mode of each slave.

 What would be the use case for such a query?

Monitoring?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] Synchronization levels in SR

2010-06-02 Thread Greg Smith

Heikki Linnakangas wrote:
The possibilities are endless... Your proposal above covers a pretty 
good set of scenarios, but it's by no means complete. If we try to 
solve everything the configuration will need to be written in a 
Turing-complete Replication Description Language. We'll have to pick a 
useful, easy-to-understand subset that covers the common scenarios. To 
handle the more exotic scenarios, you can write a proxy that sits in 
front of the master, and implements whatever rules you wish, with the 
rules written in C.


I was thinking about this a bit recently.  As I see it, there are three 
fundamental parts of this:


1) We have a transaction that is being committed.  The rest of the 
computations here are all relative to it.


2) There is an (internal?) table that lists the state of each 
replication target relative to that transaction.  It would include the 
node name, perhaps some metadata ('location' seems the one that's most 
likely to help with the remote data center issue), and a state code.  
The codes from http://wiki.postgresql.org/wiki/Streaming_Replication 
work fine for the last part (which is the only dynamic one--everything 
else is static data being joined against):


async=hasn't received yet
recv=been received but just in RAM
fsync=received and synced to disk
apply=applied to the database

These would need to be enums so they can be ordered from lesser to 
greater consistency.


So in a 3 node case, the internal state table might look like this after 
a bit of data had been committed:


node | location | state
-----+----------+------
a    | local    | fsync
b    | remote   | recv
c    | remote   | async

This means that the local node has a fully persistent copy, but the best 
either remote one has done is receive the data; it's not on disk at all 
yet at the remote data center.  Still working its way through.


3) The decision about whether the data has been committed to enough 
places to be considered safe by the master is computed by a function 
that is passed this internal table as something like a SRF, and it 
returns a boolean.  Once that returns true, saying it's satisfied, the 
transaction closes on the master and continues to percolate out from 
there.  If it's false, we wait for another state change to come in and 
return to (2).


I would propose that most behaviors someone has expressed as being their 
desired implementation are possible to implement using this scheme. 

-Semi-sync commit:  proceed as soon as somebody else has a copy and hope 
the copies all become consistent:  EXISTS WHERE state=recv
-Don't proceed until there's an fsync'd commit on at least one of the 
remote nodes:  EXISTS WHERE location='remote' AND state=fsync
-Look for a quorum of n commits of fsync quality:  CASE WHEN (SELECT 
COUNT(*) WHERE state=fsync) > n THEN true else FALSE end;


Syntax is obviously rough but I think you can get the drift of what I'm 
suggesting.
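
Read as code, the decision step could look roughly like the standalone C
sketch below: the StandbyState array stands in for the internal table, and the
two rules correspond to the examples above. All names are hypothetical.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* State codes ordered from lesser to greater consistency. */
typedef enum { ST_ASYNC, ST_RECV, ST_FSYNC, ST_APPLY } StandbyLevel;

typedef struct
{
    const char  *node;
    const char  *location;     /* "local" or "remote" */
    StandbyLevel state;        /* how far this node has the commit */
} StandbyState;

/* "Look for a quorum of n commits of fsync quality or better." */
static bool quorum_fsync_rule(const StandbyState *tab, int ntab, int quorum)
{
    int matches = 0;

    for (int i = 0; i < ntab; i++)
        if (tab[i].state >= ST_FSYNC)
            matches++;
    return matches >= quorum;
}

/* "Don't proceed until there's an fsync'd commit on at least one remote node." */
static bool remote_fsync_rule(const StandbyState *tab, int ntab)
{
    for (int i = 0; i < ntab; i++)
        if (strcmp(tab[i].location, "remote") == 0 && tab[i].state >= ST_FSYNC)
            return true;
    return false;
}

int main(void)
{
    /* The three-node example from the table above. */
    StandbyState tab[] = {
        {"a", "local",  ST_FSYNC},
        {"b", "remote", ST_RECV},
        {"c", "remote", ST_ASYNC},
    };

    printf("quorum of 1 fsync: %s\n", quorum_fsync_rule(tab, 3, 1) ? "commit" : "keep waiting");
    printf("remote fsync:      %s\n", remote_fsync_rule(tab, 3)    ? "commit" : "keep waiting");
    return 0;
}

The committing transaction would re-run its rule each time a new state change
arrives, exactly as step (3) above describes.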


While exposing the local state and running this computation isn't free, 
in situations where there truly are remote nodes being communicated 
with, the network overhead is going to dwarf that.  If there 
were a fast path for the simplest cases and this complicated one for the 
rest, I think you could get the fully programmable behavior some people 
want using simple SQL, rather than having to write a new Replication 
Description Language or something so ambitious.  This data about what's 
been replicated to where looks an awful lot like a set of rows you can 
operate on using features already in the database to me.


--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us




Re: [HACKERS] Synchronization levels in SR

2010-06-02 Thread Heikki Linnakangas

On 02/06/10 10:22, Greg Smith wrote:

Heikki Linnakangas wrote:

The possibilities are endless... Your proposal above covers a pretty
good set of scenarios, but it's by no means complete. If we try to
solve everything the configuration will need to be written in a
Turing-complete Replication Description Language. We'll have to pick a
useful, easy-to-understand subset that covers the common scenarios. To
handle the more exotic scenarios, you can write a proxy that sits in
front of the master, and implements whatever rules you wish, with the
rules written in C.


I was thinking about this a bit recently. As I see it, there are three
fundamental parts of this:

1) We have a transaction that is being committed. The rest of the
computations here are all relative to it.


Agreed.


So in a 3 node case, the internal state table might look like this after
a bit of data had been committed:

node | location | state
-----+----------+------
a    | local    | fsync
b    | remote   | recv
c    | remote   | async

This means that the local node has a fully persistent copy, but the best
either remote one has done is received the data, it's not on disk at all
yet at the remote data center. Still working its way through.

3) The decision about whether the data has been committed to enough
places to be considered safe by the master is computed by a function
that is passed this internal table as something like a SRF, and it
returns a boolean. Once that returns true, saying it's satisfied, the
transaction closes on the master and continues to percolate out from
there. If it's false, we wait for another state change to come in and
return to (2).


You can't implement "wait for X to ack the commit, but if that doesn't 
happen in Y seconds, time out and return true anyway" with that.



While exposing the local state and running this computation isn't free,
in situations where there truly are remote nodes in here being
communicated with the network overhead is going to dwarf that. If there
were a fast path for the simplest cases and this complicated one for the
rest, I think you could get the fully programmable behavior some people
want using simple SQL, rather than having to write a new Replication
Description Language or something so ambitious. This data about what's
been replicated to where looks an awful lot like a set of rows you can
operate on using features already in the database to me.


Yeah, if we want to provide full control over when a commit is 
acknowledged to the client, there's certainly no reason we can't expose 
that using a hook or something.


It's pretty scary to call a user-defined function at that point in 
transaction. Even if we document that you must refrain from doing nasty 
stuff like modifying tables in that function, it's still scary.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Synchronization levels in SR

2010-06-02 Thread Simon Riggs
On Wed, 2010-06-02 at 03:22 -0400, Greg Smith wrote:
 Heikki Linnakangas wrote:
  The possibilities are endless... Your proposal above covers a pretty 
  good set of scenarios, but it's by no means complete. If we try to 
  solve everything the configuration will need to be written in a 
  Turing-complete Replication Description Language. We'll have to pick a 
  useful, easy-to-understand subset that covers the common scenarios. To 
  handle the more exotic scenarios, you can write a proxy that sits in 
  front of the master, and implements whatever rules you wish, with the 
  rules written in C.
 
 I was thinking about this a bit recently.  As I see it, there are three 
 fundamental parts of this:
 
 1) We have a transaction that is being committed.  The rest of the 
 computations here are all relative to it.
 
 2) There is an (internal?) table that lists the state of each 
 replication target relative to that transaction.  It would include the 
 node name, perhaps some metadata ('location' seems the one that's most 
 likely to help with the remote data center issue), and a state code.  
 The codes from http://wiki.postgresql.org/wiki/Streaming_Replication 
 work fine for the last part (which is the only dynamic one--everything 
 else is static data being joined against):
 
 async=hasn't received yet
 recv=been received but just in RAM
 fsync=received and synced to disk
 apply=applied to the database
 
 These would need to be enums so they can be ordered from lesser to 
 greater consistency.
 
 So in a 3 node case, the internal state table might look like this after 
 a bit of data had been committed:
 
 node | location | state
 -----+----------+------
 a    | local    | fsync
 b    | remote   | recv
 c    | remote   | async
 
 This means that the local node has a fully persistent copy, but the best 
 either remote one has done is received the data, it's not on disk at all 
 yet at the remote data center.  Still working its way through.
 
 3) The decision about whether the data has been committed to enough 
 places to be considered safe by the master is computed by a function 
 that is passed this internal table as something like a SRF, and it 
 returns a boolean.  Once that returns true, saying it's satisfied, the 
 transaction closes on the master and continues to percolate out from 
 there.  If it's false, we wait for another state change to come in and 
 return to (2).
 
 I would propose that most behaviors someone has expressed as being their 
 desired implementation is possible to implement using this scheme. 
 
 -Semi-sync commit:  proceed as soon somebody else has a copy and hope 
 the copies all become consistent:  EXISTS WHERE state=recv
 -Don't proceed until there's a fsync'd commit on at least one of the 
 remote nodes:  EXISTS WHERE location='remote' AND state=fsync
 -Look for a quorum of n commits of fsync quality:  CASE WHEN (SELECT 
 COUNT(*) WHERE state=fsync) > n THEN true else FALSE end;
 
 Syntax is obviously rough but I think you can get the drift of what I'm 
 suggesting.
 
 While exposing the local state and running this computation isn't free, 
 in situations where there truly are remote nodes in here being 
 communicated with the network overhead is going to dwarf that.  If there 
 were a fast path for the simplest cases and this complicated one for the 
 rest, I think you could get the fully programmable behavior some people 
 want using simple SQL, rather than having to write a new Replication 
 Description Language or something so ambitious.  This data about what's 
 been replicated to where looks an awful lot like a set of rows you can 
 operate on using features already in the database to me.

I think we're all agreed on the 4 levels: async, recv, fsync, apply.

I also like the concept of a synchronisation/wakeup rule as an abstract
concept. Certainly makes things easier to discuss.

The inputs to the wakeup rule can be defined in different ways. Holding
per-node state at local level looks too complex to me. I'm not
suggesting that we need both per-node AND per-transaction options
interacting at the same time. (That would be a clear argument against
per-transaction options, if that was a requirement - it's not, for me).

There seems to be a simpler way: a service oriented model. The
transaction requests a minimum level of synchronisation, the standbys
together service that request. A simple, clear process:

1. If transaction requests recv, fsync or apply, backend sleeps in the
appropriate queue

2. An agent on behalf of the remote standby provides feedback according
to the levels of service defined for that standby.

3. The agent calls a wakeup-rule to see if the backend can be woken yet

The most basic rule is "first-standby-wakes", meaning that the first
standby to provide feedback that the required synchronisation level has
been met by at least one standby will cause the rule to fire.
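
A standalone C sketch of that service model (hypothetical names; shared
memory, locking and the real queues are omitted). A backend records the level
and LSN it is waiting for, and a standby's feedback wakes any waiter whose
requested level has reached its commit position:

#include <stdio.h>
#include <stdbool.h>

/* Requested service levels, matching the four levels discussed. */
typedef enum { LVL_ASYNC, LVL_RECV, LVL_FSYNC, LVL_APPLY } SyncLevel;

/* One waiting backend; in reality this would live in shared memory. */
typedef struct
{
    int       backend_id;
    SyncLevel requested;       /* level this transaction asked for */
    long      commit_lsn;      /* position that must be acknowledged */
    bool      awake;
} Waiter;

/*
 * Feedback from one standby: how far it has received, fsynced and applied
 * WAL (index LVL_ASYNC is unused).  The first standby whose feedback covers
 * a waiter's request releases that waiter - "first standby wakes".
 */
static void standby_feedback(Waiter *queue, int nwaiters, const long feedback_lsn[4])
{
    for (int i = 0; i < nwaiters; i++)
    {
        Waiter *w = &queue[i];

        if (!w->awake && feedback_lsn[w->requested] >= w->commit_lsn)
        {
            w->awake = true;
            printf("wake backend %d (level %d satisfied up to %ld)\n",
                   w->backend_id, (int) w->requested, w->commit_lsn);
        }
    }
}

int main(void)
{
    Waiter queue[] = {
        {1, LVL_RECV,  100, false},
        {2, LVL_FSYNC, 100, false},
        {3, LVL_APPLY, 100, false},
    };
    /* One standby reports: received to 120, fsynced to 110, applied to 90. */
    long feedback[4] = {0, 120, 110, 90};

    standby_feedback(queue, 3, feedback);   /* wakes backends 1 and 2 only */
    return 0;
}

A quorum variant would keep a per-waiter count and only mark it awake once
enough standbys have reported the required level.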

The next most basic thing is that some standbys can be marked as not
taking part in the quorum 

Re: [HACKERS] Synchronization levels in SR

2010-06-02 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 It's pretty scary to call a user-defined function at that point in 
 transaction.

Not so much pretty scary as zero chance of being accepted.
And I do mean zero.

regards, tom lane



Re: [HACKERS] Synchronization levels in SR

2010-06-02 Thread Greg Smith

Tom Lane wrote:

Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
  
It's pretty scary to call a user-defined function at that point in 
transaction.



Not so much pretty scary as zero chance of being accepted.
And I do mean zero.
  


I swear, you guys are such buzzkills some days.  I was suggesting a 
model for building easy prototypes, and advocating a more formal way to 
explain, in what could be code form, what someone means when they 
suggest a particular quorum model or the like.  Maybe all that will ever 
be exposed into a production server are the best of the hand-written 
implementations, and the scary "try your prototype here" hook only shows 
up in debug builds, or never gets written at all.  I did comment that I 
expected faster built-in implementations to be the primary way these 
would be handled.


From what Heikki said, it sounds like the main thing I didn't 
remember is to include some timestamp information to allow rules based 
on that information too.


--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us




Re: [HACKERS] Synchronization levels in SR

2010-05-27 Thread Simon Riggs
On Thu, 2010-05-27 at 11:28 +0900, Fujii Masao wrote:
 On Wed, May 26, 2010 at 10:20 PM, Simon Riggs si...@2ndquadrant.com wrote:
  On Wed, 2010-05-26 at 18:52 +0900, Fujii Masao wrote:
 
  I guess that dropping the support of #3 doesn't reduce complexity
  since the code of #3 is almost the same as that of #2. Like
  walreceiver sends the ACK after receiving the WAL in #2 case, it has
  only to do the same thing after the WAL flush.
 
  Hmm, well the code for #3 is similar also to the code for #4. So if you
  do #2, its easy to do #2, #3 and #4 together.
 
 No. #4 requires the way of prompt communication between walreceiver and
 startup process, but #2 and #3 not. That is, in #4, walreceiver has to
 wake the startup process up as soon as it has flushed WAL. OTOH, the
 startup process has to wake walreceiver up as soon as it has replayed
 WAL, to request it to send the ACK to the master. In #2 and #3, the
 prompt communication from walreceiver to startup process, i.e., changing
 the poll loop in the startup process would also be useful for the data
 to be visible immediately on the standby. But it's not required.

You need to pass WAL promptly on primary from backend to WALSender.
Whatever mechanism you use can also be reused symmetrically on standby
to provide #4. So not a problem.

  The comment is about whether having #3 makes sense from a user interface
  perspective. It's easy to add options, but they must have useful
  meaning.
 
 #3 would be useful for people wanting further robustness. In #2,
 when simultaneous power failure on the master and the standby,
 and concurrent disk crash on the master happen, transaction whose
 success indicator has been returned to a client might be lost.
 #3 can avoid such a critical situation. This is one of reasons that
 DRBD supports Protocol C, I think.

Which few people use, and if they do it's because DRBD didn't
originally support multiple standbys. Not worth emulating IMHO.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Synchronization levels in SR

2010-05-27 Thread Simon Riggs
On Thu, 2010-05-27 at 02:18 +0300, Heikki Linnakangas wrote:
 On 27/05/10 01:23, Simon Riggs wrote:
  On Thu, 2010-05-27 at 00:21 +0300, Heikki Linnakangas wrote:
  On 26/05/10 23:31, Dimitri Fontaine wrote:
 d. choice of commit or rollback at timeout
 
  Rollback is not an option. There is no going back after the commit
  record has been flushed to disk or sent to a standby.
 
  There's definitely no going back after the xid has been removed from
  procarray because other transactions will then depend upon the final
  state. Currently we PANIC if we abort after we've marked clog, though
  that happens after XLogFlush(), which is where we're planning to wait
  for synch rep. If we abort after having written a commit record to disk
  we can still successfully generate an abort record as well. (Luckily, I
  note HS does actually cope with that. Phew!)
 
  So actually, an abort is a reasonable possibility, though I know it
  doesn't sound like it could be at first thought.

 Hmm, that's an interesting thought. Interesting, as in crazy ;-).

:-) It's a surprising thought for me also.

 I don't understand how HS could handle that. As soon as it sees the 
 commit record, the transaction becomes visible to readers.

I meant not-barf completely.

  The choice is to either commit anyway after the timeout, or wait forever.
 
  Hmm, wait forever. What happens if we try to shutdown fast while there
  is a transaction that is waiting forever? Is that then a commit, even
  though it never made it to the standby? How would we know it was safe to
  switchover or not? Hmm.
 
 Refuse to shut down until the standby acknowledges the commit. That's 
 the only way to be sure..
 
 In practice, hard synchronous "don't return ever until the commit hits 
 the standby" behavior is rarely what admins actually want, because it's 
 disastrous from an availability point of view. More likely, admins want 
 "wait for ack from standby, unless it's not responding, in which case to 
 hell with redundancy and just act like a single server". It makes sense 
 if you just want to make sure that the standby doesn't return stale 
 results when it's working properly, and you're not worried about 
 durability but I'm not sure it's very sound otherwise.

Which is also crazy. If you're using synch rep it's because you care
deeply about durability. Some people wish to treat the COMMIT as a
guarantee, not just a shrug.

I agree that don't-return-ever isn't something anyone will want.

What we need is a COMMIT with ERROR message!

Note that Oracle gives the options of COMMIT | SHUTDOWN at this point.
Shutdown is an implicit abort for the writing transaction...

At this point the primary thinks standby is no longer available. If we
have a split brain situation then we should be assuming we will STONITH
and shutdown the primary anyway. If we have more than one standby we can
stay up and probably shouldn't be sending an abort after a commit.

The trouble is *every* option is crazy from some perspective, so we must
consider them all, to see whether they are practical or impractical.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Synchronization levels in SR

2010-05-27 Thread Heikki Linnakangas

On 27/05/10 09:51, Simon Riggs wrote:

On Thu, 2010-05-27 at 02:18 +0300, Heikki Linnakangas wrote:

In practice, hard synchronous "don't return ever until the commit hits
the standby" behavior is rarely what admins actually want, because it's
disastrous from an availability point of view. More likely, admins want
"wait for ack from standby, unless it's not responding, in which case to
hell with redundancy and just act like a single server". It makes sense
if you just want to make sure that the standby doesn't return stale
results when it's working properly, and you're not worried about
durability but I'm not sure it's very sound otherwise.


Which is also crazy. If you're using synch rep its because you care
deeply about durability.


No, not necessarily. As I said above, you might just want a guarantee 
that *if* you query the standby, you get up-to-date results. But if the 
standby is down for any reason, you don't care about it. That's a very 
sensible mode of operation, for example if you're offloading reads to 
the standby with something like pgpool.


In fact I have the feeling that that's the most common use case for 
synchronous replication, not a deep concern of durability.



I agree that don't-return-ever isn't something anyone will want.

What we need is a COMMIT with ERROR message!


Hmm, perhaps we could emit a warning with the commit. I'm not sure what 
an application could do with it, though.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Synchronization levels in SR

2010-05-27 Thread Fujii Masao
On Wed, May 26, 2010 at 10:37 PM, Simon Riggs si...@2ndquadrant.com wrote:
 If the remote server responded first, then that proves it is a better
 candidate for failover than the one you think of as near. If the two
 standbys vary over time then you have network problems that will
 directly affect the performance on the master; synch_rep = N would
 respond better to any such problems.

No. The remote standby might temporarily respond first even though it's
mostly behind the near one. The read-only queries or incrementally updated
backup operation might cause a bursty disk write, and delay the ACK from
the standby. The lock contention between read-only queries and recovery
would delay the ACK. So the standby which responds first is not always
the best candidate for failover. Also the administrator generally doesn't
put the remote standby under the control of a clusterware like heartbeat.
In this case, the remote standby will never be the candidate for failover.
But quorum commit cannot cover this simple case.

 OTOH, synchronous_replication=2 degrades the
 performance on the master very much.

 Yes, but only because you have only one near standby. It would clearly
 to be foolish to make this setting without 2+ near standbys. We would
 then have 4 or more servers; how do we specify everything for that
 config??

If you always want to use the near standby as the candidate for failover
by using quorum commit in the above simple case, you would need to choose
such a foolish setting. Otherwise, unfortunately you might have to fail over
to the remote standby not under the control of a clusterware.

 synchronous_replication approach
 doesn't seem to cover the typical use case.

 You described the failure modes for the quorum proposal, but avoided
 describing the failure modes for the per-standby proposal.

 Please explain what will happen when the near server is unavailable,
 with per-standby settings. Please also explain what will happen if we
 choose to have 4 or 5 servers to maintain performance in case of the
 near server going down. How will we specify the failure modes?

I'll try to explain that.

(1) most standard case: 1 master + 1 sync standby (near)
When the master goes down, something like a clusterware detects that
failure, and brings the standby online. Since we can ensure that the
standby has all the committed transactions, failover doesn't cause
any data loss.

When the standby goes down or network outage happens, walsender
detects that failure via the replication timeout, keepalive or error
return from the system calls. Then walsender does something according
to the specified reaction (GUC) to the failure of the standby, e.g.,
walsender wakes the transaction commit up from the wait-for-ACK, and
exits. Then the master runs standalone.

(2) 1 master + 1 sync standby (near) + 1 async standby (remote)
When the master goes down, something like a clusterware brings the
sync standby in the near location online. The administrator would
need to take a fresh base backup of the new master, load it on the
remote standby, change the primary_conninfo, and restart the remote
standby.

When one of standbys goes down, walsender does the same thing described
in (1). Until the failed standby has restarted, the master runs together
with another standby.

In (1) and (2), after some failure happens, there would be only one server
which is guaranteed to have all the committed transactions. When it also
goes down, the database service stops. If you want to avoid this fragile
situation, you would need to add one more sync standby in the near site.

(3) 1 master + 2 sync standbys (near) + 1 async standby (remote)
When the master goes down, something like a clusterware brings the
one of sync standbys online by using some selection algorithm.
The administrator would need to take a fresh base backup of the new
master, load it on both remaining standbys, change the primary_conninfo,
and restart them.

When one of standbys goes down, walsender does the same thing described
in (1). Until the failed standby has restarted, the master runs together
with two standbys. At least one standby is guaranteed to be sync with
the master.

Is this explanation enough?

 Also, when synchronous_replication=1 and one of synchronous standbys
 goes down, how should the surviving standby catch up with the master?
 Such standby might be too far behind the master. The transaction commit
 should wait for the ACK from the lagging standby immediately even if
 there might be large gap? If yes, synch_rep_timeout would screw up
 the replication easily.

 That depends upon whether we send the ACK at point #2, #3 or #4. It
 would only cause a problem if you waited until #4.

Yeah, the problem happens. If we implement quorum commit, we need to
design how the surviving standby catches up with the master.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND 

Re: [HACKERS] Synchronization levels in SR

2010-05-27 Thread Fujii Masao
On Thu, May 27, 2010 at 3:21 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Thu, 2010-05-27 at 11:28 +0900, Fujii Masao wrote:
 On Wed, May 26, 2010 at 10:20 PM, Simon Riggs si...@2ndquadrant.com wrote:
  On Wed, 2010-05-26 at 18:52 +0900, Fujii Masao wrote:
 
  I guess that dropping the support of #3 doesn't reduce complexity
  since the code of #3 is almost the same as that of #2. Like
  walreceiver sends the ACK after receiving the WAL in #2 case, it has
  only to do the same thing after the WAL flush.
 
  Hmm, well the code for #3 is similar also to the code for #4. So if you
  do #2, its easy to do #2, #3 and #4 together.

 No. #4 requires the way of prompt communication between walreceiver and
 startup process, but #2 and #3 not. That is, in #4, walreceiver has to
 wake the startup process up as soon as it has flushed WAL. OTOH, the
 startup process has to wake walreceiver up as soon as it has replayed
 WAL, to request it to send the ACK to the master. In #2 and #3, the
 prompt communication from walreceiver to startup process, i.e., changing
 the poll loop in the startup process would also be useful for the data
 to be visible immediately on the standby. But it's not required.

 You need to pass WAL promptly on primary from backend to WALSender.
 Whatever mechanism you use can also be reused symmetrically on standby
 to provide #4. So not a problem.

I cannot be so optimistic since the situation differs from one process
to another.

  The comment is about whether having #3 makes sense from a user interface
  perspective. It's easy to add options, but they must have useful
  meaning.

 #3 would be useful for people wanting further robustness. In #2,
 when simultaneous power failure on the master and the standby,
 and concurrent disk crash on the master happen, transaction whose
 success indicator has been returned to a client might be lost.
 #3 can avoid such a critical situation. This is one of reasons that
 DRBD supports Protocol C, I think.

 Which few people use it, or if they do its because DRBD didn't
 originally support multiple standbys. Not worth emulating IMHO.

If so, #3 would be useful for people who can't afford to buy more
than one standby server, too :)

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Synchronization levels in SR

2010-05-27 Thread Dimitri Fontaine
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 On 26/05/10 23:31, Dimitri Fontaine wrote:
 So if you want simplicity to admin, effective data availability and
 precise control over the global setup, I say go for:
   a. transaction level control of the replication level
   b. cascading support
   c. quorum with timeout
   d. choice of commit or rollback at timeout

 Then give me a setup example that you can't express fully.

 One master, one synchronous standby on another continent for HA purposes,
 and one asynchronous reporting server in the same rack as the master. You
 don't want to set up the reporting server as a cascaded slave of the standby
 on the other continent, because that would double the bandwidth required,
 but you also don't want the master to wait for the reporting server.

 The possibilities are endless... Your proposal above covers a pretty good
 set of scenarios, but it's by no means complete. If we try to solve
 everything the configuration will need to be written in a Turing-complete
 Replication Description Language. We'll have to pick a useful,
 easy-to-understand subset that covers the common scenarios. To handle the
 more exotic scenarios, you can write a proxy that sits in front of the
 master, and implements whatever rules you wish, with the rules written
 in C.

Agreed on the Turing-completeness side of those things. My current
thinking is that the proxy I want might simply be a PostgreSQL instance
with cascading support. In your example that would give us:

                      +--> Remote Standby, HA
  Master --> Proxy ---+
                      +--> Local Standby, Reporting

So what I think we have here is a pretty good trade-off in terms of what
you can do with some simple setup knobs. What's left there is that with
the quorum idea, you're not sure if the one server that's synced is the
remote or local standby, in this example. Several ideas are floating
around (votes, mixed per-standby and per-transaction settings).

Maybe we could have the standby be able to say it's not interested in
participating in the quorum, that is, it's an async replica, full
stop.

In your example we'd set the local reporting standby as a non-voting
member of the replication setting, the proxy and the master would have a
quorum of 1, and the remote HA standby would vote.

I don't think the idea of having any number of voting coupons other than
0 or 1 on any server will help us in the least.

I do think that your proxy idea is a great one and should be in core. By
the way, the cascading/proxy instance could be set up without Hot Standby,
if you don't need to be able to monitor it via a libpq connection and
some queries.

 BTW, I think we're going to need a separate config file for listing the
 standbys anyway. There you can write per-server rules and options, but
 explicitly knowing about all the standbys also allows the master to recycle
 WAL as soon as it has been streamed to all the registered
 standbys. Currently we just keep wal_keep_segments files around, just in
 case there's a standby out there that needs them.

I much prefer that each server in the set publish what it wants; it only
connects to one given provider. We've also been talking about this exact
same retention problem for queueing solutions, with Jan, Marko and Jim.

The idea we came up with is a watermarking solution (which already
exists in Skytools 3, in its coarse-grain version). The first approach
is to have every slave give back to its local master/provider/origin the
last replayed WAL/LSN once in a while. From that you derive a global
watermark and drop WAL files depending on it.
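
As a standalone C sketch (names invented), the coarse-grained global watermark
is simply the minimum of the replay positions the standbys report back; WAL
entirely below it is safe to drop:

#include <stdio.h>
#include <stdint.h>

typedef uint64_t XLogPosSim;    /* stand-in for a WAL position / LSN */

/* The master may recycle any WAL older than the smallest reported position. */
static XLogPosSim global_watermark(const XLogPosSim *reported, int nstandbys)
{
    XLogPosSim wm = UINT64_MAX;

    for (int i = 0; i < nstandbys; i++)
        if (reported[i] < wm)
            wm = reported[i];
    return wm;
}

int main(void)
{
    XLogPosSim reports[] = { 5200, 4800, 5100 };

    printf("WAL can be recycled up to %llu\n",
           (unsigned long long) global_watermark(reports, 3));
    return 0;
}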

You now have two problems: running out of space, and why keep that many
files on the master anyway when maybe some slave could be set up for
retention instead?

To solve that, it's possible for each server to be set up with a
restricted set of servers it derives its watermark from. That's
when you need per-server options and an explicit list of all the
standbys whatever their level in the cascading tree. That means explicit
maintenance of the entire replication topology.

I don't think we need to solve that just yet. I think we need to provide
an option on each member of the replication tree to either PANIC or lose
WALs in case it's running out of space when trying to follow the
watermark. It's crude, but it already allows having a standby set up to
maintain the common archive and having the master drop the WAL files as
soon as possible (respecting wal_keep_segments).

In our case, if a WAL file is no longer available from any active server
we still have the option to fetch it from the archives...

Regards,
-- 
Dimitri Fontaine
PostgreSQL DBA, Architecte



Re: [HACKERS] Synchronization levels in SR

2010-05-27 Thread Simon Riggs
On Thu, 2010-05-27 at 10:09 +0300, Heikki Linnakangas wrote:
 No, not necessarily. As I said above, you might just want a guarantee 
 that *if* you query the standby, you get up-to-date results.

Of course. COMMIT was already one of the options, so this comment was
already understood.

What we are discussing is whether additional options exist and/or are
desirable. We should not be forcing everybody to COMMIT whether or not
it is robust.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Synchronization levels in SR

2010-05-27 Thread Simon Riggs
On Thu, 2010-05-27 at 16:35 +0900, Fujii Masao wrote:
 On Thu, May 27, 2010 at 3:21 PM, Simon Riggs si...@2ndquadrant.com wrote:
  On Thu, 2010-05-27 at 11:28 +0900, Fujii Masao wrote:
  On Wed, May 26, 2010 at 10:20 PM, Simon Riggs si...@2ndquadrant.com 
  wrote:
   On Wed, 2010-05-26 at 18:52 +0900, Fujii Masao wrote:
  
   I guess that dropping the support of #3 doesn't reduce complexity
   since the code of #3 is almost the same as that of #2. Like
   walreceiver sends the ACK after receiving the WAL in #2 case, it has
   only to do the same thing after the WAL flush.
  
   Hmm, well the code for #3 is similar also to the code for #4. So if you
   do #2, its easy to do #2, #3 and #4 together.
 
  No. #4 requires the way of prompt communication between walreceiver and
  startup process, but #2 and #3 not. That is, in #4, walreceiver has to
  wake the startup process up as soon as it has flushed WAL. OTOH, the
  startup process has to wake walreceiver up as soon as it has replayed
  WAL, to request it to send the ACK to the master. In #2 and #3, the
  prompt communication from walreceiver to startup process, i.e., changing
  the poll loop in the startup process would also be useful for the data
  to be visible immediately on the standby. But it's not required.
 
  You need to pass WAL promptly on primary from backend to WALSender.
  Whatever mechanism you use can also be reused symmetrically on standby
  to provide #4. So not a problem.
 
 I cannot be so optimistic since the situation differs from one process
 to another.

This spurs some architectural thinking:

I think we need to disconnect the idea of waiting in any of the
components. Anytime we ask WALSender or WALReceiver to wait for
acknowledgement we will be reducing throughput. So we should assume that
they will continue to work as quickly as possible.

The acknowledgement from standby can contain the latest xlog location of
WAL received, WAL written to disk and WAL applied, all by reading values
from shared memory. It's all the same, whether we send back 2 or 3 xlog
locations in the ack message.

Who sends the ack message? Who receives it? Would it be easier to have
this happen in a second pair of processes WALSynchroniser (on primary)
and WALAcknowledger (on standby). WALAcknowledger would send back a
stream of ack messages with latest xlog positions. WALSynchroniser would
receive these messages and wake up sleeping backends. If we did that
then there'd be almost no change at all to existing code, just
additional code and processes for the sync case. Code would be separate
and there would be no performance concerns either.

Backends can then choose to wait until the xlog location they wish has
been achieved which might be in the next acknowledgement message or in a
subsequent one. That also ensures that the logic for this is completely
on the master and the standby doesn't act differently, apart from
needing to start a WALAcknowledger process if sync rep is requested.
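
For illustration only, a hypothetical layout of such an ack message and the
check a waiting backend would apply to it (plain standalone C, not from any
patch):

#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogPosSim;    /* stand-in for an xlog location */

/*
 * One acknowledgement in the stream described above: the standby reports how
 * far it has received, written (fsynced) and applied WAL, all read from its
 * shared memory.
 */
typedef struct
{
    XLogPosSim received;
    XLogPosSim written;
    XLogPosSim applied;
} StandbyAck;

/* A backend waiting for fsync-level replication only cares about 'written'. */
static int commit_is_covered(const StandbyAck *ack, XLogPosSim commit_pos)
{
    return ack->written >= commit_pos;
}

int main(void)
{
    StandbyAck ack = { .received = 2000, .written = 1500, .applied = 1200 };

    printf("commit at 1400 covered: %d\n", commit_is_covered(&ack, 1400));
    printf("commit at 1800 covered: %d\n", commit_is_covered(&ack, 1800));
    return 0;
}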

If you do choose to make #3 important, then I'd say you need to work out
how to make WALWriter active as well, so it can perform regular fsyncs,
rather than having WALReceiver wait across that I/O.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Synchronization levels in SR

2010-05-27 Thread Simon Riggs
On Thu, 2010-05-27 at 16:13 +0900, Fujii Masao wrote:
 On Wed, May 26, 2010 at 10:37 PM, Simon Riggs si...@2ndquadrant.com wrote:

  Please explain what will happen when the near server is unavailable,
  with per-standby settings. Please also explain what will happen if we
  choose to have 4 or 5 servers to maintain performance in case of the
  near server going down. How will we specify the failure modes?
 
 I'll try to explain that.

We've been discussing parameters and how we would define what we want to
happen in various scenarios.

You've not explained what parameters you would use, how and where they
would be set, so we aren't yet any closer to understanding what it is
you're proposing.

Please explain how your proposal will work.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Synchronization levels in SR

2010-05-27 Thread Simon Riggs
On Thu, 2010-05-27 at 16:13 +0900, Fujii Masao wrote:
 On Wed, May 26, 2010 at 10:37 PM, Simon Riggs si...@2ndquadrant.com wrote:
  If the remote server responded first, then that proves it is a better
  candidate for failover than the one you think of as near. If the two
  standbys vary over time then you have network problems that will
  directly affect the performance on the master; synch_rep = N would
  respond better to any such problems.
 
 No. The remote standby might respond first temporarily though it's almost
 behind the near one. The read-only queries or incrementally updated
 backup operation might cause a bursty disk write, and delay the ACK from
 the standby. The lock contention between read-only queries and recovery
 would delay the ACK. So the standby which responds first is not always
 the best candidate for failover. 

Seems strange. If you have 2 standbys and you say you would like node1
to be the preferred candidate, then you load it so heavily that a remote
server with by-definition much larger network delay responds first, then
I say your preference was wrong. The above situation is caused by the
DBA and the DBA can solve it also - if the preference is to keep a
preferred server then that server would need to be lightly loaded so
it could respond sensibly. 

This is the same thing as having an optimizer pick the best path and
then the user saying no dumb-ass, use the index I tell you even though
it is slower. If you really don't want to know the fastest way, then I
personally will agree you can have that, as is my view (now) on the
optimizer issue also - sometimes the admin does know best.

 Also the administrator generally doesn't
 put the remote standby under the control of a clusterware like heartbeat.
 In this case, the remote standby will never be the candidate for failover.
 But quorum commit cannot cover this simple case.

If you, Jan and Yeb wish to completely exclude standbys from being part
of any quorum, then I guess we need to have per-standby settings to
allow that to be defined. I'm in favour of giving people options. That
needn't be a mandatory per-standby setting, just a non-default option,
so that we can reduce the complexity of configuration for common cases.
If we're looking for simplest-implementation-first that isn't it.

Currently, Oracle provides these settings, which correspond to 
Maximum Performance = quorum = 0
Maximum Availability = quorum = 1, timeout_action = commit
Maximum Protection = quorum = 1, timeout_action = shutdown

So Oracle already supports the quorum case...

Oracle doesn't provide
i) any capability to have quorum > 1
ii) any capability to include an async node as a sync node, if the
quorum cannot be reached with servers marked sync, or in the situation
where because of mis-use/mis-configuration the sync servers are actually
slower.
iii) ability to wait for apply
iv) ability to specify wait mode at transaction level

all of those are desirable in some cases and easily possible by
specifying things in the way I've suggested.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Synchronization levels in SR

2010-05-27 Thread Fujii Masao
On Thu, May 27, 2010 at 6:30 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Who sends the ack message?

walreceiver

 Who receives it?

walsender

 Would it be easier to have
 this happen in a second pair of processes WALSynchroniser (on primary)
 and WAL Acknowledger (on standby). WALAcknowledger would send back a
 stream of ack messages with latest xlog positions. WALSynchroniser would
 receive these messages and wake up sleeping backends. If we did that
 then there'd be almost no change at all to existing code, just
 additional code and processes for the sync case. Code would be separate
 and there would be no performance concerns either.

No, this seems to be a bad idea. We should not establish an extra connection
between servers. That would be a source of trouble.

 If you do choose to make #3 important, then I'd say you need to work out
 how to make WALWriter active as well, so it can perform regular fsyncs,
 rather than having WALReceiver wait across that I/O.

Yeah, this might be an option for optimization though I'm not sure how
it has good effect.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Synchronization levels in SR

2010-05-27 Thread Simon Riggs
On Thu, 2010-05-27 at 19:21 +0900, Fujii Masao wrote:
 On Thu, May 27, 2010 at 6:30 PM, Simon Riggs si...@2ndquadrant.com wrote:

  Would it be easier to have
  this happen in a second pair of processes WALSynchroniser (on primary)
  and WAL Acknowledger (on standby). WALAcknowledger would send back a
  stream of ack messages with latest xlog positions. WALSynchroniser would
  receive these messages and wake up sleeping backends. If we did that
  then there'd be almost no change at all to existing code, just
  additional code and processes for the sync case. Code would be separate
  and there would be no performance concerns either.
 
 No, this seems to be a bad idea. We should not establish an extra connection
 between servers. That would be a source of trouble.

What kind of trouble? You think using an extra connection would cause
problems; why?

I've explained it would greatly simplify the code to do it that way and
improve performance. Those sound like good things, not problems.

  If you do choose to make #3 important, then I'd say you need to work out
  how to make WALWriter active as well, so it can perform regular fsyncs,
  rather than having WALReceiver wait across that I/O.
 
 Yeah, this might be an option for optimization though I'm not sure how
 it has good effect.

As I said, WALreceiver would not need to wait across fsync...

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Synchronization levels in SR

2010-05-27 Thread Fujii Masao
On Thu, May 27, 2010 at 7:21 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Seems strange. If you have 2 standbys and you say you would like node1
 to be the preferred candidate, then you load it so heavily that a remote
 server with by-definition much larger network delay responds first, then
 I say your preference was wrong. The above situation is caused by the
 DBA and the DBA can solve it also - if the preference is to keep a
 preferred server then that server would need to be lightly loaded so
 it could respond sensibly.

No. Even if the load is very low in the preferred server, there is
*no* guarantee that it responds first. Per-standby setting can give
such a guarantee, i.e., we can specify #2, #3 or #4 in the preferred
server and #1 in the other.

 This is the same thing as having an optimizer pick the best path and
 then the user saying no dumb-ass, use the index I tell you even though
 it is slower. If you really don't want to know the fastest way, then I
 personally will agree you can have that, as is my view (now) on the
 optimizer issue also - sometimes the admin does know best.

I think that choosing the wrong master causes a more serious situation than
choosing a wrong plan.

 Also the administrator generally doesn't
 put the remote standby under the control of a clusterware like heartbeat.
 In this case, the remote standby will never be the candidate for failover.
 But quorum commit cannot cover this simple case.

 If you, Jan and Yeb wish to completely exclude standbys from being part
 of any quorum, then I guess we need to have per-standby settings to
 allow that to be defined. I'm in favour of giving people options. That
 needn't be a mandatory per-standby setting, just a non-default option,
 so that we can reduce the complexity of configuration for common cases.
 If we're looking for simplest-implementation-first that isn't it.

For now, I agree that we support a quorum commit feature for 9.1 or later.
But I don't think that it's simpler, more intuitive and easier-to-understand
than per-standby setting. So I think that we should include the per-standby
setting in the first patch.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Synchronization levels in SR

2010-05-27 Thread Fujii Masao
On Thu, May 27, 2010 at 7:33 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Thu, 2010-05-27 at 19:21 +0900, Fujii Masao wrote:
 On Thu, May 27, 2010 at 6:30 PM, Simon Riggs si...@2ndquadrant.com wrote:

  Would it be easier to have
  this happen in a second pair of processes WALSynchroniser (on primary)
  and WAL Acknowledger (on standby). WALAcknowledger would send back a
  stream of ack messages with latest xlog positions. WALSynchroniser would
  receive these messages and wake up sleeping backends. If we did that
  then there'd be almost no change at all to existing code, just
  additional code and processes for the sync case. Code would be separate
  and there would be no performance concerns either.

 No, this seems to be a bad idea. We should not establish an extra connection
 between servers. That would be a source of trouble.

 What kind of trouble? You think using an extra connection would cause
 problems; why?

Because the number of connection failure cases doubles. Likewise, the number
of process failure cases would double.

  If you do choose to make #3 important, then I'd say you need to work out
  how to make WALWriter active as well, so it can perform regular fsyncs,
  rather than having WALReceiver wait across that I/O.

 Yeah, this might be an option for optimization though I'm not sure how
 it has good effect.

 As I said, WALreceiver would not need to wait across fsync...

Right, but walreceiver still needs to wait for WAL flush by walwriter.
If currently WAL flush is the dominant workload for walreceiver,
only leaving it to walwriter might not have so good effect. I'm not sure
whether.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Synchronization levels in SR

2010-05-27 Thread Dimitri Fontaine
Simon Riggs si...@2ndquadrant.com writes:
 Seems strange. If you have 2 standbys and you say you would like node1
 to be the preferred candidate, then you load it so heavily that a remote
 server with by-definition much larger network delay responds first, then
 I say your preference was wrong.

There's a communication mismatch here I think. The problem with the
dynamic aspect of the system is that the admin wants to plan ahead and
choose in advance the failover server.

Other than that I much prefer the automatic and dynamic quorum idea.

 If you, Jan and Yeb wish to completely exclude standbys from being part
 of any quorum, then I guess we need to have per-standby settings to
 allow that to be defined. I'm in favour of giving people options. That
 needn't be a mandatory per-standby setting, just a non-default option,
 so that we can reduce the complexity of configuration for common
 cases.

+1

 Maximum Performance = quorum = 0
 Maximum Availability = quorum = 1, timeout_action = commit
 Maximum Protection = quorum = 1, timeout_action = shutdown

+1

Being able to say that a given server is not allowed to participate in the
vote that reaches the global durability quorum will allow choosing the
failover candidates.

Now you're able to have this reporting server and know for sure that
your sync replicated transactions are not waiting for it.

To summarize, the current per-transaction approach would be :

 - transaction level replication synchronous behaviour
 - proxy/cascading in core
 - quorum setup for deciding any commit is safe
 - any server can be excluded from the sync quorum
 - timeout can still raise an exception or ignore (commit)?

This last point seems to need some more discussion, or I didn't
understand well the current positions and proposals.

Regards,
-- 
dim

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Synchronization levels in SR

2010-05-27 Thread Robert Haas
On Thu, May 27, 2010 at 3:13 AM, Fujii Masao masao.fu...@gmail.com wrote:
 (1) most standard case: 1 master + 1 sync standby (near)
    When the master goes down, something like a clusterware detects that
    failure, and brings the standby online. Since we can ensure that the
    standby has all the committed transactions, failover doesn't cause
    any data loss.

How do you propose to guarantee that?  ISTM that you have to either
commit locally first, or send the commit to the remote first.  Either
way, the two events won't occur exactly simultaneously.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Synchronization levels in SR

2010-05-27 Thread Simon Riggs
On Thu, 2010-05-27 at 20:13 +0900, Fujii Masao wrote:
 On Thu, May 27, 2010 at 7:33 PM, Simon Riggs si...@2ndquadrant.com wrote:
  On Thu, 2010-05-27 at 19:21 +0900, Fujii Masao wrote:
  On Thu, May 27, 2010 at 6:30 PM, Simon Riggs si...@2ndquadrant.com wrote:
 
   Would it be easier to have
   this happen in a second pair of processes WALSynchroniser (on primary)
   and WAL Acknowledger (on standby). WALAcknowledger would send back a
   stream of ack messages with latest xlog positions. WALSynchroniser would
   receive these messages and wake up sleeping backends. If we did that
   then there'd be almost no change at all to existing code, just
   additional code and processes for the sync case. Code would be separate
   and there would be no performance concerns either.
 
  No, this seems to be bad idea. We should not establish extra connection
  between servers. That would be a source of trouble.
 
  What kind of trouble? You think using an extra connection would cause
  problems; why?
 
 Because the number of connection failure cases doubles. Likewise, the number
 of process failure cases would double.

Not really. The users wait for just the synchroniser to return not for
two things.  Looks to me that other processes are independent of each
other. Very simple.

   If you do choose to make #3 important, then I'd say you need to work out
   how to make WALWriter active as well, so it can perform regular fsyncs,
   rather than having WALReceiver wait across that I/O.
 
  Yeah, this might be an option for optimization though I'm not sure how
  it has good effect.
 
  As I said, WALreceiver would not need to wait across fsync...
 
 Right, but walreceiver still needs to wait for WAL flush by walwriter.

Why does it? I just explained a design where that wasn't required.

 If currently WAL flush is the dominant workload for walreceiver,
 only leaving it to walwriter might not have so good effect. I'm not sure
 whether.

If we're not sure, we could check before agreeing a design.

WAL flush will be costly unless you have huge disk cache.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Synchronization levels in SR

2010-05-27 Thread Simon Riggs
On Thu, 2010-05-27 at 19:50 +0900, Fujii Masao wrote:

 For now, I agree that we support a quorum commit feature for 9.1 or later.
 But I don't think that it's simpler, more intuitive and easier-to-understand
 than per-standby setting. So I think that we should include the per-standby
 setting in the first patch.

There already is a first patch to the community that implements quorum
commit, just not by you.

If you have a better way, describe it in detail and in full now, with
reference to each of the use cases you mentioned, so that people get a
chance to give their opinions on your design. Then we can let the
community decide whether or not that second way is actually better. We
may not need a second patch.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Synchronization levels in SR

2010-05-27 Thread Fujii Masao
On Thu, May 27, 2010 at 8:28 PM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, May 27, 2010 at 3:13 AM, Fujii Masao masao.fu...@gmail.com wrote:
 (1) most standard case: 1 master + 1 sync standby (near)
    When the master goes down, something like a clusterware detects that
    failure, and brings the standby online. Since we can ensure that the
    standby has all the committed transactions, failover doesn't cause
    any data loss.

 How do you propose to guarantee that?  ISTM that you have to either
 commit locally first, or send the commit to the remote first.  Either
 way, the two events won't occur exactly simultaneously.

Letting the transaction wait until the standby has received / flushed /
replayed the WAL before it returns a success indicator to a client
would guarantee that. This ensures that all transactions which a client
knows as committed exist in the memory or disk of the standby. So we
would be able to see those transactions from new master after failover.
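
As an illustration of that ordering, here is a toy C sketch of the commit
path (not the real code; wait_for_standby_ack and the level names are
invented): success is reported to the client only after the standby has
confirmed the WAL, so every commit the client sees is recoverable on the
standby.

#include <stdio.h>
#include <stdbool.h>

typedef enum { LEVEL_RECV, LEVEL_FSYNC, LEVEL_REPLAY } SyncLevel;

/* Pretend the standby has confirmed the WAL up to the requested level. */
static bool
wait_for_standby_ack(SyncLevel level)
{
    /* In reality this blocks until walsender reports the matching ack. */
    printf("standby confirmed level %d\n", (int) level);
    return true;
}

static bool
commit_transaction(SyncLevel level)
{
    printf("write commit record to local WAL\n");
    printf("fsync local WAL\n");

    /*
     * Do not report success yet: first make sure the standby has at least
     * received (or flushed / replayed, depending on the level) the WAL.
     */
    if (!wait_for_standby_ack(level))
        return false;

    printf("report commit success to the client\n");
    return true;
}

int
main(void)
{
    /* Every commit the client ever sees as successful is on the standby. */
    commit_transaction(LEVEL_FSYNC);
    return 0;
}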

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Synchronization levels in SR

2010-05-27 Thread Fujii Masao
On Thu, May 27, 2010 at 8:30 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Why does it? I just explained a design where that wasn't required.

Hmm.. my expression might have been ambiguous. Walreceiver still needs
to wait for WAL flush by walwriter *before* sending the ACK to the master,
in #3 case. Because, in #3, the master has to wait until the standby has
flushed the WAL.
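
A small toy sketch of where the ACK is sent on the standby in the #2 and #3
cases (invented names; in reality walreceiver and walwriter are separate
processes, so the wait in #3 crosses a process boundary):

#include <stdio.h>

typedef enum { ACK_AFTER_RECV, ACK_AFTER_FSYNC } AckPoint;

static void send_ack(const char *when) { printf("ACK sent (%s)\n", when); }

static void
handle_wal_chunk(AckPoint ack_point)
{
    printf("receive WAL from master\n");

    if (ack_point == ACK_AFTER_RECV)
        send_ack("after receive");        /* case #2: no flush wait */

    printf("flush WAL to disk\n");        /* done by walreceiver or walwriter */

    if (ack_point == ACK_AFTER_FSYNC)
        send_ack("after fsync");          /* case #3: must wait for the flush */
}

int
main(void)
{
    handle_wal_chunk(ACK_AFTER_RECV);
    handle_wal_chunk(ACK_AFTER_FSYNC);
    return 0;
}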

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Synchronization levels in SR

2010-05-27 Thread Fujii Masao
On Thu, May 27, 2010 at 8:30 PM, Simon Riggs si...@2ndquadrant.com wrote:
 There already is a first patch to the community that implements quorum
 commit, just not by you.

Yeah, AFAIK, that patch also includes a per-standby setting.

 If you have a better way, describe it in detail and in full now, with
 reference to each of the use cases you mentioned, so that people get a
 chance to give their opinions on your design. Then we can let the
 community decide whether or not that second way is actually better. We
 may not need a second patch.

See http://archives.postgresql.org/pgsql-hackers/2010-05/msg01407.php

But I think that we should focus on per-standby setting at first.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Synchronization levels in SR

2010-05-27 Thread Robert Haas
On Thu, May 27, 2010 at 8:02 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Thu, May 27, 2010 at 8:28 PM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, May 27, 2010 at 3:13 AM, Fujii Masao masao.fu...@gmail.com wrote:
 (1) most standard case: 1 master + 1 sync standby (near)
    When the master goes down, something like a clusterware detects that
    failure, and brings the standby online. Since we can ensure that the
    standby has all the committed transactions, failover doesn't cause
    any data loss.

 How do you propose to guarantee that?  ISTM that you have to either
 commit locally first, or send the commit to the remote first.  Either
 way, the two events won't occur exactly simultaneously.

 Letting the transaction wait until the standby has received / flushed /
 replayed the WAL before it returns a success indicator to a client
 would guarantee that. This ensures that all transactions which a client
 knows as committed exist in the memory or disk of the standby. So we
 would be able to see those transactions from new master after failover.

There could still be additional transactions that the original master
has committed locally but were not acked to the client.  I guess you'd
just work around that by taking a new base backup from the new master.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Synchronization levels in SR

2010-05-27 Thread Fujii Masao
On Thu, May 27, 2010 at 9:48 PM, Robert Haas robertmh...@gmail.com wrote:
 There could still be additional transactions that the original master
 has committed locally but were not acked to the client.  I guess you'd
 just work around that by taking a new base backup from the new master.

Right.

Unfortunately the transaction aborted for a client might have already
been committed in the standby. In this case, we might need to eliminate
the mismatch of transaction status between a client and new master
after failover.

BTW, the similar situation might happen even when only one server is
running. If the server goes down before returning a success to a
client after flushing the commit record, the mismatch would happen
after restart of the server.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Synchronization levels in SR

2010-05-27 Thread Robert Haas
On Thu, May 27, 2010 at 9:09 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Thu, May 27, 2010 at 9:48 PM, Robert Haas robertmh...@gmail.com wrote:
 There could still be additional transactions that the original master
 has committed locally but were not acked to the client.  I guess you'd
 just work around that by taking a new base backup from the new master.

 Right.

 Unfortunately the transaction aborted for a client might have already
 been committed in the standby. In this case, we might need to eliminate
 the mismatch of transaction status between a client and new master
 after failover.

 BTW, the similar situation might happen even when only one server is
 running. If the server goes down before returning a success to a
 client after flushing the commit record, the mismatch would happen
 after restart of the server.

True.  But that's a slightly different case.  Clients could fail to
receive commit ACKs for a variety of reasons, like losing network
connectivity momentarily.  They had better be prepared for that no
matter whether replication is in use or not.  The new issue that
replication adds is that you've got to make sure that the two (or n)
nodes don't disagree with each other.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Synchronization levels in SR

2010-05-27 Thread Bruce Momjian
Simon Riggs wrote:
 On Wed, 2010-05-26 at 18:52 +0900, Fujii Masao wrote:
 
  I guess that dropping the support of #3 doesn't reduce complexity
  since the code of #3 is almost the same as that of #2. Like
  walreceiver sends the ACK after receiving the WAL in #2 case, it has
  only to do the same thing after the WAL flush.
 
 Hmm, well the code for #3 is similar also to the code for #4. So if you
 do #2, its easy to do #2, #3 and #4 together.
 
 The comment is about whether having #3 makes sense from a user interface
 perspective. It's easy to add options, but they must have useful
 meaning.

If the slave is running read-only queries, #3 is the most reliable option
without delaying the slave, so there is a use case for #3.

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Synchronization levels in SR

2010-05-27 Thread Bruce Momjian
Heikki Linnakangas wrote:
 BTW, I think we're going to need a separate config file for listing the 
 standbys anyway. There you can write per-server rules and options, but 
 explicitly knowing about all the standbys also allows the master to 
 recycle WAL as soon as it has been streamed to all the registered 
 standbys. Currently we just keep wal_keep_segments files around, just in 
 case there's a standby out there that needs them.

Ideally we could set 'slave_sync_count' and 'slave_commit_continue_mode'
on the master, and allow the sync/async mode to be set on each slave,
e.g. if slave_sync_count = 2 and slave_commit_continue_mode = #2, then
two slaves with sync mode of #2 or stricter have to complete before the
master can continue.
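
A toy C sketch of how that check on the master could look, using the
hypothetical slave_sync_count and slave_commit_continue_mode knobs from the
paragraph above (illustration only, not a patch):

#include <stdio.h>
#include <stdbool.h>

/* Modes as numbered in this thread: higher means stricter. */
enum { MODE_ASYNC = 1, MODE_RECV = 2, MODE_FSYNC = 3, MODE_REPLAY = 4 };

typedef struct Slave
{
    const char *name;
    int         mode;     /* per-slave sync/async mode */
    bool        acked;    /* has it acknowledged this commit yet? */
} Slave;

/* true once enough slaves at or above the required mode have acked */
static bool
quorum_reached(const Slave *slaves, int n,
               int slave_sync_count, int slave_commit_continue_mode)
{
    int acks = 0;

    for (int i = 0; i < n; i++)
        if (slaves[i].mode >= slave_commit_continue_mode && slaves[i].acked)
            acks++;

    return acks >= slave_sync_count;
}

int
main(void)
{
    Slave slaves[] = {
        {"near1",  MODE_FSYNC, true},
        {"near2",  MODE_RECV,  false},
        {"remote", MODE_ASYNC, true},    /* async ack never counts here */
    };

    printf("can commit continue? %s\n",
           quorum_reached(slaves, 3, 2, MODE_RECV) ? "yes" : "no");
    return 0;
}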

Naming the slaves on the master seems very confusing because I am
unclear how we would identify named slaves, and the names have to match,
etc.  

Also, what would be cool would be if you could run a query on the master
to view the SR commit mode of each slave.

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Synchronization levels in SR

2010-05-27 Thread Robert Haas

On May 27, 2010, at 4:31 PM, Bruce Momjian br...@momjian.us wrote:

 Heikki Linnakangas wrote:
  BTW, I think we're going to need a separate config file for listing the
  standbys anyway. There you can write per-server rules and options, but
  explicitly knowing about all the standbys also allows the master to
  recycle WAL as soon as it has been streamed to all the registered
  standbys. Currently we just keep wal_keep_segments files around, just in
  case there's a standby out there that needs them.

 Ideally we could set 'slave_sync_count' and 'slave_commit_continue_mode'
 on the master, and allow the sync/async mode to be set on each slave,
 e.g. if slave_sync_count = 2 and slave_commit_continue_mode = #2, then
 two slaves with sync mode of #2 or stricter have to complete before the
 master can continue.

 Naming the slaves on the master seems very confusing because I am
 unclear how we would identify named slaves, and the names have to match,
 etc.

The names could be configured with a GUC on the slaves, or we could
base it on the login role.

...Robert

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Synchronization levels in SR

2010-05-26 Thread Simon Riggs
On Tue, 2010-05-25 at 23:59 -0400, Robert Haas wrote:
 Quorum commit is definitely an extra knob, IMHO.

No, it's about three less, as I have explained.

Explain your position, don't just demand others listen.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Synchronization levels in SR

2010-05-26 Thread Simon Riggs
On Wed, 2010-05-26 at 13:03 +0900, Fujii Masao wrote:
 On Wed, May 26, 2010 at 1:04 AM, Simon Riggs si...@2ndquadrant.com wrote:
  On Tue, 2010-05-25 at 12:40 +0900, Fujii Masao wrote:
  On Tue, May 25, 2010 at 10:29 AM, Josh Berkus j...@agliodbs.com wrote:
   I agree that #4 should be done last, but it will be needed, not in the
   least by your employer ;-) .  I don't see any obvious way to make #4
   compatible with any significant query load on the slave, but in general
   I'd think that users of #4 are far more concerned with 0% data loss than
   they are with getting the slave to run read queries.
 
  Since #2 and #3 are enough for 0% data loss, I think that such users
  would be more concerned about what results are visible in the standby.
  No?
 
  Please add #4 also. You can do that easily at the same time as #2 and
  #3, and it will leave me free to fix the perceived conflict problems.
 
 I think that we should implement the feature in small steps rather than
 submit one big patch at a time. So I'd like to focus on #2 and #3 at first,
 and #4 later (maybe third or fourth CF).

We both know if you do #2 and #3 then doing #4 also is trivial.

If you leave it out then we'll end up missing something that is required
and have to rework everything.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Synchronization levels in SR

2010-05-26 Thread Simon Riggs
On Wed, 2010-05-26 at 12:36 +0900, Fujii Masao wrote:
 On Wed, May 26, 2010 at 2:10 AM, Simon Riggs si...@2ndquadrant.com wrote:
  My suggestion is simply to have a single parameter (name unimportant)
 
  number_of_synch_servers_we_wait_for = N
 
  which is much easier to understand because it is phrased in terms of the
  guarantee given to the transaction, not in terms of what the admin
  thinks is the situation.
 
 How can we choose #2, #3 or #4 by using your proposed option?

 If async, the standby never sends any ACK. If recv, fsync,
 or redo, the standby sends the ACK when it has received, fsynced
 or replayed the WAL from the master, respectively.

Everything I've said about per-standby settings applies here, which
was based upon having just 2 settings: sync and async. If you have four
settings instead, things get even more complex. If we were going to
reduce complexity, it would be to reduce the number of options here to
just offering option #2 in the first phase.

AFAICS people would only ever select #2 or #4 anyway. IMHO #3 isn't
likely to be selected on its own because it performs badly for no real
benefit. Having two standbys, I might want to specify #2 to both, or if
one is down then #3 to the remaining standby instead.

Nobody else has yet tried to explain how we would specify what happens
when one of the standbys is down, with per-standby settings. Failure
modes are where the complexity is here. However we proceed, we must have
a discussion about how we specify the failure modes. This is not
something we should add on at the last minute, we should think about
that now and address it openly.

Oracle Data Guard is a great resource for what semantics we might need
to cover, but it's also a lesson in complexity from its per-standby
settings. Please look at net_timeout and alternate options in
particular. See how difficult it is to specify failure modes, even
though Data Guard offers probably dozens of parameters and options - its
orientation is per-standby not towards the transaction and the user.

 On the other hand, we add new GUC max_synchronous_standbys
 (I prefer it to number_of_synch_servers_we_wait_for, but does
 anyone have better name?) as PGC_USERSET into postgresql.conf.
 It specifies the maximum number of standbys which transaction
 commit must wait for the ACK from.
 
 If max_synchronous_standbys is 0, no transaction commit waits for
 ACK even if some connected standbys set their replication_mode to
 recv, fsync or redo. If it's positive, transaction commit waits
 for N ACKs. N is the smaller number between max_synchronous_standbys
 and the actual number of connected synchronous standbys.

To summarise, I think we can get away with just 3 parameters:
synchronous_replication = N # similar in name to synchronous_commit
synch_rep_timeout = T
synch_rep_timeout_action = commit | abort

Conceptually, this is I want at least N replica copies made of my
database changes, I will wait for up to T milliseconds to get that
otherwise I will do X. Very easy and clear for an application to
understand what guarantees it is requesting. Also very easy for the
administrator to understand the guarantees requested and how to
provision for them: to deliver robustness they typically need N+1
servers, or for even higher levels of robustness and performance N+2
etc..

Making synchronous_replication into a USERSET would be an industry
first: transaction controlled robustness at every level.
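
For illustration, a toy C sketch of what those three parameters would mean
for a committing backend (the loop and names are invented; it only models
the semantics described above, not an implementation):

#include <stdio.h>
#include <stdbool.h>

typedef enum { TIMEOUT_COMMIT, TIMEOUT_ABORT } TimeoutAction;

/* Pretend poll: how many standby acks have arrived so far? */
static int acks_received(void) { static int acks = 0; return acks++; }

/*
 * Wait for synchronous_replication acks, but no longer than
 * synch_rep_timeout milliseconds; on timeout do the configured action.
 * Returns true if the transaction should be reported as committed.
 */
static bool
wait_for_replicas(int synchronous_replication,
                  int synch_rep_timeout_ms,
                  TimeoutAction synch_rep_timeout_action)
{
    for (int waited = 0; waited < synch_rep_timeout_ms; waited++)
    {
        if (acks_received() >= synchronous_replication)
            return true;                  /* N replica copies confirmed */
        /* sleep 1ms here in a real implementation */
    }

    return synch_rep_timeout_action == TIMEOUT_COMMIT;
}

int
main(void)
{
    bool ok = wait_for_replicas(1, 30, TIMEOUT_COMMIT);

    printf("commit %s\n", ok ? "acknowledged" : "aborted");
    return 0;
}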

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Synchronization levels in SR

2010-05-26 Thread Fujii Masao
On Wed, May 26, 2010 at 5:02 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Everything I've said about per-standby settings applies here, which
 was based upon having just 2 settings: sync and async. If you have four
 settings instead, things get even more complex. If we were going to
 reduce complexity, it would be to reduce the number of options here to
 just offering option #2 in the first phase.

 AFAICS people would only ever select #2 or #4 anyway. IMHO #3 isn't
 likely to be selected on its own because it performs badly for no real
 benefit. Having two standbys, I might want to specify #2 to both, or if
 one is down then #3 to the remaining standby instead.

I guess that dropping the support of #3 doesn't reduce complexity since
the code of #3 is almost the same as that of #2. Like walreceiver sends
the ACK after receiving the WAL in #2 case, it has only to do the same
thing after the WAL flush.

 Nobody else has yet tried to explain how we would specify what happens
 when one of the standbys is down, with per-standby settings. Failure
 modes are where the complexity is here. However we proceed, we must have
 a discussion about how we specify the failure modes. This is not
 something we should add on at the last minute, we should think about
 that now and address it openly.

 Imagine having 2 standbys, 1 synch, 1 async. If the synch server goes
 down, performance will improve and robustness will have been lost. What
 good would that be?

Are you concerned about the case you described in another post?
In that case, if you want to ensure robustness, you can specify #2, #3
or #4 in both standbys. If one of the standbys is in a remote site, we can
additionally set max_synchronous_standbys to 1. If you don't want to
fail over to the standby in the remote site when the master goes down, you
can specify #1 in the remote standby, so the standby in the near location
is always guaranteed to be in sync with the master.

 Oracle Data Guard is a great resource for what semantics we might need
 to cover, but it's also a lesson in complexity from its per-standby
 settings. Please look at net_timeout and alternate options in
 particular. See how difficult it is to specify failure modes, even
 though Data Guard offers probably dozens of parameters and options - its
 orientation is per-standby not towards the transaction and the user.

Yeah, I'll research Oracle Data Guard.

 To summarise, I think we can get away with just 3 parameters:
 synchronous_replication = N     # similar in name to synchronous_commit
 synch_rep_timeout = T
 synch_rep_timeout_action = commit | abort

I agree to add the latter two parameters, which are also listed on
my outline of SynchRep.
http://wiki.postgresql.org/wiki/Streaming_Replication#Synchronization_capability

 Conceptually, this is I want at least N replica copies made of my
 database changes, I will wait for up to T milliseconds to get that
 otherwise I will do X. Very easy and clear for an application to
 understand what guarantees it is requesting. Also very easy for the
 administrator to understand the guarantees requested and how to
 provision for them: to deliver robustness they typically need N+1
 servers, or for even higher levels of robustness and performance N+2
 etc..

I don't feel that synchronous_replication approach is intuitive for
the administrator. Even on this thread, some people seem to prefer
per-standby setting.

Without per-standby setting, when there are two standbys, one is in
the near rack and another is in remote site, synchronous_replication=1
cannot guarantee that the near standby is always synch with the master.
So when the master goes down, unfortunately we might have to failover to
the remote standby. OTOH, synchronous_replication=2 degrades the
performance on the master very much. synchronous_replication approach
doesn't seem to cover the typical use case.

Also, when synchronous_replication=1 and one of synchronous standbys
goes down, how should the surviving standby catch up with the master?
Such standby might be too far behind the master. The transaction commit
should wait for the ACK from the lagging standby immediately even if
there might be large gap? If yes, synch_rep_timeout would screw up
the replication easily.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Synchronization levels in SR

2010-05-26 Thread Robert Haas
On Wed, May 26, 2010 at 2:31 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Tue, 2010-05-25 at 23:59 -0400, Robert Haas wrote:
 Quorum commit is definitely an extra knob, IMHO.

 No, it's about three less, as I have explained.

 Explain your position, don't just demand others listen.

OK.  In words of one syllable, your way still has all the same knobs,
plus some more.

You sketched out a design which still had a per-standby setting for
each standby, but IN ADDITION had a setting to control
quorum commit[1].  You also argued that we needed four options for
each transaction rather than three[2], and that we need a userset GUC
to control the behavior on a per-transaction basis[3].  Not one other
person has agreed that we need all of these options in the first
version of the patch.  We don't.  We can start with a sync rep patch
that does ONE thing and does it well, and we can add these other
things later. I don't think I'm going too far out on a limb when I say
that it is easier to get a smaller patch committed than it is to get a
bigger one committed, and it is less likely to have bugs.

[1] http://archives.postgresql.org/pgsql-hackers/2010-05/msg01347.php
[2] http://archives.postgresql.org/pgsql-hackers/2010-05/msg01333.php
[3] http://archives.postgresql.org/pgsql-hackers/2010-05/msg01334.php

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Synchronization levels in SR

2010-05-26 Thread Alastair Turner
A suggestion, based on what I believe would be ideal default settings
for a fully developed SR capability. The thought being that as long as
the default behaviour was stable additional knobs could be added
across version boundaries without causing trouble.

Per slave the master needs to know:
 - The identity of the slave, even if only to limit who can replicate
(this will have to be specified)
 - Whether to expect an acknowledgement from the slave (as will this)
 - How long to wait for the acknowledgement (this may be a default)
 - What the slave is expected to do before acknowledging (I think this
should default to remote flush to disk - #3 in the mail which started
this thread - since it prevents data loss without exposing the master
to the possibility of locking delays)

Additionally the process on the master requires:
 - How many acknowledgments to require before declaring success
(defaulted to the number of servers expected to acknowledge since it
will cause the fewest surprises when failing over to a replica)
 - What to do if the number of acknowledgments is not received
(defaulting to abort/rollback since this is really what differentiates
synchronous from asynchronous replication - the certainty that once
data has been committed it can be recovered)

So in order to set up synchronous replication all a DBA would have to
specify is the slave server, that it is expected to send
acknowledgments and possibly a timeout.

If this is in fact a desirable state for the default behaviour or
minimum settings requirement then I would say it is also a desirable
target for the first patch.

Alastair Bell Turner
^F5

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Synchronization levels in SR

2010-05-26 Thread Simon Riggs
On Wed, 2010-05-26 at 18:52 +0900, Fujii Masao wrote:

 I guess that dropping the support of #3 doesn't reduce complexity
 since the code of #3 is almost the same as that of #2. Like
 walreceiver sends the ACK after receiving the WAL in #2 case, it has
 only to do the same thing after the WAL flush.

Hmm, well the code for #3 is similar also to the code for #4. So if you
do #2, its easy to do #2, #3 and #4 together.

The comment is about whether having #3 makes sense from a user interface
perspective. It's easy to add options, but they must have useful
meaning.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Synchronization levels in SR

2010-05-26 Thread Simon Riggs
On Wed, 2010-05-26 at 18:52 +0900, Fujii Masao wrote:

  To summarise, I think we can get away with just 3 parameters:
  synchronous_replication = N # similar in name to synchronous_commit
  synch_rep_timeout = T
  synch_rep_timeout_action = commit | abort
 
 I agree to add the latter two parameters, which are also listed on
 my outline of SynchRep.
 http://wiki.postgresql.org/wiki/Streaming_Replication#Synchronization_capability
 
  Conceptually, this is I want at least N replica copies made of my
  database changes, I will wait for up to T milliseconds to get that
  otherwise I will do X. Very easy and clear for an application to
  understand what guarantees it is requesting. Also very easy for the
  administrator to understand the guarantees requested and how to
  provision for them: to deliver robustness they typically need N+1
  servers, or for even higher levels of robustness and performance N+2
  etc..
 
 I don't feel that synchronous_replication approach is intuitive for
 the administrator. Even on this thread, some people seem to prefer
 per-standby setting.

Maybe they do, but that is because nobody has yet explained how you
would handle failure modes with per-standby settings. When you do they
will likely change their minds. Put the whole story on the table before
trying to force a decision.

 Without per-standby setting, when there are two standbys, one is in
 the near rack and another is in remote site, synchronous_replication=1
 cannot guarantee that the near standby is always synch with the master.
 So when the master goes down, unfortunately we might have to failover to
 the remote standby. 

If the remote server responded first, then that proves it is a better
candidate for failover than the one you think of as near. If the two
standbys vary over time then you have network problems that will
directly affect the performance on the master; synch_rep = N would
respond better to any such problems.

 OTOH, synchronous_replication=2 degrades the
 performance on the master very much. 

Yes, but only because you have only one near standby. It would clearly
be foolish to make this setting without 2+ near standbys. We would
then have 4 or more servers; how do we specify everything for that
config??

 synchronous_replication approach
 doesn't seem to cover the typical use case.

You described the failure modes for the quorum proposal, but avoided
describing the failure modes for the per-standby proposal.

Please explain what will happen when the near server is unavailable,
with per-standby settings. Please also explain what will happen if we
choose to have 4 or 5 servers to maintain performance in case of the
near server going down. How will we specify the failure modes?

 Also, when synchronous_replication=1 and one of synchronous standbys
 goes down, how should the surviving standby catch up with the master?
 Such standby might be too far behind the master. The transaction commit
 should wait for the ACK from the lagging standby immediately even if
 there might be large gap? If yes, synch_rep_timeout would screw up
 the replication easily.

That depends upon whether we send the ACK at point #2, #3 or #4. It
would only cause a problem if you waited until #4.

I've explained why I have made the proposals I've done so far: reduced
complexity in failure modes and better user control. To understand that
better, you or somebody needs to explain how we would handle the failure
modes with per-standby settings so we can compare.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Synchronization levels in SR

2010-05-26 Thread Simon Riggs
On Wed, 2010-05-26 at 07:10 -0400, Robert Haas wrote:

 OK.  In words of one syllable, your way still has all the same knobs,
 plus some more.

I explained how the per-standby settings would take many parameters,
whereas per-transaction settings take far fewer.

 You sketched out a design which still had a per-standby setting for
 each standby, but IN ADDITION had a setting for a setting to control
 quorum commit[1].

No, you misread it. Again. The parameters were not IN ADDITION -
obviously so, otherwise I wouldn't claim there were fewer, would I?

Your reply has again avoided the subject of how we would handle failure
modes with per-standby settings. That is important.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Synchronization levels in SR

2010-05-26 Thread Robert Haas
On Wed, May 26, 2010 at 9:54 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Wed, 2010-05-26 at 07:10 -0400, Robert Haas wrote:

 OK.  In words of one syllable, your way still has all the same knobs,
 plus some more.

 I explained how the per-standby settings would take many parameters,
 whereas per-transaction settings take far fewer.

 You sketched out a design which still had a per-standby setting for
 each standby, but IN ADDITION had a setting for a setting to control
 quorum commit[1].

 No, you misread it. Again. The parameters were not IN ADDITION -
 obviously so, otherwise I wouldn't claim there were fewer, would I?

Well, that does seem logical, but I can't figure out how to reconcile
that with what you wrote before, because as far as I can see you're
just saying over and over again that your way will need fewer
parameters without explaining which parameters your way won't need.

And frankly, I don't think it's possible for quorum commit to reduce
the number of parameters.  Even if we have that feature available, not
everyone will want to use it.  And the people who don't will
presumably need whatever parameters they would have needed if quorum
commit hadn't been available in the first place.

 Your reply has again avoided the subject of how we would handle failure
 modes with per-standby settings. That is important.

I don't think anyone is avoiding that, we just haven't discussed it.
The thing is, I don't think quorum commit actually does anything to
address that problem.  If I have a master and a standby configured for
sync rep and the standby goes down, we have to decide what impact that
has on the master.  If I have a master and two standbys configured for
sync rep with quorum commit such that I only need an ack from one of
them, and they both go down, we still have to decide what impact that
has on the master.  I agree we need to talk about it, but I don't agree
that putting in quorum commit will remove the need to design that
case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Synchronization levels in SR

2010-05-26 Thread Heikki Linnakangas

On 26/05/10 18:31, Robert Haas wrote:

 And frankly, I don't think it's possible for quorum commit to reduce
 the number of parameters.  Even if we have that feature available, not
 everyone will want to use it.  And the people who don't will
 presumably need whatever parameters they would have needed if quorum
 commit hadn't been available in the first place.


Agreed, quorum commit is not a panacea.

For example, suppose that you have two servers, master and a standby, 
and you want transactions to be synchronously committed to both, so that 
in the event of a meteor striking the master, you don't lose any 
transactions that have been replied to the client as committed.


Now you want to set up a temporary replica of the master at a 
development server, for testing purposes. If you set quorum to 2, your 
development server becomes critical infrastructure, which is not what 
you want. If you set quorum to 1, it also becomes critical 
infrastructure, because it's possible that a transaction has been 
replicated to the test server but not the real production standby, and a 
meteor strikes.


Per-standby settings would let you express that, but OTOH not the quorum 
behavior where you require N out of M standbys to acknowledge the commit 
before returning to the client.


There's really no limit to how complex a setup can be. For example, 
imagine that you have two data centers, with two servers in each. You 
want to replicate the master to all four servers, but for commit to 
return to the client, it's enough that the transaction has been 
replicated to one server in each data center. How do you express that in 
the config file? And it would be nice to have per-transaction control 
too, like with synchronous_commit...


So this is a tradeoff between
* flexibility, how complex a setup you can express?
* code complexity, how complicated is it to implement?
* user-friendliness, how easy is it to configure?

One way out of this is to implement something very simple in PostgreSQL, 
and build external WAL proxying tools in pgfoundry that allow you to 
cascade and disseminate the WAL in as complex scenarios as you want.



  Your reply has again avoided the subject of how we would handle failure
  modes with per-standby settings. That is important.

 I don't think anyone is avoiding that, we just haven't discussed it.
 The thing is, I don't think quorum commit actually does anything to
 address that problem.  If I have a master and a standby configured for
 sync rep and the standby goes down, we have to decide what impact that
 has on the master.  If I have a master and two standbys configured for
 sync rep with quorum commit such that I only need an ack from one of
 them, and they both go down, we still have to decide what impact that
 has on the master.  I agree we need to talk about it, but I don't agree
 that putting in quorum commit will remove the need to design that
 case.


Right, failure modes need to be discussed, but how quorum commit or 
whatnot is configured is irrelevant to that.


No-one has come up with a scheme on how to abort a transaction if you 
don't get a reply from a synchronous standby (or all standbys or a 
quorum of standbys). Until someone does, a commit on the master will 
have to always succeed. The synchronous aspect will provide a 
guarantee that if a standby is connected, any transaction in the master 
will become visible (or fsync'd or just streamed to, depending on the 
level) on the standby too before it's acknowledged as committed to the 
client, nothing more, nothing less.


One way to do that would be to refrain from flushing the commit record 
to disk on the master until the standby has acknowledged it. The 
downside is that the master is in a very severe state at that point: 
until you flush the WAL, you can buffer only a small amount WAL traffic 
until you run out of wal_buffers, stalling all write activity in the 
master, with backends waiting. You can't even shut down the server 
cleanly. But if you value your transaction integrity much higher than 
availability, maybe that's what you want.


PS. I whole-heartedly agree with Simon's concern upthread that if we 
allow a standby to specify in its config file that it wants to be a 
synchronous standby, that's a bit dangerous because connecting such a 
standby to the master will suddenly make all commits on the master a lot 
slower. Adding a synchronous standby should require some action in the 
master, since it affects the behavior on master.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Synchronization levels in SR

2010-05-26 Thread Kevin Grittner
Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:
 
 One way to do that would be to refrain from flushing the commit
 record to disk on the master until the standby has acknowledged
 it.
 
I'm not clear on the benefit of doing that, versus flushing the
commit record and then waiting for responses.  Either way some
databases will commit before others -- what is the benefit of having
the master lag?
 
 Adding a synchronous standby should require some action in the 
 master, since it affects the behavior on master.
 
+1
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Synchronization levels in SR

2010-05-26 Thread Heikki Linnakangas

On 26/05/10 20:10, Kevin Grittner wrote:

 Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:

  One way to do that would be to refrain from flushing the commit
  record to disk on the master until the standby has acknowledged
  it.

 I'm not clear on the benefit of doing that, versus flushing the
 commit record and then waiting for responses.  Either way some
 databases will commit before others -- what is the benefit of having
 the master lag?


Hmm, I was going to answer that that way no other transactions can see 
the transaction as committed before it has been safely replicated, but I 
now realize that you could also flush, but refrain from releasing the 
entry from procarray until the standby acknowledges the commit, so the 
transaction would look like in-progress to other transactions in the 
master until that.


Although, if the master crashes at that point, and quickly recovers, you 
could see the last transactions committed on the master before they're 
replicated to the standby.
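
For illustration, a toy C sketch of that second ordering: flush locally,
wait for the standby, and only then make the commit visible to other
backends (invented names, not the actual commit code; the comment notes the
crash caveat above):

#include <stdio.h>
#include <stdbool.h>

static bool
standby_acknowledged(void)
{
    /* Blocks until the standby reports the commit record as received/flushed. */
    return true;
}

static void
commit_with_visibility_hold(void)
{
    printf("1. write commit record\n");
    printf("2. fsync commit record locally\n");

    /* Other backends still see this transaction as in progress. */
    if (standby_acknowledged())
        printf("3. mark transaction as no longer running (procarray)\n");

    printf("4. report success to the client\n");

    /*
     * Caveat from the discussion: after a crash between steps 2 and 3 the
     * recovered master will still show the transaction as committed even
     * though the standby may not have it yet.
     */
}

int
main(void)
{
    commit_with_visibility_hold();
    return 0;
}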


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Synchronization levels in SR

2010-05-26 Thread Simon Riggs
On Wed, 2010-05-26 at 11:31 -0400, Robert Haas wrote:
  Your reply has again avoided the subject of how we would handle failure
  modes with per-standby settings. That is important.
 
 I don't think anyone is avoiding that, we just haven't discussed it.

You haven't discussed it, but even before you do, you know it's better.
Not very compelling perspective... 

 The thing is, I don't think quorum commit actually does anything to
 address that problem.  If I have a master and a standby configured for
 sync rep and the standby goes down, we have to decide what impact that
 has on the master.  If I have a master and two standbys configured for
 sync rep with quorum commit such that I only need an ack from one of
 them, and they both go down, we still have to decide what impact that
 has on the master.  

That's already been discussed, and AFAIK Masao and I already agreed on
how that would be handled in the quorum commit case.

What we haven't had explained is how you would handle all the sub cases
or failure modes for the per-standby situation.

The most common case for synch rep IMHO is this:

* 2 near standbys, 1 remote. Want to be able to ACK to first near
standby that responds, or if both are down, ACK to the remote.

I've proposed a way of specifying that with 3 simple parameters, e.g.
synch_rep_acks = 1
synch_rep_timeout = 30
synch_rep_timeout_action = commit

In Oracle this would be all of the following

* all nodes given unique names
DB_UNIQUE_NAME=master
DB_UNIQUE_NAME=near1
DB_UNIQUE_NAME=near2
DB_UNIQUE_NAME=remote

* parameter settings
LOG_ARCHIVE_CONFIG='DG_CONFIG=(master,near1, near2, remote)'

LOG_ARCHIVE_DEST_2='SERVICE=near1 SYNC AFFIRM NET_TIMEOUT=30
DB_UNIQUE_NAME=near1'
LOG_ARCHIVE_DEST_STATE_2='ENABLE'

LOG_ARCHIVE_DEST_3='SERVICE=near2 SYNC AFFIRM NET_TIMEOUT=30
DB_UNIQUE_NAME=near2'
LOG_ARCHIVE_DEST_STATE_3='ENABLE'

LOG_ARCHIVE_DEST_4='SERVICE=remote ASYNC NOAFFIRM DB_UNIQUE_NAME=remote'
LOG_ARCHIVE_DEST_STATE_4='ENABLE'

* modes
ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE AVAILABILITY;


The Oracle way doesn't allow you to specify that, if near1 and near2 are
down, we should continue to SYNC via the remote; nor does it allow you
to specify things from the user's perspective or at the transaction level.

You don't need to do it that way, for sure. But we do need to hear which
way you would pick, rather than just arguing against me before you've
even discussed it here or off-list.

 I agree we need to talk about, but I don't agree
 that putting in quorum commit will remove the need to design that
 case.

Yes, you need to design for that case. It's not a magic wand.

All I've said is that covering the common cases is easier and more
flexible with a transaction-centric style of parameters, which also
allows user-settable behaviour.

I want to do better than Oracle, if possible, using lessons learned. I
don't want to do the same thing because we're copying them or because
we're going down the same conceptual dead end they went down. We should
try to avoid doing something obvious and aim a little higher.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Synchronization levels in SR

2010-05-26 Thread Simon Riggs
On Wed, 2010-05-26 at 12:10 -0500, Kevin Grittner wrote:
  Adding a synchronous standby should require some action in the 
  master, since it affects the behavior on master.
  
 +1

+1
 
-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Synchronization levels in SR

2010-05-26 Thread Kevin Grittner
Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:
 
 Although, if the master crashes at that point, and quickly
 recovers, you could see the last transactions committed on the
 master before they're replicated to the standby.
 
Versus having the transaction committed on one or more slaves but
not on the master?  Unless we have a transaction manager and do
proper distributed transactions, how do you avoid edge conditions
like that?
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Synchronization levels in SR

2010-05-26 Thread Simon Riggs
On Wed, 2010-05-26 at 19:55 +0300, Heikki Linnakangas wrote:

 Now you want to set up a temporary replica of the master at a 
 development server, for testing purposes. If you set quorum to 2, your
 development server becomes critical infrastructure, which is not what 
 you want.

That's a good argument for standby relays. Nobody hooks a disposable
test machine into a critical production config without expecting it to
have some effect.

 If you set quorum to 1, it also becomes critical 
 infrastructure, because it's possible that a transaction has been 
 replicated to the test server but not the real production standby, and
 a meteor strikes.

Why would you not want to use the test server? If it's the only thing
left protecting you, and you wish to be protected, then it sounds very
cool to me. In my proposal this test server only gets data ahead of
other things if the real production standby responds too slowly.

It really scares people that a DBA can take down a server and
suddenly the sync protection you thought you had is turned off. That way
of doing things means an application never knows the protection level
any piece of data has had. App designers want to be able to mark things
"handle with care" or "just do it quick, don't care much". It's a real
pain to have to handle all your data the same way, and for that to be
selectable only by administrators, who may or may not have everything
configured correctly/available.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Synchronization levels in SR

2010-05-26 Thread Robert Haas
On Wed, May 26, 2010 at 1:24 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 26/05/10 20:10, Kevin Grittner wrote:

 Heikki Linnakangasheikki.linnakan...@enterprisedb.com  wrote:

 One way to do that would be to refrain from flushing the commit
 record to disk on the master until the standby has acknowledged
 it.

 I'm not clear on the benefit of doing that, versus flushing the
 commit record and then waiting for responses.  Either way some
 databases will commit before others -- what is the benefit of having
 the master lag?

 Hmm, I was going to answer that that way no other transactions can see the
 transaction as committed before it has been safely replicated, but I now
 realize that you could also flush, but refrain from releasing the entry from
 procarray until the standby acknowledges the commit, so the transaction
 would look like in-progress to other transactions in the master until that.

 Although, if the master crashes at that point, and quickly recovers, you
 could see the last transactions committed on the master before they're
 replicated to the standby.

No matter what you do, there are going to be corner cases where one node
thinks the transaction committed and the other node doesn't know.  At
any given time, we're either in a state where a crash and restart on
the master will replay the commit record, or we're not.  And, somewhat
independently, we're either in a state where a crash on the standby
will replay the commit record, or we're not.  Each of these depends
on a disk write, and there's no way to guarantee that both of those
disk writes succeed or that both of them fail.

Now, in theory, maybe you could have a system where we don't have a
fixed definition of who the master is.  If either server crashes or if
they lose communication, both crash.  If they both come back up, they
agree on who has the higher LSN on disk and both roll forward to that
point, then designate one server to be the master.  If one comes back
up and can't reach the other, it appeals to the clusterware for help.
The clusterware is then responsible for shooting one node in the head
and telling the other node to carry on as the sole survivor.  When,
eventually, the dead node is resurrected, it *discards* any WAL
written after the point from which the new master restarted.
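
A heavily simplified sketch of that hypothetical scheme might look like
the following; every name below is invented purely to illustrate the
steps described above, so don't read it as a concrete design.

    /* Illustrative pseudocode only; all of these helpers are invented. */
    static void
    RecoverTwoNodeCluster(Node *self, Node *peer)
    {
        if (CanReach(peer))
        {
            /* Both nodes roll forward to the higher of the two on-disk
             * LSNs, then one of them is designated the new master. */
            XLogRecPtr  target = Max(LastFlushedLSN(self), LastFlushedLSN(peer));

            RollForwardTo(self, target);
            RollForwardTo(peer, target);
            DesignateMaster(self, peer);
        }
        else
        {
            /* Can't see the peer: ask the clusterware to fence one node
             * and let the survivor carry on as sole master.  When the
             * fenced node comes back, it discards any WAL written after
             * the point from which the new master restarted. */
            AskClusterware(self);
        }
    }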

Short of that, I don't think "abort the transaction" is a recovery
mechanism for when we can't get hold of a standby.  We're going to
have to commit locally first and then we can decide how long to wait
for an ACK that a standby has also committed the same transaction
remotely.  We can wait not at all, forever, or for a while and then
declare the other guy dead.
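
Those three choices boil down to something like the following once the
local flush has completed; again, this is just an illustrative C sketch
with invented names (the enum, GUC and helpers), not a proposal for
actual code.

    /* Illustrative sketch only; the enum, GUC and helpers are invented. */
    XLogFlush(commit_lsn);              /* commit locally first, always */

    switch (standby_ack_policy)
    {
        case ACK_WAIT_NONE:
            break;                      /* don't wait at all (async) */

        case ACK_WAIT_FOREVER:
            WaitForStandbyAck(commit_lsn, -1);   /* block indefinitely */
            break;

        case ACK_WAIT_TIMEOUT:
            if (!WaitForStandbyAck(commit_lsn, ack_timeout_ms))
                elog(WARNING, "standby not responding, continuing without ack");
            break;
    }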

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

