Re: [HACKERS] Sync Rep at Oct 5

2010-10-07 Thread Fujii Masao
On Wed, Oct 6, 2010 at 4:06 PM, Simon Riggs si...@2ndquadrant.com wrote:
 The problem is how much WAL is stored on (any) node. Currently that is
 wal_keep_segments, which doesn't work very well, but I've seen no better
 ideas that cover all important cases.

What about allowing the master to read and send WAL from the archive?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Sync Rep at Oct 5

2010-10-07 Thread Steve Singer

On 10-10-07 05:52 AM, Fujii Masao wrote:

On Wed, Oct 6, 2010 at 4:06 PM, Simon Riggs si...@2ndquadrant.com wrote:

The problem is how much WAL is stored on (any) node. Currently that is
wal_keep_segments, which doesn't work very well, but I've seen no better
ideas that cover all important cases.


What about allowing the master to read and send WAL from the archive?

Regards,


Then you have to deal with telling the archive how long it needs to keep
WAL segments, because the master might ask for them back.  If the archive
is remote from the master, you also get some extra network copying.  It
would be better to let the slave being reconfigured read the missing WAL
from the archive itself.
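
For reference, the standby side can already do that today with something
like the following recovery.conf (a minimal sketch; host and paths are
placeholders): with both settings present, the standby falls back to
restore_command whenever streaming can't supply a segment.

  standby_mode = 'on'
  # stream from the master when possible
  primary_conninfo = 'host=master.example.com port=5432 user=replication'
  # fall back to the archive for segments the master no longer has
  restore_command = 'cp /mnt/archive/%f "%p"'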







Re: [HACKERS] Sync Rep at Oct 5

2010-10-07 Thread Fujii Masao
On Thu, Oct 7, 2010 at 8:46 PM, Steve Singer ssin...@ca.afilias.info wrote:
 Then you have to deal with telling the archive how long it needs to keep WAL
 segments because the master might ask for them back.

Yeah, it's not easy to determine how long we should keep the archived
WAL files. We need to work out which WAL files are deletable based on
the progress of each standby and on any stored base backups that might
still be used for PITR.

 If the archive is
 remote from the master then you have some extra network copying going on.

Yep, so I think that the master (i.e., walsender) should read the archived
WAL files using the restore_command specified by the user. If the archive is
remote from the master, we would need to specify something like scp in
restore_command. And even if the archived WAL files are compressed (say by
pg_compress), the master can decompress them with pg_decompress in
restore_command before transferring them.
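
Just as a sketch of what such a restore_command might look like for a
remote, compressed archive (host, path and tools are only examples, and the
master-side use of restore_command is of course the part being proposed
here, not something that exists today):

  restore_command = 'scp archive.example.com:/wal_archive/%f.gz /tmp/%f.gz && gunzip -c /tmp/%f.gz > "%p"'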

 It would be better to let the slave being reconfigured read the missing
 WAL from the archive itself.

That's one of the choices.

But I've heard that some people don't want to set up a shared archive area
that can be accessed by both the master and the standby. For example, they
feel it's too complex to configure an NFS server, or passwordless scp, just
to share the archived WAL files.

Currently we have to increase wal_keep_segments to work around that problem.
But the pg_xlog disk space is usually small and not suited to keeping many
WAL files, so we might not be able to increase wal_keep_segments far enough.
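(For a sense of scale: with the default 16MB segment size, setting
wal_keep_segments to 1000 pins roughly 16GB in pg_xlog.)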

If we allow the master to stream WAL files from the archive, we don't need
to increase wal_keep_segments or set up such a complex configuration. So I
think this idea is a useful option.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Sync Rep at Oct 5

2010-10-07 Thread Robert Haas
On Thu, Oct 7, 2010 at 9:08 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Thu, Oct 7, 2010 at 8:46 PM, Steve Singer ssin...@ca.afilias.info wrote:
 Then you have to deal with telling the archive how long it needs to keep WAL
 segments because the master might ask for them back.

 Yeah, it's not easy to determine how long we should keep the archived
 WAL files. We need to work out which WAL files are deletable based on
 the progress of each standby and on any stored base backups that might
 still be used for PITR.

 If the archive is
 remote from the master then you have some extra network copying going on.

 Yep, so I think that the master (i.e., walsender) should read the archived
 WAL files using the restore_command specified by the user. If the archive is
 remote from the master, we would need to specify something like scp in
 restore_command. And even if the archived WAL files are compressed (say by
 pg_compress), the master can decompress them with pg_decompress in
 restore_command before transferring them.

 It would be better to let the slave being reconfigured read the missing
 WAL from the archive itself.

 That's one of the choices.

 But I've heard that some people don't want to set up a shared archive area
 that can be accessed by both the master and the standby. For example, they
 feel it's too complex to configure an NFS server, or passwordless scp, just
 to share the archived WAL files.

 Currently we have to increase wal_keep_segments to work around that problem.
 But the pg_xlog disk space is usually small and not suited to keeping many
 WAL files, so we might not be able to increase wal_keep_segments far enough.

 If we allow the master to stream WAL files from the archive, we don't need
 to increase wal_keep_segments or set up such a complex configuration. So I
 think this idea is a useful option.

I'm not sure anyone other than yourself has endorsed this idea, but in
any case it seems off the critical path for getting this feature
committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] Sync Rep at Oct 5

2010-10-06 Thread Simon Riggs
On Tue, 2010-10-05 at 11:30 -0400, Steve Singer wrote:

 Also, on the topic of failover: how do we want to deal with the master
 failing over?  Say M -> {S1, S2}, M fails, and we promote S1 to M1.  Can
 M1 -> S2?  What if S2 was further along in processing than S1 when M
 failed?  I don't think we want to take on this complexity for 9.1, but
 this means that after M fails you won't have a synchronous replica until
 you rebuild or somehow reset S2.

Those are problems that can be resolved, but that is the current state.
The trick, I guess, is to promote the correct standby.

Those are generic issues, not related to any specific patch. Thanks for
keeping those issues in the limelight.

  == Path Minimization ==
 
  We want to be able to minimize and control the path of data transfer,
  * so that the current master doesn't have to initiate transfer to all
  dependent nodes, thereby reducing overhead on the master
  * so that if the path from the current master to a descendant is expensive we
  would minimize network costs.
 
  This requirement is commonly known as relaying.
 
  In its most simply stated form, we want one standby to be able to get
  WAL data from another standby, e.g. M -> S -> S. Stating the problem in
  that way misses out on the actual requirement, since people would like
  the arrangement to be robust in case of failures of M or any S. If we
  specify the exact arrangement of paths then we need to respecify the
  arrangement of paths if a server goes down.
 
 Are we going to allow these paths to be reconfigured on a live cluster?
 If we have M -> S1 -> S2 and we want to reconfigure S2 to read from M, then
 S2 needs to get the data that has already been committed on S1 from
 somewhere (either S1 or M).  This has solutions, but it adds to the
 complexity.  Maybe not for 9.1.

If you switch from M -> S1 -> S2 to M -> (S1, S2) it should work fine.
At the moment that needs a shutdown/restart, but that could easily be
done with a disconnect/reconnect following a file reload.
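
Roughly, the reconfiguration on S2 is just a change of its upstream in
recovery.conf (a sketch; hostnames are placeholders), plus the restart
that is currently needed:

  # recovery.conf on S2, switching its upstream from S1 to M
  standby_mode = 'on'
  # was: primary_conninfo = 'host=s1.example.com ...'
  primary_conninfo = 'host=master.example.com port=5432 user=replication'

  $ pg_ctl restart -D $PGDATA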

The problem is how much WAL is stored on (any) node. Currently that is
wal_keep_segments, which doesn't work very well, but I've seen no better
ideas that cover all important cases.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services





Re: [HACKERS] Sync Rep at Oct 5

2010-10-05 Thread Steve Singer

On 10-10-05 04:32 AM, Simon Riggs wrote:


This is an attempt to compile everybody's stated viewpoints and come to
an understanding about where we are and where we want to go. The idea
from here is that we discuss what we are trying to achieve
(requirements) and then later come back to how (design).


Great start on summarizing the discussions.  Getting a summary of the 
requirements in one place will help people who haven't been diligent in 
following all the sync-rep email threads stay involved.


snip


== Failover Configuration Minimisation ==

An important aspect of robustness is the ability to specify a
configuration that will remain in place even when one or more servers
have gone down.

It is desirable to specify sync rep requirements such that we do not
refer to individual servers, if possible. Each such rule necessarily
requires an else condition, possibly multiple else conditions.

It is desirable to avoid both of these:
* the need to have different configuration files on each node
* the need to have configurations that only become active in case of
failure. These are known to be hard to test and very likely to be
misconfigured in the event of failover. [I know a bank that was down for
a whole week when the standby server's config was wrong and had never
been fully tested. The error was simple and obvious, but the fault showed
itself as a sporadic error that was difficult to trace.]



Also, on the topic of failover: how do we want to deal with the master
failing over?  Say M -> {S1, S2}, M fails, and we promote S1 to M1.  Can
M1 -> S2?  What if S2 was further along in processing than S1 when M
failed?  I don't think we want to take on this complexity for 9.1, but
this means that after M fails you won't have a synchronous replica until
you rebuild or somehow reset S2.






== Sync Rep Performance ==

Sync Rep is a potential performance hit, and that hit is known to
increase as geographical distance increases.

We want to be able to specify the performance of some nodes so that we
have 4 levels of robustness:
async - doesn't wait for sync
recv - syncs when messages received by standby
fsync - syncs when messages written to disk by standby
apply - syncs when messages applied to standby


Will read-only queries running on a slave hold up transactions from 
being applied on that slave?   I suspect that for most people running 
with 'apply' they would want the answer to be 'no'.  Are we going to 
revisit the standby query cancellation discussion?





== Path Minimization ==

We want to be able to minimize and control the path of data transfer,
* so that the current master doesn't have to initiate transfer to all
dependent nodes, thereby reducing overhead on the master
* so that if the path from the current master to a descendant is expensive we
would minimize network costs.

This requirement is commonly known as relaying.

In its most simply stated form, we want one standby to be able to get
WAL data from another standby, e.g. M -> S -> S. Stating the problem in
that way misses out on the actual requirement, since people would like
the arrangement to be robust in case of failures of M or any S. If we
specify the exact arrangement of paths then we need to respecify the
arrangement of paths if a server goes down.


Are we going to allow these paths to be reconfigured on a live cluster?
If we have M -> S1 -> S2 and we want to reconfigure S2 to read from M, then
S2 needs to get the data that has already been committed on S1 from
somewhere (either S1 or M).  This has solutions, but it adds to the
complexity.  Maybe not for 9.1.








Re: [HACKERS] Sync Rep at Oct 5

2010-10-05 Thread Simon Riggs
On Tue, 2010-10-05 at 11:30 -0400, Steve Singer wrote:

 Will read-only queries running on a slave hold up transactions from 
 being applied on that slave?   I suspect that for most people running 
 with 'apply' they would want the answer to be 'no'.  Are we going to 
 revisit the standby query cancellation discussion?

Once we have a feedback channel from standby to master, it's a simple
matter to use that feedback to avoid many query cancellations.

That was the original plan for 9.0 but we changed from sync rep to
streaming rep so late in the cycle that there was no time to do it that
way.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services

