Re: [HACKERS] Sync Rep at Oct 5
On Wed, Oct 6, 2010 at 4:06 PM, Simon Riggs si...@2ndquadrant.com wrote:
> The problem is how much WAL is stored on (any) node. Currently that is
> wal_keep_segments, which doesn't work very well, but I've seen no better
> ideas that cover all important cases.

What about allowing the master to read and send WAL from the archive?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
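For reference, the wal_keep_segments approach being criticised here is just a GUC on the master; a minimal sketch, with a purely illustrative value:

    # postgresql.conf on the master (9.0-era streaming replication).
    # Keep extra WAL segments in pg_xlog so a standby that falls behind can
    # still catch up over its walsender connection.  Segments are 16 MB by
    # default, so 32 segments pins roughly 512 MB of pg_xlog.
    wal_keep_segments = 32

The weakness under discussion is that this is a fixed guess: too small and a slow standby falls off the end of pg_xlog, too large and pg_xlog may fill the often small disk it lives on.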
Re: [HACKERS] Sync Rep at Oct 5
On 10-10-07 05:52 AM, Fujii Masao wrote:
> On Wed, Oct 6, 2010 at 4:06 PM, Simon Riggs si...@2ndquadrant.com wrote:
>> The problem is how much WAL is stored on (any) node. Currently that is
>> wal_keep_segments, which doesn't work very well, but I've seen no better
>> ideas that cover all important cases.
>
> What about allowing the master to read and send WAL from the archive?

Then you have to deal with telling the archive how long it needs to keep WAL segments, because the master might ask for them back. If the archive is remote from the master then you have some extra network copying going on.

It would be better to let the slave that is being reconfigured read the missing WAL from the archive.
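A sketch of the standby-side setup Steve is describing, assuming a 9.0-era recovery.conf and a shared archive mounted at a placeholder path; with both settings present the standby fills any gap from the archive via restore_command and then streams from the master:

    # recovery.conf on the standby (hostnames and paths are placeholders)
    standby_mode     = 'on'
    primary_conninfo = 'host=master.example.com port=5432 user=replication'
    restore_command  = 'cp /mnt/wal_archive/%f "%p"'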
Re: [HACKERS] Sync Rep at Oct 5
On Thu, Oct 7, 2010 at 8:46 PM, Steve Singer ssin...@ca.afilias.info wrote:
> Then you have to deal with telling the archive how long it needs to keep
> WAL segments, because the master might ask for them back.

Yeah, it's not easy to determine how long we should keep the archived WAL files. We need to calculate which WAL files are deletable according to the progress of each standby and the state of the stored base backups which might still be used for PITR.

> If the archive is remote from the master then you have some extra network
> copying going on.

Yep, so I think that the master (i.e., walsender) should read the archived WAL files by using the restore_command specified by users. If the archive is remote from the master, then we would need to specify something like scp in restore_command. Also, even if you compress the archived WAL files by using pg_compress, the master can decompress them by using pg_decompress in restore_command and transfer them.

> It would be better to let the slave that is being reconfigured read the
> missing WAL from the archive.

That's one of the choices. But I've heard that some people don't want to set up a shared archive area which can be accessed by both the master and the standby. For example, they feel that it's complex to configure an NFS server or automatic scp without a password for sharing the archived WAL files.

Currently we have to increase wal_keep_segments to work around that problem. But the pg_xlog disk space is usually small and not suitable for keeping many WAL files, so we might be unable to increase wal_keep_segments. If we allow the master to stream WAL files from the archive, we don't need to increase wal_keep_segments or set up such a complex configuration. So this idea is one of the useful choices, I think.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
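As a concrete illustration of the restore_command Fujii mentions for a remote archive, something like the following could be used (hypothetical host and path; it depends on exactly the passwordless-ssh setup he notes people find complex):

    # restore_command fetching archived WAL from a remote archive host over
    # scp.  %f is the WAL file name, %p the destination path supplied by the
    # server.
    restore_command = 'scp archive.example.com:/wal_archive/%f "%p"'

Whether this command runs on a reconfigured standby (Steve's suggestion) or is invoked by the master's walsender (Fujii's proposal) is exactly the point under discussion.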
Re: [HACKERS] Sync Rep at Oct 5
On Thu, Oct 7, 2010 at 9:08 AM, Fujii Masao masao.fu...@gmail.com wrote:
> On Thu, Oct 7, 2010 at 8:46 PM, Steve Singer ssin...@ca.afilias.info wrote:
>> Then you have to deal with telling the archive how long it needs to keep
>> WAL segments, because the master might ask for them back.
>
> Yeah, it's not easy to determine how long we should keep the archived WAL
> files. We need to calculate which WAL files are deletable according to the
> progress of each standby and the state of the stored base backups which
> might still be used for PITR.
>
>> If the archive is remote from the master then you have some extra network
>> copying going on.
>
> Yep, so I think that the master (i.e., walsender) should read the archived
> WAL files by using the restore_command specified by users. If the archive
> is remote from the master, then we would need to specify something like
> scp in restore_command. Also, even if you compress the archived WAL files
> by using pg_compress, the master can decompress them by using
> pg_decompress in restore_command and transfer them.
>
>> It would be better to let the slave that is being reconfigured read the
>> missing WAL from the archive.
>
> That's one of the choices. But I've heard that some people don't want to
> set up a shared archive area which can be accessed by both the master and
> the standby. For example, they feel that it's complex to configure an NFS
> server or automatic scp without a password for sharing the archived WAL
> files.
>
> Currently we have to increase wal_keep_segments to work around that
> problem. But the pg_xlog disk space is usually small and not suitable for
> keeping many WAL files, so we might be unable to increase
> wal_keep_segments. If we allow the master to stream WAL files from the
> archive, we don't need to increase wal_keep_segments or set up such a
> complex configuration. So this idea is one of the useful choices, I think.

I'm not sure anyone other than yourself has endorsed this idea, but in any case it seems off the critical path for getting this feature committed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: [HACKERS] Sync Rep at Oct 5
On Tue, 2010-10-05 at 11:30 -0400, Steve Singer wrote:
> Also on the topic of failover, how do we want to deal with the master
> failing over? Say M -> {S1, S2} and M fails and we promote S1 to M1. Can
> M1 -> S2? What if S2 was further along in processing than S1 when M
> failed? I don't think we want to take on this complexity for 9.1, but this
> means that after M fails you won't have a synchronous replica until you
> rebuild or somehow reset S2.

Those are problems that can be resolved, but that is the current state. The trick, I guess, is to promote the correct standby. Those are generic issues, not related to any specific patch. Thanks for keeping those issues in the limelight.

>> == Path Minimization ==
>>
>> We want to be able to minimize and control the path of data transfer,
>> * so that the current master doesn't have to initiate transfer to all
>>   dependent nodes, thereby reducing overhead on the master
>> * so that if the path from the current master to a descendant is
>>   expensive, we can minimize network costs.
>>
>> This requirement is commonly known as relaying. In its most simply stated
>> form, we want one standby to be able to get WAL data from another
>> standby, e.g. M -> S -> S.
>>
>> Stating the problem in that way misses out on the actual requirement,
>> since people would like the arrangement to be robust in case of failures
>> of M or any S. If we specify the exact arrangement of paths then we need
>> to respecify the arrangement of paths if a server goes down.
>
> Are we going to allow these paths to be reconfigured on a live cluster? If
> we have M -> S1 -> S2 and we want to reconfigure S2 to read from M then S2
> needs to get the data that has already been committed on S1 from somewhere
> (either S1 or M). This has solutions but it adds to the complexity. Maybe
> not for 9.1.

If you switch from M -> S1 -> S2 to M -> (S1, S2) it should work fine. At the moment that needs a shutdown/restart, but that could easily be done with a disconnect/reconnect following a file reload.

The problem is how much WAL is stored on (any) node. Currently that is wal_keep_segments, which doesn't work very well, but I've seen no better ideas that cover all important cases.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
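A sketch of what repointing S2 in the M -> S1 -> S2 case might look like under the mechanics assumed here (hostnames and paths are placeholders; at the time of this thread changing primary_conninfo required a standby restart rather than the reload Simon is hoping for):

    # recovery.conf on S2, changed to stream from M instead of S1
    standby_mode     = 'on'
    primary_conninfo = 'host=M.example.com port=5432 user=replication'
    # restore_command lets S2 fetch any WAL it is missing from the archive,
    # which is the "how much WAL is stored on (any) node" problem above.
    restore_command  = 'cp /mnt/wal_archive/%f "%p"'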
Re: [HACKERS] Sync Rep at Oct 5
On 10-10-05 04:32 AM, Simon Riggs wrote:
> This is an attempt to compile everybody's stated viewpoints and come to an
> understanding about where we are and where we want to go. The idea from
> here is that we discuss what we are trying to achieve (requirements) and
> then later come back to how (design).

Great start on summarizing the discussions. Getting a summary of the requirements in one place will help people who haven't been diligent in following all the sync-rep email threads stay involved.

[snip]

> == Failover Configuration Minimisation ==
>
> An important aspect of robustness is the ability to specify a
> configuration that will remain in place even though 1 or more servers
> have gone down. It is desirable to specify sync rep requirements such
> that we do not refer to individual servers, if possible. Each such rule
> necessarily requires an else condition, possibly multiple else conditions.
>
> It is desirable to avoid both of these:
> * the need to have different configuration files on each node
> * the need to have configurations that only become active in case of
>   failure. These are known to be hard to test and very likely to be
>   misconfigured in the event of failover.
>
> [I know a bank that was down for a whole week when the standby server's
> config was wrong and had never been fully tested. The error was simple
> and obvious, but the fault showed itself as a sporadic error that was
> difficult to trace]

Also on the topic of failover, how do we want to deal with the master failing over? Say M -> {S1, S2} and M fails and we promote S1 to M1. Can M1 -> S2? What if S2 was further along in processing than S1 when M failed? I don't think we want to take on this complexity for 9.1, but this means that after M fails you won't have a synchronous replica until you rebuild or somehow reset S2.

> == Sync Rep Performance ==
>
> Sync Rep is a potential performance hit, and that hit is known to
> increase as geographical distance increases. We want to be able to
> specify the performance of some nodes so that we have 4 levels of
> robustness:
>
> async - doesn't wait for sync
> recv  - syncs when messages received by standby
> fsync - syncs when messages written to disk by standby
> apply - syncs when messages applied to standby

Will read-only queries running on a slave hold up transactions from being applied on that slave? I suspect that for most people running with 'apply' they would want the answer to be 'no'. Are we going to revisit the standby query cancellation discussion?

> == Path Minimization ==
>
> We want to be able to minimize and control the path of data transfer,
> * so that the current master doesn't have to initiate transfer to all
>   dependent nodes, thereby reducing overhead on the master
> * so that if the path from the current master to a descendant is
>   expensive, we can minimize network costs.
>
> This requirement is commonly known as relaying. In its most simply stated
> form, we want one standby to be able to get WAL data from another
> standby, e.g. M -> S -> S.
>
> Stating the problem in that way misses out on the actual requirement,
> since people would like the arrangement to be robust in case of failures
> of M or any S. If we specify the exact arrangement of paths then we need
> to respecify the arrangement of paths if a server goes down.

Are we going to allow these paths to be reconfigured on a live cluster? If we have M -> S1 -> S2 and we want to reconfigure S2 to read from M then S2 needs to get the data that has already been committed on S1 from somewhere (either S1 or M). This has solutions but it adds to the complexity. Maybe not for 9.1.
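The four levels in Simon's summary had no settled syntax at this point in the discussion; for orientation only, they map roughly onto the synchronous_commit values that eventually shipped (remote_write in 9.2, remote_apply in 9.6), sketched here with a placeholder standby name:

    # postgresql.conf on the master, using the later-released syntax
    synchronous_standby_names = 'standby1' # which standbys count as synchronous
    synchronous_commit = on             # "fsync": wait for WAL flush on standby
    #synchronous_commit = remote_write  # close to "recv": wait for write only
    #synchronous_commit = remote_apply  # "apply": wait until visible on standby
    #synchronous_commit = local         # "async" with respect to the standby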
Re: [HACKERS] Sync Rep at Oct 5
On Tue, 2010-10-05 at 11:30 -0400, Steve Singer wrote:
> Will read-only queries running on a slave hold up transactions from being
> applied on that slave? I suspect that for most people running with 'apply'
> they would want the answer to be 'no'. Are we going to revisit the standby
> query cancellation discussion?

Once we have a feedback channel from standby to master, it's a simple matter to add some feedback to avoid many query cancellations. That was the original plan for 9.0, but we changed from sync rep to streaming rep so late in the cycle that there was no time to do it that way.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
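At the time of this exchange the standby-side knobs for the trade-off Steve describes were the delay settings below (values illustrative only); the feedback channel Simon refers to later appeared as hot_standby_feedback:

    # postgresql.conf on the standby (9.0-era hot standby settings)
    hot_standby = on
    max_standby_streaming_delay = 30s  # how long a query may hold up WAL replay
    max_standby_archive_delay   = 30s  # same, for WAL restored from the archive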