Re: [HACKERS] Streaming replication and a disk full in primary
On Tue, Apr 20, 2010 at 9:55 AM, Robert Haas robertmh...@gmail.com wrote: How about wal_keep_segments? +1 Here's the patch. Seems OK. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Streaming replication and a disk full in primary
On Tue, Apr 20, 2010 at 5:53 AM, Fujii Masao masao.fu...@gmail.com wrote: On Tue, Apr 20, 2010 at 9:55 AM, Robert Haas robertmh...@gmail.com wrote: How about wal_keep_segments? +1 Here's the patch. Seems OK. Thanks, committed. ...Robert
Re: [HACKERS] Streaming replication and a disk full in primary
On Fri, Apr 16, 2010 at 9:47 PM, Robert Haas robertmh...@gmail.com wrote: On Thu, Apr 15, 2010 at 6:13 PM, Robert Haas robertmh...@gmail.com wrote: On Thu, Apr 15, 2010 at 2:54 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Robert Haas wrote: I've realized another problem with this patch. standby_keep_segments only controls the number of segments that we keep around for purposes of streaming: it doesn't affect archiving at all. And of course, a standby server based on archiving is every bit as much of a standby server as one that uses streaming replication. So at a minimum, the name of this GUC is very confusing. Hmm, I guess streaming_keep_segments would be more accurate. Somehow doesn't feel as good otherwise, though. Any other suggestions? I sort of feel like the correct description is something like num_extra_retained_wal_segments, but that's sort of long. The actual behavior is not tied to streaming, although the use case is. thinks more How about wal_keep_segments? Here's the patch. ...Robert
Re: [HACKERS] Streaming replication and a disk full in primary
On Thu, Apr 15, 2010 at 6:13 PM, Robert Haas robertmh...@gmail.com wrote: On Thu, Apr 15, 2010 at 2:54 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Robert Haas wrote: I've realized another problem with this patch. standby_keep_segments only controls the number of segments that we keep around for purposes of streaming: it doesn't affect archiving at all. And of course, a standby server based on archiving is every bit as much of a standby server as one that uses streaming replication. So at a minimum, the name of this GUC is very confusing. Hmm, I guess streaming_keep_segments would be more accurate. Somehow doesn't feel as good otherwise, though. Any other suggestions? I sort of feel like the correct description is something like num_extra_retained_wal_segments, but that's sort of long. The actual behavior is not tied to streaming, although the use case is. thinks more How about wal_keep_segments? ...Robert
Re: [HACKERS] Streaming replication and a disk full in primary
Robert Haas wrote: I've realized another problem with this patch. standby_keep_segments only controls the number of segments that we keep around for purposes of streaming: it doesn't affect archiving at all. And of course, a standby server based on archiving is every bit as much of a standby server as one that uses streaming replication. So at a minimum, the name of this GUC is very confusing. Hmm, I guess streaming_keep_segments would be more accurate. Somehow doesn't feel as good otherwise, though. Any other suggestions? We should also probably think a little bit about why we feel like it's OK to throw away data that is needed for SR to work, but we don't feel like we ever want to throw away WAL segments that we can't manage to archive. Failure to archive is considered more serious, because your continuous archiving backup becomes invalid if we delete a segment before it's archived. And a streaming standby server can catch up using the archive if it falls behind too much. Plus the primary doesn't know how many standby servers there are, so it doesn't know which segments are still needed for SR. In the department of minor nits, I also don't like the fact that the GUC is called standby_keep_segments and the variable is called StandbySegments. If we really have to capitalize them differently, we should at least make it StandbyKeepSegments, but personally I think we should use standby_keep_segments in both places so that it doesn't take quite so many greps to find all the references. Well, it's consistent with checkpoint_segments/CheckPointSegments. There is no consistent style on naming the global variables behind GUCs. If you feel like changing it though, I won't object. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Streaming replication and a disk full in primary
Robert Haas wrote: In the department of minor nits, I also don't like the fact that the GUC is called standby_keep_segments and the variable is called StandbySegments. If we really have to capitalize them differently, we should at least make it StandbyKeepSegments, but personally I think we should use standby_keep_segments in both places so that it doesn't take quite so many greps to find all the references. +1, using both names capitalized identically makes the code easier to navigate. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Re: [HACKERS] Streaming replication and a disk full in primary
On Thu, Apr 15, 2010 at 2:54 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Robert Haas wrote: I've realized another problem with this patch. standby_keep_segments only controls the number of segments that we keep around for purposes of streaming: it doesn't affect archiving at all. And of course, a standby server based on archiving is every bit as much of a standby server as one that uses streaming replication. So at a minimum, the name of this GUC is very confusing. Hmm, I guess streaming_keep_segments would be more accurate. Somehow doesn't feel as good otherwise, though. Any other suggestions? I sort of feel like the correct description is something like num_extra_retained_wal_segments, but that's sort of long. The actual behavior is not tied to streaming, although the use case is. ...Robert
Re: [HACKERS] Streaming replication and a disk full in primary
On Tue, Apr 13, 2010 at 11:56 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Robert Haas wrote: On Mon, Apr 12, 2010 at 6:41 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Why is standby_keep_segments used even if max_wal_senders is zero? In that case, ISTM we don't need to keep any WAL files in pg_xlog for the standby. True. I don't think we should second guess the admin on that, though. Perhaps he only set max_wal_senders=0 temporarily, and will be disappointed if the logs are no longer there when he sets it back to non-zero and restarts the server. If archive_mode is off and max_wal_senders = 0, then the WAL that's being generated won't be usable for streaming anyway, right? I think this is another manifestation of the problem I was complaining about over the weekend: there's no longer a single GUC that controls what type of information we emit as WAL. In previous releases, archive_mode served that function, but now it's much more complicated and, IMHO, not very comprehensible. http://archives.postgresql.org/pgsql-hackers/2010-04/msg00509.php Agreed. We've been trying to deduce from other settings what information needs to be WAL-logged, but it hasn't been a great success so it would be better to make it explicit than try to hide it. I've realized another problem with this patch. standby_keep_segments only controls the number of segments that we keep around for purposes of streaming: it doesn't affect archiving at all. And of course, a standby server based on archiving is every bit as much of a standby server as one that uses streaming replication. So at a minimum, the name of this GUC is very confusing. We should also probably think a little bit about why we feel like it's OK to throw away data that is needed for SR to work, but we don't feel like we ever want to throw away WAL segments that we can't manage to archive. 
In the department of minor nits, I also don't like the fact that the GUC is called standby_keep_segments and the variable is called StandbySegments. If we really have to capitalize them differently, we should at least make it StandbyKeepSegments, but personally I think we should use standby_keep_segments in both places so that it doesn't take quite so many greps to find all the references. ...Robert
Re: [HACKERS] Streaming replication and a disk full in primary
Robert Haas wrote: On Mon, Apr 12, 2010 at 6:41 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Why is standby_keep_segments used even if max_wal_senders is zero? In that case, ISTM we don't need to keep any WAL files in pg_xlog for the standby. True. I don't think we should second guess the admin on that, though. Perhaps he only set max_wal_senders=0 temporarily, and will be disappointed if the logs are no longer there when he sets it back to non-zero and restarts the server. If archive_mode is off and max_wal_senders = 0, then the WAL that's being generated won't be usable for streaming anyway, right? I think this is another manifestation of the problem I was complaining about over the weekend: there's no longer a single GUC that controls what type of information we emit as WAL. In previous releases, archive_mode served that function, but now it's much more complicated and, IMHO, not very comprehensible. http://archives.postgresql.org/pgsql-hackers/2010-04/msg00509.php Agreed. We've been trying to deduce from other settings what information needs to be WAL-logged, but it hasn't been a great success so it would be better to make it explicit than try to hide it. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Streaming replication and a disk full in primary
Fujii Masao wrote: doc/src/sgml/config.sgml -archival or to recover from a checkpoint. If standby_keep_segments +archival or to recover from a checkpoint. If varnamestandby_keep_segments/ The word standby_keep_segments always needs the varname tag, I think. Thanks, fixed. We should remove the document 25.2.5.2. Monitoring? I updated it to no longer claim that the primary can run out of disk space because of a hung WAL sender. The information about calculating the lag between primary and standby still seems valuable, so I didn't remove the whole section. Why is standby_keep_segments used even if max_wal_senders is zero? In that case, ISTM we don't need to keep any WAL files in pg_xlog for the standby. True. I don't think we should second guess the admin on that, though. Perhaps he only set max_wal_senders=0 temporarily, and will be disappointed if the logs are no longer there when he sets it back to non-zero and restarts the server. When XLogRead() reads two WAL files and only the older of them is recycled while being read, it might fail in checking whether the read data is valid. This is because the variable recptr can advance to the newer WAL file before the check. Thanks, fixed. When walreceiver has gotten stuck for some reason, walsender would be unable to pass through the send() system call, and also get stuck. In the patch, such a walsender can never exit because it cannot call XLogRead(). So I think that the bgwriter needs to send the exit-signal to such a too lagged walsender. Thought? Any backend can get stuck like that. The shmem of the latest recycled WAL file is updated before checking whether it's already been archived. If archiving is not working for some reason, the WAL file which that shmem indicates might not actually have been recycled yet. In this case, the standby cannot obtain the WAL file from the primary because it's been marked as latest recycled, and from the archive because it's not been archived yet. This seems to be a big problem. 
How about moving the update of the shmem to after calling XLogArchiveCheckDone() in RemoveOldXlogFiles()? Good point. It's particularly important considering that if a segment hasn't been archived yet, it's not available to the standby from the archive either. I changed that. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Streaming replication and a disk full in primary
On Mon, Apr 12, 2010 at 7:41 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: We should remove the document 25.2.5.2. Monitoring? I updated it to no longer claim that the primary can run out of disk space because of a hung WAL sender. The information about calculating the lag between primary and standby still seems valuable, so I didn't remove the whole section. Yes. ! An important health indicator of streaming replication is the amount ! of WAL records generated in the primary, but not yet applied in the ! standby. Since pg_last_xlog_receive_location doesn't let us know the WAL location not yet applied, we should use pg_last_xlog_replay_location instead. How about this?: An important health indicator of streaming replication is the amount of WAL records generated in the primary, but not yet applied in the standby. You can calculate this lag by comparing the current WAL write - location on the primary with the last WAL location received by the + location on the primary with the last WAL location replayed by the standby. They can be retrieved using functionpg_current_xlog_location/ on the primary and the - functionpg_last_xlog_receive_location/ on the standby, + functionpg_last_xlog_replay_location/ on the standby, respectively (see xref linkend=functions-admin-backup-table and xref linkend=functions-recovery-info-table for details). - The last WAL receive location in the standby is also displayed in the - process status of the WAL receiver process, displayed using the - commandps/ command (see xref linkend=monitoring-ps for details). /para /sect3 Why is standby_keep_segments used even if max_wal_senders is zero? In that case, ISTM we don't need to keep any WAL files in pg_xlog for the standby. True. I don't think we should second guess the admin on that, though. Perhaps he only set max_wal_senders=0 temporarily, and will be disappointed if the logs are no longer there when he sets it back to non-zero and restarts the server. OK. 
Since the behavior is not intuitive for me, I'd like to add a note at the end of the description of standby_keep_segments. How about?: This setting has effect even if max_wal_senders is zero. When walreceiver has gotten stuck for some reason, walsender would be unable to pass through the send() system call, and also get stuck. In the patch, such a walsender can never exit because it cannot call XLogRead(). So I think that the bgwriter needs to send the exit-signal to such a too lagged walsender. Thought? Any backend can get stuck like that. OK.
+ },
+
+ {
+ {standby_keep_segments, PGC_SIGHUP, WAL_CHECKPOINTS,
+ gettext_noop(Sets the number of WAL files held for standby servers),
+ NULL
+ },
+ StandbySegments,
+ 0, 0, INT_MAX, NULL, NULL
We should s/WAL_CHECKPOINTS/WAL_REPLICATION ? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] Streaming replication and a disk full in primary
On Mon, Apr 12, 2010 at 6:41 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Why is standby_keep_segments used even if max_wal_senders is zero? In that case, ISTM we don't need to keep any WAL files in pg_xlog for the standby. True. I don't think we should second guess the admin on that, though. Perhaps he only set max_wal_senders=0 temporarily, and will be disappointed if the logs are no longer there when he sets it back to non-zero and restarts the server. If archive_mode is off and max_wal_senders = 0, then the WAL that's being generated won't be usable for streaming anyway, right? I think this is another manifestation of the problem I was complaining about over the weekend: there's no longer a single GUC that controls what type of information we emit as WAL. In previous releases, archive_mode served that function, but now it's much more complicated and, IMHO, not very comprehensible. http://archives.postgresql.org/pgsql-hackers/2010-04/msg00509.php ...Robert
Re: [HACKERS] Streaming replication and a disk full in primary
Thanks for the great patch! I apologize for leaving the issue half-finished for a long time :( On Wed, Apr 7, 2010 at 7:02 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: In your version of this patch, the default was still the current behavior where the primary retains WAL files that are still needed by connected standby servers indefinitely. I think that's a dangerous default, so I changed it so that if you don't set standby_keep_segments, the primary doesn't retain any extra segments; the number of WAL segments available for standby servers is determined only by the location of the previous checkpoint, and the status of WAL archiving. That makes the code a bit simpler too, as we never care how far the walsenders are. In fact, the GetOldestWALSenderPointer() function is now dead code. It's OK for me to change the default behavior. We can remove the GetOldestWALSenderPointer() function. doc/src/sgml/config.sgml -archival or to recover from a checkpoint. If standby_keep_segments +archival or to recover from a checkpoint. If varnamestandby_keep_segments/ The word standby_keep_segments always needs the varname tag, I think. We should remove the document 25.2.5.2. Monitoring? Why is standby_keep_segments used even if max_wal_senders is zero? In that case, ISTM we don't need to keep any WAL files in pg_xlog for the standby. When XLogRead() reads two WAL files and only the older of them is recycled while being read, it might fail in checking whether the read data is valid. This is because the variable recptr can advance to the newer WAL file before the check. When walreceiver has gotten stuck for some reason, walsender would be unable to pass through the send() system call, and also get stuck. In the patch, such a walsender can never exit because it cannot call XLogRead(). So I think that the bgwriter needs to send the exit-signal to such a too lagged walsender. Thought? 
The shmem of the latest recycled WAL file is updated before checking whether it's already been archived. If archiving is not working for some reason, the WAL file which that shmem indicates might not actually have been recycled yet. In this case, the standby cannot obtain the WAL file from the primary because it's been marked as latest recycled, and from the archive because it's not been archived yet. This seems to be a big problem. How about moving the update of the shmem to after calling XLogArchiveCheckDone() in RemoveOldXlogFiles()? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] Streaming replication and a disk full in primary
This task has been languishing for a long time, so I took a shot at it. I took the approach I suggested before, keeping a variable in shared memory to track the latest removed WAL segment. After walsender has read a bunch of WAL records from a WAL file, it checks that what it read is after the latest removed WAL segment, otherwise the data it read might have come from a file that was already recycled and overwritten with new data, and an error is thrown. This changes the behavior so that if a standby server doing streaming replication falls behind too much, the primary will remove/recycle a WAL segment needed by the standby server. The previous behavior was that WAL segments still needed by any connected standby server were never removed, at the risk of filling the disk in the primary if a standby server behaves badly. In your version of this patch, the default was still the current behavior where the primary retains WAL files that are still needed by connected standby servers indefinitely. I think that's a dangerous default, so I changed it so that if you don't set standby_keep_segments, the primary doesn't retain any extra segments; the number of WAL segments available for standby servers is determined only by the location of the previous checkpoint, and the status of WAL archiving. That makes the code a bit simpler too, as we never care how far the walsenders are. In fact, the GetOldestWALSenderPointer() function is now dead code. Fujii Masao wrote: Thanks for the review! And, sorry for the delay. On Thu, Jan 21, 2010 at 11:10 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: I don't think we should do the check in XLogWrite(). There's really no reason to kill the standby connections before the next checkpoint, when the old WAL files are recycled. XLogWrite() is in the critical path of normal operations, too. OK. I'll remove that check from XLogWrite(). 
There's another important reason for that: If archiving is not working for some reason, the standby can't obtain the old segments from the archive either. If we refuse to stream such old segments, and they're not getting archived, the standby has no way to catch up until archiving is fixed. Allowing streaming of such old segments is free wrt. disk space, because we're keeping the files around anyway. OK. We should terminate the walsender whose currently-opened WAL file has been already archived, isn't required for crash recovery AND is 'max-lag' older than the currently-written one. I'll change so. Walreceiver will get an error if it tries to open a segment that's been deleted or recycled already. The dangerous situation we need to avoid is when walreceiver holds a file open while bgwriter recycles it. Walreceiver will merrily continue streaming data from it, even though it's been overwritten by new data already. s/walreceiver/walsender ? Yes, that's the problem that I'll have to fix. A straightforward fix is to keep a newest recycled XLogRecPtr in shared memory that RemoveOldXlogFiles() updates. Walreceiver checks it right after read()ing from a file, before sending it to the client, and throws an error if the data it read() was already recycled. I prefer this. But I don't think such an aggressive check of a newest recycled XLogRecPtr is required if the bgwriter never deletes the WAL file which is newer than or equal to the walsenders' oldest WAL file. In other words, the WAL files which the walsender is reading (or will read) are not removed at the moment. 
Regards, -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***
*** 1823,1828 archive_command = 'copy %p C:\\server\\archivedir\\%f' # Windows
--- 1823,1856
 /para
 /listitem
 /varlistentry
+
+ varlistentry id=guc-replication-lag-segments xreflabel=replication_lag_segments
+termvarnamestandby_keep_segments/varname (typeinteger/type)/term
+indexterm
+ primaryvarnamestandby_keep_segments/ configuration parameter/primary
+/indexterm
+listitem
+para
+ Specifies the number of log file segments kept in filenamepg_xlog/
+ directory, in case a standby server needs to fetch them via streaming
+ replication. Each segment is normally 16 megabytes. If a standby
+ server connected to the primary falls behind more than
+ varnamestandby_keep_segments/ segments, the primary might remove
+ a WAL segment still needed by the standby and the replication
+ connection will be terminated.
+
+ This sets only the minimum number of segments retained for standby
+ purposes, the system might need to retain more segments for WAL
+ archival or to recover from a checkpoint. If standby_keep_segments
+ is zero (the default), the system doesn't keep any extra segments
+ for
Re: [HACKERS] Streaming replication and a disk full in primary
On Wed, Apr 7, 2010 at 6:02 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: This task has been languishing for a long time, so I took a shot at it. I took the approach I suggested before, keeping a variable in shared memory to track the latest removed WAL segment. After walsender has read a bunch of WAL records from a WAL file, it checks that what it read is after the latest removed WAL segment, otherwise the data it read might have come from a file that was already recycled and overwritten with new data, and an error is thrown. This changes the behavior so that if a standby server doing streaming replication falls behind too much, the primary will remove/recycle a WAL segment needed by the standby server. The previous behavior was that WAL segments still needed by any connected standby server were never removed, at the risk of filling the disk in the primary if a standby server behaves badly. In your version of this patch, the default was still the current behavior where the primary retains WAL files that are still needed by connected standby servers indefinitely. I think that's a dangerous default, so I changed it so that if you don't set standby_keep_segments, the primary doesn't retain any extra segments; the number of WAL segments available for standby servers is determined only by the location of the previous checkpoint, and the status of WAL archiving. That makes the code a bit simpler too, as we never care how far the walsenders are. In fact, the GetOldestWALSenderPointer() function is now dead code. This seems like a very useful feature, but I can't speak to the code quality without a good deal more study. ...Robert
Re: [HACKERS] Streaming replication and a disk full in primary
Thanks for the review! And, sorry for the delay. On Thu, Jan 21, 2010 at 11:10 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: I don't think we should do the check in XLogWrite(). There's really no reason to kill the standby connections before the next checkpoint, when the old WAL files are recycled. XLogWrite() is in the critical path of normal operations, too. OK. I'll remove that check from XLogWrite(). There's another important reason for that: If archiving is not working for some reason, the standby can't obtain the old segments from the archive either. If we refuse to stream such old segments, and they're not getting archived, the standby has no way to catch up until archiving is fixed. Allowing streaming of such old segments is free wrt. disk space, because we're keeping the files around anyway. OK. We should terminate the walsender whose currently-opened WAL file has been already archived, isn't required for crash recovery AND is 'max-lag' older than the currently-written one. I'll change so. Walreceiver will get an error if it tries to open a segment that's been deleted or recycled already. The dangerous situation we need to avoid is when walreceiver holds a file open while bgwriter recycles it. Walreceiver will merrily continue streaming data from it, even though it's been overwritten by new data already. s/walreceiver/walsender ? Yes, that's the problem that I'll have to fix. A straightforward fix is to keep a newest recycled XLogRecPtr in shared memory that RemoveOldXlogFiles() updates. Walreceiver checks it right after read()ing from a file, before sending it to the client, and throws an error if the data it read() was already recycled. I prefer this. But I don't think such an aggressive check of a newest recycled XLogRecPtr is required if the bgwriter never deletes the WAL file which is newer than or equal to the walsenders' oldest WAL file. 
In other words, the WAL files which the walsender is reading (or will read) are not removed at the moment. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
[HACKERS] Streaming replication and a disk full in primary
Hi, If the primary has a connected standby, the WAL files required for the standby cannot be deleted. So if it has fallen too far behind for some reason, a disk full failure might occur on the primary. This is one of the problems that should be fixed for v9.0. We can cope with that case by carefully monitoring the standby lag. In addition to this, I think that we should put an upper limit on the number of WAL files held in pg_xlog for the standby (i.e., the maximum delay of the standby) as a safeguard against a disk full error. The attached patch introduces a new GUC 'replication_lag_segments' which specifies the maximum number of WAL files held in pg_xlog to send to the standby. The replication to the standby which falls more than the upper limit behind is automatically terminated, which would avoid a disk full error on the primary. This GUC is also useful to hold some WAL files for the incoming standby. This would avoid the problem that a WAL file required for the standby doesn't exist in the primary at the start of replication, to some extent. The code is also available in the 'replication' branch in my git repository. git://git.postgresql.org/git/users/fujii/postgres.git branch: replication Comment? Objection? Review? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***
*** 1789,1794 archive_command = 'copy %p C:\\server\\archivedir\\%f' # Windows
--- 1789,1812
 /para
 /listitem
 /varlistentry
+ varlistentry id=guc-replication-lag-segments xreflabel=replication_lag_segments
+termvarnamereplication_lag_segments/varname (typeinteger/type)/term
+indexterm
+ primaryvarnamereplication_lag_segments/ configuration parameter/primary
+/indexterm
+listitem
+para
+ Specifies the maximum number of log file segments held in filenamepg_xlog/
+ directory to send to the standby server (each segment is normally 16 megabytes). 
+ The replication to the standby server which falls more than varname
+ replication_lag_segments/ behind is terminated. This is useful for
+ avoiding a disk full error on the primary and holding the segments required for
+ the incoming standby server. The default value is zero, which disables that
+ upper limit. This parameter can only be set in the filenamepostgresql.conf/
+ file or on the server command line.
+/para
+/listitem
+ /varlistentry
 /variablelist
 /sect2
 sect2 id=runtime-config-standby
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***
*** 1725,1730 XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
--- 1725,1736
  if (XLogArchivingActive())
  XLogArchiveNotifySeg(openLogId, openLogSeg);
+ /*
+ * Check for the standbys' delay and terminate replication
+ * if needed.
+ */
+ CheckStandbysDelay(LogwrtResult.Write);
+
  Write->lastSegSwitchTime = (pg_time_t) time(NULL);
  /*
***
*** 7213,7240 CreateCheckPoint(int flags)
  smgrpostckpt();
  /*
! * If there's connected standby servers doing XLOG streaming, don't
! * delete XLOG files that have not been streamed to all of them yet.
! * This does nothing to prevent them from being deleted when the
! * standby is disconnected (e.g because of network problems), but at
! * least it avoids an open replication connection from failing because
  * of that.
  */
  if ((_logId || _logSeg) && MaxWalSenders > 0)
  {
- XLogRecPtr oldest;
  uint32 log;
  uint32 seg;
! oldest = GetOldestWALSendPointer();
! if (oldest.xlogid != 0 || oldest.xrecoff != 0)
  {
! XLByteToSeg(oldest, log, seg);
! if (log < _logId || (log == _logId && seg < _logSeg))
! {
! _logId = log;
! _logSeg = seg;
! }
  }
  }
--- 7219,7260
  smgrpostckpt();
  /*
! * Don't delete XLOG files which could be still required for
! * connected or incoming standbys, under the given upper limit.
! * This avoids a replication connection from failing because
  * of that.
  */
  if ((_logId || _logSeg) && MaxWalSenders > 0)
  {
  uint32 log;
  uint32 seg;
+ bool need_comp = true;
!
! if (RepLagSegs > 0)
  {
! /*
! * Ensure that there is no too lagged standbys before
! * deleting XLOG files.
! */
! CheckStandbysDelay(recptr);
! XLByteToSeg(recptr, log, seg);
! PrevLogSegs(log, seg, RepLagSegs - 1);
! }
! else
! {
! XLogRecPtr oldest;
!
! oldest = GetOldestWALSendPointer();
! if (oldest.xlogid == 0 || oldest.xrecoff == 0)
! need_comp = false;
! else
! XLByteToSeg(oldest, log, seg);
! }
!
! if (need_comp && (log < _logId || (log == _logId && seg < _logSeg)))
! {
! _logId = log;
! _logSeg = seg;
  }
  }
Re: [HACKERS] Streaming replication and a disk full in primary
Fujii Masao wrote: If the primary has a connected standby, the WAL files required for the standby cannot be deleted. So if the standby has fallen too far behind for some reason, a disk full failure might occur on the primary. This is one of the problems that should be fixed for v9.0. We can cope with that case by carefully monitoring the standby lag. In addition to this, I think that we should put an upper limit on the number of WAL files held in pg_xlog for the standby (i.e., the maximum delay of the standby) as a safeguard against a disk full error. The attached patch introduces a new GUC 'replication_lag_segments' which specifies the maximum number of WAL files held in pg_xlog to send to the standby. Replication to a standby which falls more than the upper limit behind is automatically terminated, which avoids a disk full error on the primary.

Thanks! I don't think we should do the check in XLogWrite(). There's really no reason to kill the standby connections before the next checkpoint, when the old WAL files are recycled. XLogWrite() is in the critical path of normal operations, too.

There's another important reason for that: if archiving is not working for some reason, the standby can't obtain the old segments from the archive either. If we refuse to stream such old segments, and they're not getting archived, the standby has no way to catch up until archiving is fixed. Allowing streaming of such old segments is free wrt. disk space, because we're keeping the files around anyway.

Walreceiver will get an error if it tries to open a segment that's been deleted or recycled already. The dangerous situation we need to avoid is when walreceiver holds a file open while bgwriter recycles it. Walreceiver would merrily continue streaming data from it, even though it has already been overwritten by new data. A straightforward fix is to keep the newest recycled XLogRecPtr in shared memory, which RemoveOldXlogFiles() updates.
Walreceiver checks it right after read()ing from a file, before sending the data to the client, and throws an error if what it read() has already been recycled.

Or you could do it entirely in walreceiver, by calling fstat() on the open file instead of checking the variable in shared memory. If the file isn't the one you expect, indicating that it's been recycled, throw an error. But that needs an extra fstat() call for every read().

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com