Re: [HACKERS] Keepalive for max_standby_delay
Heikki Linnakangas writes:
> On 03/07/10 18:32, Tom Lane wrote:
>> That would not do what you want at all in the case where you're
>> recovering from archive --- XLogReceiptTime would never advance
>> at all for the duration of the recovery.

> Do you mean when using something like pg_standby, which does the waiting
> itself?

No, I'm thinking about recovery starting from an old base backup, or any
situation where you're trying to process a significant number of archived
WAL segments as quickly as possible.

>> It might be useful if you knew that it was a standby-with-log-shipping
>> situation, but we have no way to tell the difference.

> With pg_standby etc. you use standby_mode=off. Same with traditional
> archive recovery. In standby mode, it's on.

Uh, no, that parameter is not what I'm talking about. What I tried to say
is that if you're using log shipping for replication instead of
walsender/walreceiver, you might want to treat data as live even though
the database thinks it is being pulled "from archive". But we'd need a
way for the database to tell the cases apart ... right now it cannot.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On 03/07/10 18:32, Tom Lane wrote:
> Heikki Linnakangas writes:
>> It would seem logical to use the same logic for archive recovery as we
>> do for streaming replication, and only set XLogReceiptTime when you
>> have to wait for a WAL segment to arrive into the archive, ie. when
>> restore_command fails.
>
> That would not do what you want at all in the case where you're
> recovering from archive --- XLogReceiptTime would never advance at all
> for the duration of the recovery.

Do you mean when using something like pg_standby, which does the waiting
itself?

> It might be useful if you knew that it was a standby-with-log-shipping
> situation, but we have no way to tell the difference.

With pg_standby etc. you use standby_mode=off. Same with traditional
archive recovery. In standby mode, it's on.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Keepalive for max_standby_delay
Heikki Linnakangas writes:
> It would seem logical to use the same logic for archive recovery as we
> do for streaming replication, and only set XLogReceiptTime when you have
> to wait for a WAL segment to arrive into the archive, ie. when
> restore_command fails.

That would not do what you want at all in the case where you're
recovering from archive --- XLogReceiptTime would never advance at all
for the duration of the recovery.

It might be useful if you knew that it was a standby-with-log-shipping
situation, but we have no way to tell the difference.

The restore_command might know the difference, though. Is it worth
modifying its API somehow so that the command could tell us whether to
advance XLogReceiptTime? How would we do that?

regards, tom lane
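For context on why the restore_command API would need extending: today the command is an opaque shell invocation that communicates back to the server only through its exit status. A typical recovery.conf entry looks like this (the archive path is illustrative):

```
# recovery.conf on the standby
# %f = requested WAL file name, %p = destination path
restore_command = 'cp /mnt/server/archivedir/%f "%p"'
```

There is no existing channel for the command to report anything richer than success or failure, which is why signalling "advance XLogReceiptTime" (or not) would require an API change along the lines Tom is wondering about.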
Re: [HACKERS] Keepalive for max_standby_delay
On 02/07/10 23:36, Tom Lane wrote:
> Robert Haas writes:
>> I haven't been able to wrap my head around why the delay should be
>> LESS in the archive case than in the streaming case. Can you attempt
>> to hit me with the clue-by-four?
>
> In the archive case, you're presumably trying to catch up, and so it
> makes sense to kill queries faster so you can catch up.
>
> The existing code essentially forces instant kill when reading from
> archive, for any reasonable value of max_standby_delay (because the
> archived timestamps will probably be older than that). That's
> overenthusiastic in my view, but you can get that behavior if you want
> it with this patch by setting max_standby_archive_delay to zero. If you
> don't want it, you can use something larger. If you don't think that
> max_standby_archive_delay should be less than
> max_standby_streaming_delay, you can set them the same, too.

It would seem logical to use the same logic for archive recovery as we do
for streaming replication, and only set XLogReceiptTime when you have to
wait for a WAL segment to arrive into the archive, ie. when
restore_command fails.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Keepalive for max_standby_delay
On Fri, Jul 2, 2010 at 4:52 PM, Tom Lane wrote:
> Robert Haas writes:
>> On Fri, Jul 2, 2010 at 4:36 PM, Tom Lane wrote:
>>> In the archive case, you're presumably trying to catch up, and so it
>>> makes sense to kill queries faster so you can catch up.
>
>> On the flip side, the timeout for the WAL segment is for 16MB of WAL,
>> whereas the timeout for SR is normally going to be for a much smaller
>> chunk (right?). So even with the same value for both, it seems like
>> queries will be killed more aggressively during archive recovery.
>
> True, but I suspect useful settings will be significantly larger than
> those times anyway, so it kind of comes out in the wash. For
> walreceiver the expected time to process each new chunk is less than
> wal_sender_delay (if it's not, you better get a faster slave machine).
> For the archive case, it probably takes more than 200ms to process a
> 16MB segment, but still only a few seconds. So if you have both the
> max-delay parameters set to 30 seconds, the minimum "normal" grace
> periods are 29.8 sec vs maybe 25 sec, not all that different.

I guess I was thinking more in terms of conflicts-per-WAL-record than
actual replay time. If we assume that every 100th WAL record produces a
conflict, SR will pause every once in a while, and then catch up (either
because it canceled some queries or because they finished within the
allowable grace period), whereas archive recovery will be patient for a
bit and then start nuking 'em from orbit. Maybe my mental model is all
wrong though.

Anyway, I don't object to having two separate settings, and maybe we'll
know more about how to tune them after this gets out in the field.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: [HACKERS] Keepalive for max_standby_delay
Robert Haas writes:
> On Fri, Jul 2, 2010 at 4:36 PM, Tom Lane wrote:
>> In the archive case, you're presumably trying to catch up, and so it
>> makes sense to kill queries faster so you can catch up.

> On the flip side, the timeout for the WAL segment is for 16MB of WAL,
> whereas the timeout for SR is normally going to be for a much smaller
> chunk (right?). So even with the same value for both, it seems like
> queries will be killed more aggressively during archive recovery.

True, but I suspect useful settings will be significantly larger than
those times anyway, so it kind of comes out in the wash. For walreceiver
the expected time to process each new chunk is less than wal_sender_delay
(if it's not, you better get a faster slave machine). For the archive
case, it probably takes more than 200ms to process a 16MB segment, but
still only a few seconds. So if you have both the max-delay parameters
set to 30 seconds, the minimum "normal" grace periods are 29.8 sec vs
maybe 25 sec, not all that different.

regards, tom lane
Re: [HACKERS] Keepalive for max_standby_delay
On Fri, Jul 2, 2010 at 4:36 PM, Tom Lane wrote:
> Robert Haas writes:
>> I haven't been able to wrap my head around why the delay should be
>> LESS in the archive case than in the streaming case. Can you attempt
>> to hit me with the clue-by-four?
>
> In the archive case, you're presumably trying to catch up, and so it
> makes sense to kill queries faster so you can catch up.

On the flip side, the timeout for the WAL segment is for 16MB of WAL,
whereas the timeout for SR is normally going to be for a much smaller
chunk (right?). So even with the same value for both, it seems like
queries will be killed more aggressively during archive recovery. Even
so, it seems useful to have both.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: [HACKERS] Keepalive for max_standby_delay
Robert Haas writes:
> I haven't been able to wrap my head around why the delay should be
> LESS in the archive case than in the streaming case. Can you attempt
> to hit me with the clue-by-four?

In the archive case, you're presumably trying to catch up, and so it
makes sense to kill queries faster so you can catch up.

The existing code essentially forces instant kill when reading from
archive, for any reasonable value of max_standby_delay (because the
archived timestamps will probably be older than that). That's
overenthusiastic in my view, but you can get that behavior if you want it
with this patch by setting max_standby_archive_delay to zero. If you
don't want it, you can use something larger. If you don't think that
max_standby_archive_delay should be less than max_standby_streaming_delay,
you can set them the same, too.

regards, tom lane
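The two choices Tom describes would look something like this in postgresql.conf (parameter names are from the patch under discussion; values are illustrative):

```
# Reproduce the old instant-kill behavior while replaying from archive,
# but keep a real grace period under streaming replication:
max_standby_archive_delay = 0
max_standby_streaming_delay = 30s

# Or, if you see no reason to treat the two cases differently,
# simply set them the same:
# max_standby_archive_delay = 30s
# max_standby_streaming_delay = 30s
```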
Re: [HACKERS] Keepalive for max_standby_delay
On Fri, Jul 2, 2010 at 4:11 PM, Tom Lane wrote:
> [ Apologies for the very slow turnaround on this --- I got hit with
> another batch of non-postgres security issues this week. ]
>
> Attached is a draft patch for revising the max_standby_delay behavior
> into something I think is a bit saner. There is some unfinished
> business:
>
> * I haven't touched the documentation yet
>
> * The code in xlog.c needs to be reverted to its former behavior so that
> recoveryLastXTime means what it's supposed to mean, ie just the last
> commit/abort timestamp.
>
> However neither of these affects the testability of the patch.
>
> Basically the way that it works is that the standby delay is computed
> with reference to XLogReceiptTime rather than any timestamp obtained
> from WAL. XLogReceiptTime is set to current time whenever we obtain a
> WAL segment from the archives or when we begin processing a fresh batch
> of WAL from walreceiver. There's a subtlety in the streaming case: we
> don't advance XLogReceiptTime if we are not caught up, that is if the
> startup process is more than one flush cycle behind walreceiver. In the
> normal case we'll advance XLogReceiptTime on each flush cycle, but once
> we start falling behind, it doesn't move so the grace time allotted to
> conflicting queries begins to decrease.
>
> I split max_standby_delay into two GUC variables, as previously
> threatened: max_standby_archive_delay and max_standby_streaming_delay.
> The former applies when processing data read from archive, the latter
> when processing data received from walreceiver. I think this is really
> quite important given the new behavior, because
> max_standby_archive_delay ought to be set with reference to the expected
> time to process one WAL segment, whereas max_standby_streaming_delay
> doesn't depend on that value at all. I'm not sure what good defaults are
> for these values, so I left them at 30 seconds for the moment. I'm
> inclined to think max_standby_archive_delay ought to be quite a bit less
> though.
>
> It might be worth adding a minimum-grace-time limit as we previously
> discussed, but this patch doesn't attempt to do that.
>
> Comments?

I haven't been able to wrap my head around why the delay should be LESS
in the archive case than in the streaming case. Can you attempt to hit me
with the clue-by-four?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: [HACKERS] Keepalive for max_standby_delay
[ Apologies for the very slow turnaround on this --- I got hit with
another batch of non-postgres security issues this week. ]

Attached is a draft patch for revising the max_standby_delay behavior
into something I think is a bit saner. There is some unfinished business:

* I haven't touched the documentation yet

* The code in xlog.c needs to be reverted to its former behavior so that
recoveryLastXTime means what it's supposed to mean, ie just the last
commit/abort timestamp.

However neither of these affects the testability of the patch.

Basically the way that it works is that the standby delay is computed
with reference to XLogReceiptTime rather than any timestamp obtained from
WAL. XLogReceiptTime is set to current time whenever we obtain a WAL
segment from the archives or when we begin processing a fresh batch of
WAL from walreceiver. There's a subtlety in the streaming case: we don't
advance XLogReceiptTime if we are not caught up, that is if the startup
process is more than one flush cycle behind walreceiver. In the normal
case we'll advance XLogReceiptTime on each flush cycle, but once we start
falling behind, it doesn't move so the grace time allotted to conflicting
queries begins to decrease.

I split max_standby_delay into two GUC variables, as previously
threatened: max_standby_archive_delay and max_standby_streaming_delay.
The former applies when processing data read from archive, the latter
when processing data received from walreceiver. I think this is really
quite important given the new behavior, because max_standby_archive_delay
ought to be set with reference to the expected time to process one WAL
segment, whereas max_standby_streaming_delay doesn't depend on that value
at all. I'm not sure what good defaults are for these values, so I left
them at 30 seconds for the moment. I'm inclined to think
max_standby_archive_delay ought to be quite a bit less though.
It might be worth adding a minimum-grace-time limit as we previously
discussed, but this patch doesn't attempt to do that.

Comments?

regards, tom lane

Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.427
diff -c -r1.427 xlog.c
*** src/backend/access/transam/xlog.c	28 Jun 2010 19:46:19 -	1.427
--- src/backend/access/transam/xlog.c	2 Jul 2010 20:01:24 -
***************
*** 72,78 ****
  bool		XLogArchiveMode = false;
  char	   *XLogArchiveCommand = NULL;
  bool		EnableHotStandby = false;
- int		MaxStandbyDelay = 30 * 1000;
  bool		fullPageWrites = true;
  bool		log_checkpoints = false;
  int		sync_method = DEFAULT_SYNC_METHOD;
--- 72,77 ----
***************
*** 487,493 ****
   * Keeps track of which sources we've tried to read the current WAL
   * record from and failed.
   */
! static int	failedSources = 0;
  
  /* Buffer for currently read page (XLOG_BLCKSZ bytes) */
  static char *readBuf = NULL;
--- 487,502 ----
   * Keeps track of which sources we've tried to read the current WAL
   * record from and failed.
   */
! static int	failedSources = 0;	/* OR of XLOG_FROM_* codes */
! 
! /*
!  * These variables track when we last obtained some WAL data to process,
!  * and where we got it from.  (XLogReceiptSource is initially the same as
!  * readSource, but readSource gets reset to zero when we don't have data
!  * to process right now.)
!  */
! static TimestampTz XLogReceiptTime = 0;
! static int	XLogReceiptSource = 0;	/* XLOG_FROM_* code */
  
  /* Buffer for currently read page (XLOG_BLCKSZ bytes) */
  static char *readBuf = NULL;
***************
*** 2626,2632 ****
   * Open a logfile segment for reading (during recovery).
   *
   * If source = XLOG_FROM_ARCHIVE, the segment is retrieved from archive.
!  * If source = XLOG_FROM_PG_XLOG, it's read from pg_xlog.
   */
  static int
  XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
--- 2635,2641 ----
   * Open a logfile segment for reading (during recovery).
   *
   * If source = XLOG_FROM_ARCHIVE, the segment is retrieved from archive.
!  * Otherwise, it's assumed to be already available in pg_xlog.
   */
  static int
  XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
***************
*** 2655,2660 ****
--- 2664,2670 ----
  			break;
  		case XLOG_FROM_PG_XLOG:
+ 		case XLOG_FROM_STREAM:
  			XLogFilePath(path, tli, log, seg);
  			restoredFromArchive = false;
  			break;
***************
*** 2674,2680 ****
--- 2684,2696 ----
  				 xlogfname);
  		set_ps_display(activitymsg, false);
  
+ 		/* Track source of data in assorted state variables */
  		readSource = source;
+ 		XLogReceiptSource = source;
+ 		/* In FROM_STREAM case, caller tracks receipt time, not me */
+ 		if (source != XLOG_FROM_STREAM)
+ 			XLogReceiptTime = GetCurrentTimestamp();
+ 
  		return fd;
  	}
  	if (errno != ENOENT || !notfoundOk)	/* unexpected failure? */
***************
*** 5568,5574 ****
  /*
Re: [HACKERS] Keepalive for max_standby_delay
Simon Riggs wrote:
> On Mon, 2010-06-28 at 10:09 -0700, Josh Berkus wrote:
>>> It will get done. It is not the very first thing on my to-do list.
>>
>> ??? What is then?
>>
>> If it's not the first thing on your priority list, with 9.0 getting
>> later by the day, maybe we should leave it to Robert and Simon, who
>> *do* seem to have it first on *their* list?
>>
>> I swear, when Simon was keeping his branch to himself in August
>> everyone was on his case. It sure seems like Tom is doing exactly the
>> same thing.
>
> Hmmm, yes, looks that way. At that time I was actively working on the
> code, not just locking it to prevent other activity.
>
> The only urgency on my part here was to fulfil my responsibility to the
> project.

Simon, you have a very legitimate concern. I phoned Tom and he is
planning to start working on the max_standby_delay tomorrow. I am unclear
how it is different from your version, but I hope once Tom is done we can
review his work and decide how to proceed. The fact that we allowed Tom
this huge amount of time to submit an alternative patch is unusual and
hopefully rare.

FYI, Tom and I are hoping to work through all the outstanding issues
before we package up 9.0 beta3 on Thursday, July 8.

--
Bruce Momjian  http://momjian.us
EnterpriseDB   http://enterprisedb.com

+ None of us is going to be here forever. +
Re: [HACKERS] Keepalive for max_standby_delay
On Mon, 2010-06-28 at 10:09 -0700, Josh Berkus wrote:
>> It will get done. It is not the very first thing on my to-do list.
>
> ??? What is then?
>
> If it's not the first thing on your priority list, with 9.0 getting
> later by the day, maybe we should leave it to Robert and Simon, who *do*
> seem to have it first on *their* list?
>
> I swear, when Simon was keeping his branch to himself in August everyone
> was on his case. It sure seems like Tom is doing exactly the same thing.

Hmmm, yes, looks that way. At that time I was actively working on the
code, not just locking it to prevent other activity.

The only urgency on my part here was to fulfil my responsibility to the
project.

--
Simon Riggs
www.2ndQuadrant.com
Re: [HACKERS] Keepalive for max_standby_delay
On Mon, Jun 28, 2010 at 2:26 PM, Tom Lane wrote:
> Robert Haas writes:
>> ... It is even more unreasonable to commit to
>> providing a timely patch (twice) and then fail to do so. We are
>> trying to finalize a release here, and you've made it clear you think
>> this code needs revision before then. I respect your opinion, but not
>> your right to make the project release timetable dependent on your own
>> schedule, and not your right to shut other people out of working on
>> the issues you've raised.
>
> Since nobody has put forward a proposed beta3 release date, I don't feel
> that I'm holding anything up. In the meantime, I have many
> responsibilities and am facing Red Hat internal deadlines.

See here, last paragraph:

http://archives.postgresql.org/pgsql-hackers/2010-06/msg01093.php

On a related note, this list is getting pretty darn short:

http://wiki.postgresql.org/wiki/PostgreSQL_9.0_Open_Items

...partly because, in my desire to get another beta out, I have been
devoting a lot of time to clearing it out.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: [HACKERS] Keepalive for max_standby_delay
Robert Haas writes:
> ... It is even more unreasonable to commit to
> providing a timely patch (twice) and then fail to do so. We are
> trying to finalize a release here, and you've made it clear you think
> this code needs revision before then. I respect your opinion, but not
> your right to make the project release timetable dependent on your own
> schedule, and not your right to shut other people out of working on
> the issues you've raised.

Since nobody has put forward a proposed beta3 release date, I don't feel
that I'm holding anything up. In the meantime, I have many
responsibilities and am facing Red Hat internal deadlines.

regards, tom lane
Re: [HACKERS] Keepalive for max_standby_delay
On Mon, Jun 28, 2010 at 10:19 AM, Tom Lane wrote:
> Simon Riggs writes:
>> On Wed, 2010-06-16 at 21:56 -0400, Tom Lane wrote:
>>> Sorry, I've been a bit distracted by other responsibilities (libtiff
>>> security issues for Red Hat, if you must know). I'll get on it
>>> shortly.
>
>> I don't think the PostgreSQL project should wait any longer on this. If
>> it does we risk loss of quality in final release, assuming no slippage.
>
> It will get done. It is not the very first thing on my to-do list.

That's apparent. On June 9th, you wrote "Yes, I'll get with it ..."; on
June 16th, you wrote "I'll get on it shortly." Two weeks later you're
backing off from "shortly" to "eventually". It is unfair to the community
to assert a vigorous opinion of how something should be handled in the
code and then refuse to commit to a time frame for providing a patch. It
is even more unreasonable to commit to providing a timely patch (twice)
and then fail to do so. We are trying to finalize a release here, and
you've made it clear you think this code needs revision before then. I
respect your opinion, but not your right to make the project release
timetable dependent on your own schedule, and not your right to shut
other people out of working on the issues you've raised.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: [HACKERS] Keepalive for max_standby_delay
> It will get done. It is not the very first thing on my to-do list.

??? What is then?

If it's not the first thing on your priority list, with 9.0 getting later
by the day, maybe we should leave it to Robert and Simon, who *do* seem
to have it first on *their* list?

I swear, when Simon was keeping his branch to himself in August everyone
was on his case. It sure seems like Tom is doing exactly the same thing.

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
Re: [HACKERS] Keepalive for max_standby_delay
Simon Riggs writes:
> On Wed, 2010-06-16 at 21:56 -0400, Tom Lane wrote:
>> Sorry, I've been a bit distracted by other responsibilities (libtiff
>> security issues for Red Hat, if you must know). I'll get on it shortly.

> I don't think the PostgreSQL project should wait any longer on this. If
> it does we risk loss of quality in final release, assuming no slippage.

It will get done. It is not the very first thing on my to-do list.

regards, tom lane
Re: [HACKERS] Keepalive for max_standby_delay
On Mon, Jun 28, 2010 at 3:17 AM, Simon Riggs wrote:
> On Wed, 2010-06-16 at 21:56 -0400, Tom Lane wrote:
>> Robert Haas writes:
>>> On Wed, Jun 9, 2010 at 8:01 PM, Tom Lane wrote:
>>>> Yes, I'll get with it ...
>>
>>> Any update on this?
>>
>> Sorry, I've been a bit distracted by other responsibilities (libtiff
>> security issues for Red Hat, if you must know). I'll get on it shortly.
>
> I don't think the PostgreSQL project should wait any longer on this. If
> it does we risk loss of quality in final release, assuming no slippage.

I agree, and had actually been intending to post a similar message in the
next day or two. We've been waiting for this for nearly a month.

> From here, I will rework my patch of 31 May to
> * use arrival time on standby as base for max_standby_delay

Assuming that by this you mean, in the case of SR, the time of receipt of
the current WAL chunk, and in the case of the archive, the time of its
acquisition from the archive, +1.

> * make delay apply to both streaming and file cases

+1.

> * min_standby_grace_period - min grace on every query, default 0

I could go either way on this. +0.5, I guess.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: [HACKERS] Keepalive for max_standby_delay
On Wed, 2010-06-16 at 21:56 -0400, Tom Lane wrote:
> Robert Haas writes:
>> On Wed, Jun 9, 2010 at 8:01 PM, Tom Lane wrote:
>>> Yes, I'll get with it ...
>
>> Any update on this?
>
> Sorry, I've been a bit distracted by other responsibilities (libtiff
> security issues for Red Hat, if you must know). I'll get on it shortly.

I don't think the PostgreSQL project should wait any longer on this. If
it does we risk loss of quality in final release, assuming no slippage.

From here, I will rework my patch of 31 May to
* use arrival time on standby as base for max_standby_delay
* make delay apply to both streaming and file cases
* min_standby_grace_period - min grace on every query, default 0

Decision time, so thoughts please?

--
Simon Riggs
www.2ndQuadrant.com
Re: [HACKERS] Keepalive for max_standby_delay
On Mon, Jun 21, 2010 at 12:20 AM, Ron Mayer wrote:
> Robert Haas wrote:
>> On Wed, Jun 16, 2010 at 9:56 PM, Tom Lane wrote:
>>> Sorry, I've been a bit distracted by other responsibilities (libtiff
>>> security issues for Red Hat, if you must know). I'll get on it
>>> shortly.
>>
>> What? You have other things to do besides hack on PostgreSQL?
>> Shocking! :-)
>
> I suspect you're kidding, but in case some on the list didn't realize,
> Tom's probably as famous (if not moreso) in the image compression
> community as he is in the database community:
>
> http://www.jpeg.org/jpeg/index.html
> "Probably the largest and most important contribution however was the
> work of the Independent JPEG Group (IJG), and Tom Lane in particular."
>
> http://www.w3.org/TR/PNG-Credits.html , http://www.w3.org/TR/PNG/
> "PNG (Portable Network Graphics) Specification
> Version 1.0
> ...
> Contributing Editor
> Tom Lane, t...@sss.pgh.pa.us"
>
> http://www.fileformat.info/format/tiff/egff.htm
> "... by Dr. Tom Lane of the Independent JPEG Group, a member of the
> TIFF Advisory Committee"

Yes, I was joking, hence the smiley. I did know he was involved in the
above, although I confess I didn't know to what degree... or that he had
a doctorate.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: [HACKERS] Keepalive for max_standby_delay
Robert Haas wrote:
> On Wed, Jun 16, 2010 at 9:56 PM, Tom Lane wrote:
>> Sorry, I've been a bit distracted by other responsibilities (libtiff
>> security issues for Red Hat, if you must know). I'll get on it shortly.
>
> What? You have other things to do besides hack on PostgreSQL?
> Shocking! :-)

I suspect you're kidding, but in case some on the list didn't realize,
Tom's probably as famous (if not moreso) in the image compression
community as he is in the database community:

http://www.jpeg.org/jpeg/index.html
"Probably the largest and most important contribution however was the
work of the Independent JPEG Group (IJG), and Tom Lane in particular."

http://www.w3.org/TR/PNG-Credits.html , http://www.w3.org/TR/PNG/
"PNG (Portable Network Graphics) Specification
Version 1.0
...
Contributing Editor
Tom Lane, t...@sss.pgh.pa.us"

http://www.fileformat.info/format/tiff/egff.htm
"... by Dr. Tom Lane of the Independent JPEG Group, a member of the
TIFF Advisory Committee"
Re: [HACKERS] Keepalive for max_standby_delay
On Wed, Jun 16, 2010 at 9:56 PM, Tom Lane wrote:
> Robert Haas writes:
>> On Wed, Jun 9, 2010 at 8:01 PM, Tom Lane wrote:
>>> Yes, I'll get with it ...
>
>> Any update on this?
>
> Sorry, I've been a bit distracted by other responsibilities (libtiff
> security issues for Red Hat, if you must know). I'll get on it shortly.

What? You have other things to do besides hack on PostgreSQL?
Shocking! :-)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: [HACKERS] Keepalive for max_standby_delay
Robert Haas writes:
> On Wed, Jun 9, 2010 at 8:01 PM, Tom Lane wrote:
>> Yes, I'll get with it ...

> Any update on this?

Sorry, I've been a bit distracted by other responsibilities (libtiff
security issues for Red Hat, if you must know). I'll get on it shortly.

regards, tom lane
Re: [HACKERS] Keepalive for max_standby_delay
On Wed, Jun 9, 2010 at 8:01 PM, Tom Lane wrote:
> Simon Riggs writes:
>> On Thu, 2010-06-03 at 19:02 -0400, Tom Lane wrote:
>>> I decided there wasn't time to get anything useful done on it before
>>> the beta2 deadline (which is, more or less, right now). I will take
>>> another look over the next few days.
>
>> We all really need you to fix up max_standby_delay, or, let me do it.
>
> Yes, I'll get with it ...

Tom,

Any update on this?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: [HACKERS] Keepalive for max_standby_delay
On Thu, 2010-06-03 at 19:02 -0400, Tom Lane wrote:
> Simon Riggs writes:
>> On Thu, 2010-06-03 at 18:18 +0100, Simon Riggs wrote:
>>> Are you planning to work on these things now as you said?
>
>> Are you? Or do you want me to?
>
> I decided there wasn't time to get anything useful done on it before the
> beta2 deadline (which is, more or less, right now). I will take another
> look over the next few days.

Esteemed colleague Tom,

We all really need you to fix up max_standby_delay, or, let me do it.

It appears we both agree that more testing would be beneficial. The only
way we are going to get that is by finishing this off and declaring a new
HS-beta.

Please let me know how you wish to proceed. (No answer means I do it.)

--
Simon Riggs
www.2ndQuadrant.com
Re: [HACKERS] Keepalive for max_standby_delay
Simon Riggs writes: > On Thu, 2010-06-03 at 19:02 -0400, Tom Lane wrote: >> I decided there wasn't time to get anything useful done on it before the >> beta2 deadline (which is, more or less, right now). I will take another >> look over the next few days. > We all really need you to fix up max_standby_delay, or, let me do it. Yes, I'll get with it ... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
Simon Riggs writes: > On Thu, 2010-06-03 at 18:18 +0100, Simon Riggs wrote: >> Are you planning to work on these things now as you said? > Are you? Or do you want me to? I decided there wasn't time to get anything useful done on it before the beta2 deadline (which is, more or less, right now). I will take another look over the next few days. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On Thu, 2010-06-03 at 18:18 +0100, Simon Riggs wrote: > Are you planning to work on these things now as you said? Are you? Or do you want me to? -- Simon Riggs www.2ndQuadrant.com
Re: [HACKERS] Keepalive for max_standby_delay
Greg Stark writes: > On Thu, Jun 3, 2010 at 12:11 AM, Tom Lane wrote: >> It is off-base. The receiver does not "request" data, the sender is >> what determines how much WAL is sent when. > Hm, so what happens if the slave blocks, doesn't the sender block when > the kernel buffers fill up? Well, if the slave can't keep up, that's a separate problem. It will not fail to keep up as a result of the transmission mechanism. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On Thu, Jun 3, 2010 at 4:18 PM, Tom Lane wrote: > Greg Stark writes: >> On Thu, Jun 3, 2010 at 12:11 AM, Tom Lane wrote: >>> It is off-base. The receiver does not "request" data, the sender is >>> what determines how much WAL is sent when. > >> Hm, so what happens if the slave blocks, doesn't the sender block when >> the kernel buffers fill up? > > Well, if the slave can't keep up, that's a separate problem. It will > not fail to keep up as a result of the transmission mechanism. No, I mean if the slave is paused due to a conflict. Does it stop reading data from the master or does it buffer it up on disk? If it stops reading it from the master then the effect is the same as if the slave stopped "requesting" data even if there's no actual request. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On Thu, 2010-06-03 at 13:32 -0400, Tom Lane wrote: > Simon Riggs writes: > > On Thu, 2010-06-03 at 12:47 -0400, Tom Lane wrote: > >> But in any case the current behavior is > >> still quite broken as regards reading stale timestamps from WAL. > > > Agreed. That wasn't the objective of this patch or a priority. > > > If you're reading from an archive, you need to set max_standby_delay > > appropriately, current recommendation is -1. > > That's not a fix; the problem is that there *is no* appropriate setting > for max_standby_delay when you're reading from archive. It is unlikely > that the values of the timestamps you're reading should be considered to > have any bearing on what grace period to allow; but nonetheless it's > desirable to be able to have a non-infinite grace time. I'm not saying that's a fix; I agree we should change that also. > Basically what I think is that we want what you called "apply" semantics > always for reading from archive (and I still think the DBA should be > able to set that grace period separately from the one that applies in > SR operation). Paying attention to the master's timestamps is only > reasonable in the SR context. Agreed. > And yes, I want the dependency on WAL timestamps to be gone completely, > not just mostly. I think that driving the delay logic off of SR receipt > time (and/or the timestamp we agreed to add to the SR protocol) is the > appropriate and sufficient way to deal with the problem. Looking at the > embedded timestamps just confuses the feedback loop behavior. I disagree with "sufficient", with good reason. Please look at the code comments and see what will happen if we don't make the correction suggested. If after that you still disagree, then do it your way and we'll wait for comments in the beta; I think there will be some, but I am tired and prone to agreement to allow this to go through. 
-- Simon Riggs www.2ndQuadrant.com
Re: [HACKERS] Keepalive for max_standby_delay
Simon Riggs writes: > On Thu, 2010-06-03 at 12:47 -0400, Tom Lane wrote: >> But in any case the current behavior is >> still quite broken as regards reading stale timestamps from WAL. > Agreed. That wasn't the objective of this patch or a priority. > If you're reading from an archive, you need to set max_standby_delay > appropriately, current recommendation is -1. That's not a fix; the problem is that there *is no* appropriate setting for max_standby_delay when you're reading from archive. It is unlikely that the values of the timestamps you're reading should be considered to have any bearing on what grace period to allow; but nonetheless it's desirable to be able to have a non-infinite grace time. Basically what I think is that we want what you called "apply" semantics always for reading from archive (and I still think the DBA should be able to set that grace period separately from the one that applies in SR operation). Paying attention to the master's timestamps is only reasonable in the SR context. And yes, I want the dependency on WAL timestamps to be gone completely, not just mostly. I think that driving the delay logic off of SR receipt time (and/or the timestamp we agreed to add to the SR protocol) is the appropriate and sufficient way to deal with the problem. Looking at the embedded timestamps just confuses the feedback loop behavior. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On Thu, 2010-06-03 at 12:47 -0400, Tom Lane wrote: > Simon Riggs writes: > > On Wed, 2010-06-02 at 13:14 -0400, Tom Lane wrote: > >> This patch seems to me to be going in fundamentally the wrong direction. > >> It's adding complexity and overhead (far more than is needed), and it's > >> failing utterly to resolve the objections that I raised to start with. > > > Having read your proposal, it seems changing from time-on-sender to > > time-on-receiver is a one line change to the patch. > > > What else are you thinking of removing, if anything? > > Basically, we need to get rid of everything that feeds timestamps from > the WAL content into the kill-delay logic. I understand your wish to do this, though it isn't always accurate to do that in the case where there is a backlog. The patch already does that *mostly* for the streaming case. The only time it does use the WAL content timestamp is when the WAL content timestamp is later than the oldest receive time, so in that case it is used as a correction. (see code comments and comments below also) > >> In particular, Simon seems to be basically refusing to do anything about > >> the complaint that the code fails unless master and standby clocks are > >> in close sync. I do not believe that this is acceptable, and since he > >> won't fix it, I guess I'll have to. > > > Syncing two servers in replication is common practice, as has been > > explained here; I'm still surprised people think otherwise. > > Doesn't affect the complaint in the least: I do not find it acceptable > to have that be *mandated* in order for our code to work sensibly. OK, accepted. > I would be OK with having something approaching what you want as a > non-default optional behavior (with a clearly-documented dependency > on having synchronized clocks). Yes, that's what I'd been thinking also. So that gives us a way forwards. 
standby_delay_base = apply (default) | send determines whether the standby_delay is calculated solely with reference to the standby server (apply) or whether times from the sending server are used (send). Use of send implies that the clocks on primary and standby are synchronised to within a useful accuracy, in which case it is usual to enforce that with time synchronisation software such as ntp. > But in any case the current behavior is > still quite broken as regards reading stale timestamps from WAL. Agreed. That wasn't the objective of this patch or a priority. If you're reading from an archive, you need to set max_standby_delay appropriately, current recommendation is -1. We can address that concern once the main streaming case is covered, or we can add that now. Are you planning to work on these things now as you said? How about I apply my patch now, then you do another set of changes afterwards to add the other items you mentioned, since that is now 2 additional parameters and related code? -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
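Simon's proposed standby_delay_base parameter could be modeled roughly like this. This is an illustrative Python sketch of the proposed semantics only, not PostgreSQL code; the function name and signature are hypothetical:

```python
def standby_delay(now, receipt_time, sender_timestamp, base="apply"):
    """Sketch of the proposed standby_delay_base semantics.

    'apply': measure delay only against the standby's own clock,
             i.e. the time the WAL was received locally.
    'send':  trust the timestamp the sending server attached, which
             only works if primary and standby clocks are kept in
             sync (e.g. with ntp).
    """
    if base == "apply":
        return now - receipt_time
    if base == "send":
        return now - sender_timestamp
    raise ValueError("unknown standby_delay_base: %r" % base)
```

With unsynchronized clocks the 'send' computation silently absorbs the clock skew, which is the crux of the objection to making that the only behavior.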
Re: [HACKERS] Keepalive for max_standby_delay
Simon Riggs writes: > On Wed, 2010-06-02 at 16:00 -0400, Tom Lane wrote: >> the current situation that query grace periods go to zero > Possibly a better way to handle this concern is to make the second > parameter: min_standby_grace_period - the minimum time a query will be > given in which to execute, even if max_standby_delay has been reached or > exceeded. > Would that more directly address your concerns? > min_standby_grace_period (ms) SIGHUP A minimum grace period seems like a good idea to me, but I think it's somewhat orthogonal to the core problem here. I think we all intuitively feel that there should be a way to dial back the grace period when a slave is "far behind" on applying WAL. The problem is first how to figure out what "far behind" means, and second how to adjust the grace period in a way that doesn't have surprising misbehaviors. A minimum grace period would prevent some of the very worst misbehaviors but it's not really addressing the core problem. regards, tom lane
Re: [HACKERS] Keepalive for max_standby_delay
Simon Riggs writes: > On Wed, 2010-06-02 at 13:14 -0400, Tom Lane wrote: >> This patch seems to me to be going in fundamentally the wrong direction. >> It's adding complexity and overhead (far more than is needed), and it's >> failing utterly to resolve the objections that I raised to start with. > Having read your proposal, it seems changing from time-on-sender to > time-on-receiver is a one line change to the patch. > What else are you thinking of removing, if anything? Basically, we need to get rid of everything that feeds timestamps from the WAL content into the kill-delay logic. >> In particular, Simon seems to be basically refusing to do anything about >> the complaint that the code fails unless master and standby clocks are >> in close sync. I do not believe that this is acceptable, and since he >> won't fix it, I guess I'll have to. > Syncing two servers in replication is common practice, as has been > explained here; I'm still surprised people think otherwise. Doesn't affect the complaint in the least: I do not find it acceptable to have that be *mandated* in order for our code to work sensibly. I would be OK with having something approaching what you want as a non-default optional behavior (with a clearly-documented dependency on having synchronized clocks). But in any case the current behavior is still quite broken as regards reading stale timestamps from WAL. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On Thu, Jun 3, 2010 at 4:34 PM, Tom Lane wrote: > The data keeps coming in and getting dumped into the slave's pg_xlog. > walsender/walreceiver are not at all tied to the slave's application > of WAL. In principle we could have the code around max_standby_delay > notice just how far behind it's gotten and adjust the delay tolerance > based on that; but I think designing a feedback loop for that is 9.1 > material. Er, no. In that case my first concern was misguided. I'm happy there's no feedback loop -- my fear was that there was and it would mean the "time received" could be decoupled from the time the wal was generated. But as you describe it then the time received might be slightly delayed from the time the wal was generated but to some constant degree -- not in a way that will be influenced by the log application being blocked on the slave. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
Greg Stark writes: >> Well, if the slave can't keep up, that's a separate problem. It will >> not fail to keep up as a result of the transmission mechanism. > No, I mean if the slave is paused due to a conflict. Does it stop > reading data from the master or does it buffer it up on disk? If it > stops reading it from the master then the effect is the same as if the > slave stopped "requesting" data even if there's no actual request. The data keeps coming in and getting dumped into the slave's pg_xlog. walsender/walreceiver are not at all tied to the slave's application of WAL. In principle we could have the code around max_standby_delay notice just how far behind it's gotten and adjust the delay tolerance based on that; but I think designing a feedback loop for that is 9.1 material. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On Thu, Jun 3, 2010 at 12:11 AM, Tom Lane wrote: > Greg Stark writes: >> I was assuming the walreceiver only requests more wal in relatively >> small chunks and only when replay has caught up and needs more data. I >> haven't actually read this code so if that assumption is wrong then >> I'm off-base. > > It is off-base. The receiver does not "request" data, the sender is > what determines how much WAL is sent when. Hm, so what happens if the slave blocks, doesn't the sender block when the kernel buffers fill up? >> So I think this isn't necessarily such a blue moon event. As I >> understand it, all it would take is a single long-running report and a >> vacuum or HOT cleanup occurring on the master. > > I think this is mostly FUD too. How often do you see vacuum blocked for > an hour now? No, that's not comparable. On the master vacuum will just ignore tuples that are still visible to live transactions. On the slave it doesn't have a choice, it sees the cleanup record and must pause recovery until those transactions finish. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On Thu, Jun 3, 2010 at 5:00 AM, Tom Lane wrote: >> I stand by my suggestion from yesterday: Let's define max_standby_delay >> as the difference between a piece of WAL becoming available in the >> standby, and applying it. > > My proposal is essentially the same as yours plus allowing the DBA to > choose different max delays for the caught-up and not-caught-up cases. > Maybe everybody will end up setting the two delays the same, but I think > we don't have enough experience to decide that for them now. Applying WAL that arrives via SR is not always a sign of being caught up, and conversely, applying WAL restored from the archive is not always a sign of having fallen behind. So isn't it unreasonable to separate the delays in order to control recovery behavior in those cases? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] Keepalive for max_standby_delay
On Thu, 2010-06-03 at 18:39 +0900, Fujii Masao wrote: > What purpose would that serve? Tom has already explained this and it makes sense to me. -- Simon Riggs www.2ndQuadrant.com
Re: [HACKERS] Keepalive for max_standby_delay
On Thu, Jun 3, 2010 at 6:07 PM, Simon Riggs wrote: > On Thu, 2010-06-03 at 17:56 +0900, Fujii Masao wrote: >> On Thu, Jun 3, 2010 at 4:41 AM, Heikki Linnakangas >> wrote: >> > I don't understand why you want to use a different delay when you're >> > restoring from archive vs. when you're streaming (what about existing WAL >> > files found in pg_xlog, BTW?). The source of WAL shouldn't make a >> > difference. >> >> Yes. The pace of a recovery has nothing to do with that of log shipping. >> So to hurry up a recovery when restoring from archive seems to be useless. > > When streaming drops for some reason we revert to scanning the archive > for files. There is clearly two modes of operation. Yes. > So it makes sense > that you might want to set different times for the parameter in each > case. What purpose would that serve? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On Thu, 2010-06-03 at 17:56 +0900, Fujii Masao wrote: > On Thu, Jun 3, 2010 at 4:41 AM, Heikki Linnakangas > wrote: > > I don't understand why you want to use a different delay when you're > > restoring from archive vs. when you're streaming (what about existing WAL > > files found in pg_xlog, BTW?). The source of WAL shouldn't make a > > difference. > > Yes. The pace of a recovery has nothing to do with that of log shipping. > So to hurry up a recovery when restoring from archive seems to be useless. When streaming drops for some reason we revert to scanning the archive for files. There are clearly two modes of operation. So it makes sense that you might want to set different times for the parameter in each case. We might think of those modes as "connected" and "degraded" modes rather than streaming and file shipping. -- Simon Riggs www.2ndQuadrant.com
Re: [HACKERS] Keepalive for max_standby_delay
On Thu, Jun 3, 2010 at 4:41 AM, Heikki Linnakangas wrote: > I don't understand why you want to use a different delay when you're > restoring from archive vs. when you're streaming (what about existing WAL > files found in pg_xlog, BTW?). The source of WAL shouldn't make a > difference. Yes. The pace of recovery has nothing to do with the pace of log shipping, so hurrying up recovery when restoring from archive seems pointless. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] Keepalive for max_standby_delay
On Wed, 2010-06-02 at 16:00 -0400, Tom Lane wrote: > the current situation that query grace periods go to zero Possibly a better way to handle this concern is to make the second parameter: min_standby_grace_period - the minimum time a query will be given in which to execute, even if max_standby_delay has been reached or exceeded. Would that more directly address your concerns? min_standby_grace_period (ms) SIGHUP -- Simon Riggs www.2ndQuadrant.com
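As a rough illustration of how such a min_standby_grace_period floor might interact with max_standby_delay (a hypothetical Python model of the decision, not actual server code):

```python
def should_cancel_query(now, conflict_start, standby_delay,
                        max_standby_delay, min_grace):
    # Cancel a conflicting standby query only if replay has fallen
    # behind its delay budget AND the query has already had its
    # guaranteed minimum grace period since the conflict began.
    over_budget = standby_delay > max_standby_delay
    had_min_grace = (now - conflict_start) >= min_grace
    return over_budget and had_min_grace
```

The floor guarantees a query is never killed instantly, even when the delay budget is already exhausted when the conflict arises.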
Re: [HACKERS] Keepalive for max_standby_delay
* Tom Lane (t...@sss.pgh.pa.us) wrote: > Greg Stark writes: > > So I think this isn't necessarily such a blue moon event. As I > > understand it, all it would take is a single long-running report and a > > vacuum or HOT cleanup occurring on the master. > > I think this is mostly FUD too. How often do you see vacuum blocked for > an hour now? It probably can happen, which is why we need to be able to > kick queries off the locks with max_standby_delay, but it's far from > common. What we're concerned about here is how long a buffer lock on a > single page is held, not how long heavyweight locks are held. The > normal hold times are measured in microseconds. Maybe I'm missing something, but I think Greg's point was that if you have a long-running query running against the standby/slave/whatever, which is holding locks on various relations to implement that report, and then a vacuum or HOT update happens on the master, the long-running report query will get killed off unless you bump max_streaming_delay up pretty high (eg: 60 mins). That being said, I'm not sure that there's really another solution. Yes, in this case, the slave can end up being an hour behind, but that's the trade-off you have to make if you want to run an hour-long query on the slave. The other answer is to make the master not update those tuples, etc, which might be possible by starting a transaction on the master which grabs things enough to prevent the vacuum/hot/etc update from happening. That may be possible manually, but it's not fun and it certainly isn't something we'll have built-in support for in 9.0. Thanks, Stephen
Re: [HACKERS] Keepalive for max_standby_delay
Greg Stark writes: > I was assuming the walreceiver only requests more wal in relatively > small chunks and only when replay has caught up and needs more data. I > haven't actually read this code so if that assumption is wrong then > I'm off-base. It is off-base. The receiver does not "request" data, the sender is what determines how much WAL is sent when. > So I think this isn't necessarily such a blue moon event. As I > understand it, all it would take is a single long-running report and a > vacuum or HOT cleanup occurring on the master. I think this is mostly FUD too. How often do you see vacuum blocked for an hour now? It probably can happen, which is why we need to be able to kick queries off the locks with max_standby_delay, but it's far from common. What we're concerned about here is how long a buffer lock on a single page is held, not how long heavyweight locks are held. The normal hold times are measured in microseconds. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On Wed, Jun 2, 2010 at 8:14 PM, Tom Lane wrote: > Indeed, but nothing we do can prevent that, if the slave is just plain > slower than the master. You have to assume that the slave is capable of > keeping up in the absence of query-caused delays, or you're hosed. I was assuming the walreceiver only requests more WAL in relatively small chunks and only when replay has caught up and needs more data. I haven't actually read this code so if that assumption is wrong then I'm off-base. But if my assumption is right then it's not merely the master running faster than the slave that can cause you to fall arbitrarily far behind. The "receive time" is delayed by how long the slave waited on a previous lock, so if we wait for a few minutes, then proceed and find we need more WAL data we'll read data from the master that it could have generated those few minutes ago. > The sticky point is that once in a blue moon you do have a conflicting > query sitting on a buffer lock for a long time, or even more likely a > series of queries keeping the WAL replay process from obtaining buffer > cleanup lock. So I think this isn't necessarily such a blue moon event. As I understand it, all it would take is a single long-running report and a vacuum or HOT cleanup occurring on the master. If I want to set max_standby_delay to 60min to allow reports to run for up to an hour then any random HOT cleanup on the master will propagate to the slave and cause a WAL stall until the transactions which have that xid in their snapshot end. Even the buffer locks could potentially be blocked for a long time if you happen to run the right kind of query (say a nested loop with the buffer in question on the outside and a large table on the inside). That's a rarer event though; is that what you were thinking of? -- greg
Re: [HACKERS] Keepalive for max_standby_delay
Heikki Linnakangas writes: > The problem with defining max_archive_delay that way is again that you > can fall behind indefinitely. See my response to Greg Stark --- I don't think this is really an issue. It's certainly far less of an issue than the current situation that query grace periods go to zero under not-at-all-extreme conditions. > I don't understand why you want to use a different delay when you're > restoring from archive vs. when you're streaming (what about existing > WAL files found in pg_xlog, BTW?). You're missing the point. I want the DBA to be able to control what happens in those two cases. In the current implementation he has no control over what happens while restoring from archive: it's going to effectively act like max_archive_delay = 0 all the time. If you're of the opinion that's good, you can set the parameter that way and be happy. I'm of the opinion you'll soon find out it isn't so good, because it'll kill standby queries too easily. > I stand by my suggestion from yesterday: Let's define max_standby_delay > as the difference between a piece of WAL becoming available in the > standby, and applying it. My proposal is essentially the same as yours plus allowing the DBA to choose different max delays for the caught-up and not-caught-up cases. Maybe everybody will end up setting the two delays the same, but I think we don't have enough experience to decide that for them now. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On 02/06/10 20:14, Tom Lane wrote:
> For realistic values of max_standby_delay ...

Hang on right there. What do you consider a realistic value for max_standby_delay? Because I'm not sure I have a grip on that myself. 5 seconds? 5 minutes? 5 hours? I can see use cases for all of those...

> What I think might be a realistic compromise is this:
>
> 1. Separate max_standby_delay into two GUCs, say "max_streaming_delay" and "max_archive_delay".
>
> 2. When applying WAL that came across SR, use max_streaming_delay and let the time measurement be current time minus time of receipt of the current WAL send chunk.
>
> 3. When applying WAL that came from archive, use max_archive_delay and let the time measurement be current time minus time of acquisition of the current WAL segment from the archive.
>
> The current code's behavior in the latter case could effectively be modeled by setting max_archive_delay to zero, but that isn't the only plausible setting. More likely DBAs would set max_archive_delay to something smaller than max_streaming_delay, but still positive so as to not kill conflicting queries instantly.

The problem with defining max_archive_delay that way is again that you can fall behind indefinitely. If you set it to 5 minutes, it means that you'll wait a maximum of 5 minutes *per WAL segment*, even if WAL is being generated faster.

I don't understand why you want to use a different delay when you're restoring from archive vs. when you're streaming (what about existing WAL files found in pg_xlog, BTW?). The source of WAL shouldn't make a difference. If it's because you assume that restoring from archive is a sign that you've fallen behind a lot, surely you've exceeded max_standby_delay then and I still don't see a need for a separate GUC.

I stand by my suggestion from yesterday: Let's define max_standby_delay as the difference between a piece of WAL becoming available in the standby, and applying it.
To approximate "piece of WAL becoming available" for SR, we can use the mechanism with send/applyChunks from Simon's latest patch, or go with the simpler scheme of just resetting a "last caughtup timestamp" to current time whenever we have to wait for new WAL to arrive. When restoring from archive, likewise reset "last caughtup timestamp" whenever restore_command returns non-0, i.e. we have to wait for the next WAL file to arrive. That works the same for both SR and file-based log shipping, there's only one knob to set, it's simple to implement, and it doesn't require synchronized clocks. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
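Heikki's "last caughtup timestamp" scheme could be sketched like this (illustrative Python with an injectable clock, not the actual implementation; the class and method names are hypothetical):

```python
import time

class StandbyDelayTracker:
    """Reset the reference point whenever the standby has to wait for
    WAL to arrive -- whether from streaming or from restore_command --
    and measure max_standby_delay against that local timestamp only."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.last_caughtup = clock()

    def waited_for_wal(self):
        # Called whenever no WAL was available and replay had to wait:
        # at that moment the standby is caught up with what has arrived.
        self.last_caughtup = self.clock()

    def standby_delay(self):
        # Value compared against max_standby_delay; it uses only the
        # standby's own clock, so no cross-server sync is needed.
        return self.clock() - self.last_caughtup
```

Because the reference point resets on every wait, the measured delay grows only while WAL is available but unapplied, which is exactly the "available in the standby vs. applied" definition above.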
Re: [HACKERS] Keepalive for max_standby_delay
Greg Stark writes: > On Wed, Jun 2, 2010 at 6:14 PM, Tom Lane wrote: >> I believe that the motivation for treating archived timestamps as live >> is, essentially, to force rapid catchup if a slave falls behind so far >> that it's reading from archive instead of SR. > Huh, I think this is the first mention of this that I've seen. I > always assumed the motivation was just that you wanted to control how > much data loss could occur on failover and how long recovery would > take. I think separating the two delays is an interesting idea but I > don't see how it counts as a simplification. Well, it isn't a simplification: it's bringing it up to the minimum complication level where it'll actually work sanely. The current implementation doesn't work sanely because it confuses stale timestamps read from WAL with real live time. > This also still allows a slave to become arbitrarily far behind the > master. Indeed, but nothing we do can prevent that, if the slave is just plain slower than the master. You have to assume that the slave is capable of keeping up in the absence of query-caused delays, or you're hosed. The real reason this is at issue is the fact that the max_standby_delay kill mechanism applies to certain buffer-level locking operations. On the master we just wait, and it's not a problem, because in practice the conflicting queries almost always release these locks pretty quick. On the slave, though, instant kill as a result of a buffer-level lock conflict would result in a very serious degradation in standby query reliability (while also doing practically nothing for the speed of WAL application, most of the time). This morning on the phone Bruce and I were seriously discussing the idea of ripping the max_standby_delay mechanism out of the buffer-level locking paths, and just let them work like they do on the master, ie, wait forever. 
If we did that then simplifying max_standby_delay to a boolean would be reasonable again (because it really would only come into play for DDL on the master). The sticky point is that once in a blue moon you do have a conflicting query sitting on a buffer lock for a long time, or even more likely a series of queries keeping the WAL replay process from obtaining buffer cleanup lock. So it seems that we have to have max_standby_delay-like logic for those locks, and also that zero grace period before kill isn't a very practical setting. However, there isn't a lot of point in obsessing over exactly how long the grace period ought to be, as long as it's more than a few milliseconds. It *isn't* going to have any real effect on whether the slave can stay caught up. You could make a fairly decent case for just measuring the grace period from when the replay process starts to wait, as I think I proposed a while back. The value of measuring delay from a receipt time is that if you do happen to have a bunch of delays within a short interval you'll become more willing to kill queries --- but I really believe that that is a corner case and will have nothing to do with ordinary performance. > I propose an alternate way out of the problem of syncing two clocks. > Instead of comparing timestamps compare time intervals. So as it reads > xlog records it only ever compares the master timestamps with previous > master timestamps to determine how much time has elapsed on the > master. It compares that time elapsed with the time elapsed on the > slave to determine if it's falling behind. I think this would just add complexity and uncertainty, to deal with something that won't be much of a problem in practice. regards, tom lane
Re: [HACKERS] Keepalive for max_standby_delay
On Wed, Jun 2, 2010 at 2:27 PM, Simon Riggs wrote: > Syncing two servers in replication is common practice, as has been > explained here; I'm still surprised people think otherwise. Measuring > the time between two servers is the very purpose of the patch, so the > synchronisation is not a design flaw, it is its raison d'etre. I think the purpose of the patch should not be to measure the time difference between servers, but rather the replication delay. While I don't necessarily agree with Tom's statement that this is must-fix, I do agree that it would be nicer if we could avoid depending on time sync. Yeah, I keep my servers time synced, too. But, shit happens. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: [HACKERS] Keepalive for max_standby_delay
On Wed, 2010-06-02 at 13:14 -0400, Tom Lane wrote: > This patch seems to me to be going in fundamentally the wrong direction. > It's adding complexity and overhead (far more than is needed), and it's > failing utterly to resolve the objections that I raised to start with. Having read your proposal, it seems changing from time-on-sender to time-on-receiver is a one-line change to the patch. What else are you thinking of removing, if anything? Adding an extra parameter obviously adds more complexity, and is something I now agree with. > In particular, Simon seems to be basically refusing to do anything about > the complaint that the code fails unless master and standby clocks are > in close sync. I do not believe that this is acceptable, and since he > won't fix it, I guess I'll have to. Syncing two servers in replication is common practice, as has been explained here; I'm still surprised people think otherwise. Measuring the time between two servers is the very purpose of the patch, so the synchronisation is not a design flaw, it is its raison d'etre. There's been a few spleens emptied on that topic, not all of them mine, and certainly no consensus on that. So I'm not refusing to do anything that's been agreed... -- Simon Riggs www.2ndQuadrant.com
Re: [HACKERS] Keepalive for max_standby_delay
On Wed, Jun 2, 2010 at 6:14 PM, Tom Lane wrote: > I believe that the motivation for treating archived timestamps as live > is, essentially, to force rapid catchup if a slave falls behind so far > that it's reading from archive instead of SR. Huh, I think this is the first mention of this that I've seen. I always assumed the motivation was just that you wanted to control how much data loss could occur on failover and how long recovery would take. I think separating the two delays is an interesting idea but I don't see how it counts as a simplification. This also still allows a slave to become arbitrarily far behind the master. The master might take a long time to fill up a log file, and if the slave is blocked for a few minutes, then requests the next bit of log file and blocks for a few more minutes, then requests the next log file and blocks for a few more minutes, etc. As long as the slave is reading more slowly than the master is filling those segments it will fall further and further behind but never be more than a few minutes behind receiving the logs. I propose an alternate way out of the problem of syncing two clocks. Instead of comparing timestamps compare time intervals. So as it reads xlog records it only ever compares the master timestamps with previous master timestamps to determine how much time has elapsed on the master. It compares that time elapsed with the time elapsed on the slave to determine if it's falling behind. I'm assuming it periodically catches up and can throw out its accumulated time elapsed and start fresh. If not then the accumulated error might be a problem, but hopefully one we can find a solution to. -- greg
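Greg's interval-comparison idea can be sketched as a tiny standalone model. This is not PostgreSQL source; the `lag_tracker` type and function names are invented for illustration. The point of the design is that a constant clock offset between master and slave cancels out, because each side's elapsed time is measured against its own clock; only clock drift accumulates, which the periodic reset bounds.

```c
#include <assert.h>

/* Sketch (invented names, not PostgreSQL source) of comparing time
 * intervals instead of absolute timestamps: track elapsed time on
 * each side since a common reference point, so a constant clock
 * offset between master and slave cancels out. */

typedef struct
{
    double master_elapsed;  /* seconds of master time covered by replayed WAL */
    double slave_elapsed;   /* seconds of slave wall-clock time spent replaying */
} lag_tracker;

/* Master side advances from timestamp deltas read out of WAL records. */
static void
lag_note_master(lag_tracker *t, double delta)
{
    t->master_elapsed += delta;
}

/* Slave side advances from the local clock. */
static void
lag_note_slave(lag_tracker *t, double delta)
{
    t->slave_elapsed += delta;
}

/* Positive result: the slave took longer than the master to cover the
 * same stretch of WAL, i.e. it is falling behind by this many seconds. */
static double
lag_behind(const lag_tracker *t)
{
    return t->slave_elapsed - t->master_elapsed;
}

/* When fully caught up, start fresh to limit accumulated drift error. */
static void
lag_reset(lag_tracker *t)
{
    t->master_elapsed = t->slave_elapsed = 0.0;
}
```

For example, replaying WAL that represents 60 seconds of master activity in 70 seconds of slave time means the slave has fallen 10 seconds further behind, regardless of what either clock reads in absolute terms.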
Re: [HACKERS] Keepalive for max_standby_delay
On Wed, Jun 2, 2010 at 2:03 PM, Simon Riggs wrote: > On Wed, 2010-06-02 at 13:45 -0400, Tom Lane wrote: >> Stephen Frost writes: >> > * Tom Lane (t...@sss.pgh.pa.us) wrote: >> >> Comments? >> >> > I'm not really a huge fan of adding another GUC, to be honest. I'm more >> > inclined to say we treat 'max_archive_delay' as '0', and turn >> > max_streaming_delay into what you've described. If we fall back so far >> > that we have to go back to reading WALs, then we need to hurry up and >> > catch-up and damn the torpedoes. >> >> If I thought that 0 were a generally acceptable value, I'd still be >> pushing the "simplify it to a boolean" agenda ;-). The problem is that >> that will sometimes kill standby queries even when they are quite short >> and doing nothing objectionable. > > OK, now I understand. I was just thinking the same as Stephen, but now I > agree we need a second parameter. I too was thinking the same as Stephen, but now I also agree we need a second parameter. I think the whole design Tom proposed seems good. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: [HACKERS] Keepalive for max_standby_delay
On Wed, 2010-06-02 at 13:45 -0400, Tom Lane wrote: > Stephen Frost writes: > > * Tom Lane (t...@sss.pgh.pa.us) wrote: > >> Comments? > > > I'm not really a huge fan of adding another GUC, to be honest. I'm more > > inclined to say we treat 'max_archive_delay' as '0', and turn > > max_streaming_delay into what you've described. If we fall back so far > > that we have to go back to reading WALs, then we need to hurry up and > > catch-up and damn the torpedoes. > > If I thought that 0 were a generally acceptable value, I'd still be > pushing the "simplify it to a boolean" agenda ;-). The problem is that > that will sometimes kill standby queries even when they are quite short > and doing nothing objectionable. OK, now I understand. I was just thinking the same as Stephen, but now I agree we need a second parameter. -- Simon Riggs www.2ndQuadrant.com
Re: [HACKERS] Keepalive for max_standby_delay
Stephen Frost writes: > * Tom Lane (t...@sss.pgh.pa.us) wrote: >> Comments? > I'm not really a huge fan of adding another GUC, to be honest. I'm more > inclined to say we treat 'max_archive_delay' as '0', and turn > max_streaming_delay into what you've described. If we fall back so far > that we have to go back to reading WALs, then we need to hurry up and > catch-up and damn the torpedoes. If I thought that 0 were a generally acceptable value, I'd still be pushing the "simplify it to a boolean" agenda ;-). The problem is that that will sometimes kill standby queries even when they are quite short and doing nothing objectionable. > I'd also prefer that we only wait the > delay time once until we're fully caught up again (and have gotten > back around to waiting for new data). The delays will be measured from a receipt instant to current time, which means that the longer it takes to apply a WAL segment or WAL send chunk, the less grace period there will be. (Which is the same as what CVS HEAD does --- I'm just arguing about where we get the start time from.) I believe this does what you suggest and more. regards, tom lane
Re: [HACKERS] Keepalive for max_standby_delay
Tom Lane wrote: I'm still inclined to apply the part of Simon's patch that adds a transmit timestamp to each SR send chunk. That would actually be completely unused by the slave given my proposal above, but I think that it is an important step to take to future-proof the SR protocol against possible changes in the slave-side timing logic. +1. From a radically different perspective, I had to do something similar in the buildfarm years ago to protect us from machines reporting with grossly inaccurate timestamps. This was part of the solution. The client adds its current timestamp setting just before transmitting the data to the server. cheers andrew
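The buildfarm-style check Andrew describes can be sketched in a few lines. This is an invented illustration, not buildfarm or PostgreSQL code: the sender stamps each message just before transmitting it, and the receiver compares that stamp with its own receipt time. The difference is clock skew plus one-way network latency, which is plenty accurate for catching grossly wrong clocks.

```c
#include <assert.h>

/* Sketch (invented names) of skew detection from a transmit timestamp:
 * the receiver subtracts the sender's stamp from its own receipt time.
 * The result is clock skew plus one-way network latency -- good enough
 * to reject grossly inaccurate clocks, though not to measure small skews. */

typedef long long usec_t;       /* microseconds since a shared epoch */

static usec_t
apparent_skew(usec_t sender_stamp, usec_t receipt_time)
{
    return receipt_time - sender_stamp;
}

/* Reject (or just flag) data whose apparent skew is out of bounds. */
static int
stamp_plausible(usec_t sender_stamp, usec_t receipt_time, usec_t max_skew)
{
    usec_t skew = apparent_skew(sender_stamp, receipt_time);

    return -max_skew <= skew && skew <= max_skew;
}
```

Stamping as late as possible before transmission, as Andrew notes, keeps queuing delays on the sender out of the measurement.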
Re: [HACKERS] Keepalive for max_standby_delay
* Tom Lane (t...@sss.pgh.pa.us) wrote: > An important property of this design is that all relevant timestamps > are taken on the slave, so clock skew isn't an issue. I agree that this is important, and I do run NTP on all my servers and even monitor it using Nagios. It's still not a cure-all for time skew issues. > Comments? I'm not really a huge fan of adding another GUC, to be honest. I'm more inclined to say we treat 'max_archive_delay' as '0', and turn max_streaming_delay into what you've described. If we fall back so far that we have to go back to reading WALs, then we need to hurry up and catch-up and damn the torpedoes. I'd also prefer that we only wait the delay time once until we're fully caught up again (and have gotten back around to waiting for new data). Thanks, Stephen
Re: [HACKERS] Keepalive for max_standby_delay
Simon Riggs writes: > OK, here's v4. I've been trying to stay out of this thread, but with beta2 approaching and no resolution in sight, I'm afraid I have to get involved. This patch seems to me to be going in fundamentally the wrong direction. It's adding complexity and overhead (far more than is needed), and it's failing utterly to resolve the objections that I raised to start with. In particular, Simon seems to be basically refusing to do anything about the complaint that the code fails unless master and standby clocks are in close sync. I do not believe that this is acceptable, and since he won't fix it, I guess I'll have to. The other basic problem that I had with the current code, which this does nothing about, is that it believes that timestamps pulled from WAL archive should be treated on the same footing as timestamps of "live" actions. That might work in certain limited scenarios but it's going to be a disaster in general. I believe that the motivation for treating archived timestamps as live is, essentially, to force rapid catchup if a slave falls behind so far that it's reading from archive instead of SR. There are certainly use-cases where that's appropriate (though, again, not all); but even when you do want it it's a pretty inflexible implementation. For realistic values of max_standby_delay the behavior is going to pretty much be "instant kill whenever we're reading from archive". I have backed off my original suggestion that we should reduce max_standby_delay to a boolean: that was based on the idea that delays would only occur in response to DDL on the master, but since vacuum and btree page splits can also trigger delays, it seems clear that a "kill queries immediately" policy isn't really very useful in practice. So we need to make max_standby_delay work rather than just get rid of it. What I think might be a realistic compromise is this: 1. Separate max_standby_delay into two GUCs, say "max_streaming_delay" and "max_archive_delay". 2. 
When applying WAL that came across SR, use max_streaming_delay and let the time measurement be current time minus time of receipt of the current WAL send chunk. 3. When applying WAL that came from archive, use max_archive_delay and let the time measurement be current time minus time of acquisition of the current WAL segment from the archive. The current code's behavior in the latter case could effectively be modeled by setting max_archive_delay to zero, but that isn't the only plausible setting. More likely DBAs would set max_archive_delay to something smaller than max_streaming_delay, but still positive so as to not kill conflicting queries instantly. An important property of this design is that all relevant timestamps are taken on the slave, so clock skew isn't an issue. I'm still inclined to apply the part of Simon's patch that adds a transmit timestamp to each SR send chunk. That would actually be completely unused by the slave given my proposal above, but I think that it is an important step to take to future-proof the SR protocol against possible changes in the slave-side timing logic. I don't however see the value of transmitting "keepalive" records when we'd otherwise not have anything to send. The value of adding timestamps to the SR protocol is really to let the slave determine the current amount of clock skew between it and the master; which is a number that should not change so rapidly that it has to be updated every 100ms even in an idle system. Comments? regards, tom lane
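The two-GUC compromise above can be sketched as a small standalone model. This is not the committed implementation; the names (`wal_source`, `conflict_grace_exceeded`) are invented. The essential points are that the limit is chosen by where the WAL being applied came from, and that both timestamps are taken on the slave, so clock skew never enters into it.

```c
#include <stdbool.h>
#include <assert.h>

/* Sketch (invented names, not the committed implementation) of the
 * two-GUC design: pick the grace-period limit by where the WAL being
 * applied came from, and measure it against a slave-side timestamp --
 * receipt of the SR send chunk, or acquisition of the segment from
 * the archive -- so master/slave clock skew is irrelevant. */

typedef enum { WAL_FROM_STREAM, WAL_FROM_ARCHIVE } wal_source;

static int max_streaming_delay = 30;    /* seconds, -1 = wait forever */
static int max_archive_delay = 5;       /* smaller, to catch up faster */

static bool
conflict_grace_exceeded(wal_source source, long receipt_time, long now)
{
    int limit = (source == WAL_FROM_STREAM) ? max_streaming_delay
                                            : max_archive_delay;

    if (limit < 0)
        return false;           /* wait forever */
    return (now - receipt_time) > limit;
}
```

Setting `max_archive_delay` to zero recovers the then-current behavior ("instant kill whenever we're reading from archive"), while a small positive value keeps catchup fast without killing every conflicting query outright.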
Re: [HACKERS] Keepalive for max_standby_delay
Simon Riggs wrote: > On Mon, 2010-05-31 at 14:40 -0400, Bruce Momjian wrote: > >> Uh, we have three days before we package 9.0beta2. It would be >> good if we could decide on the max_standby_delay issue soon. > > I've heard something from Heikki, not from anyone else. Those > comments amount to "lets replace max_standby_delay with > max_apply_delay". > > Got a beta idea? Given the incessant ticking of the clock, I have a hard time believing we have any real options besides max_standby_delay or a boolean which corresponds to the -1 and 0 settings of max_standby_delay. I think it's pretty clear that there's a use case for the positive values, although there are bound to be some who try it and are surprised by behavior at transition from idle to active. The whole debate seems to boil down to how important a middle ground is versus how damaging the surprise factor is. (I don't really buy the argument that we won't be able to remove it later if we replace it with something better.) I know there were initially some technical problems, too; have those been resolved? -Kevin
Re: [HACKERS] Keepalive for max_standby_delay
On Mon, 2010-05-31 at 14:40 -0400, Bruce Momjian wrote: > Uh, we have three days before we package 9.0beta2. It would be good if > we could decide on the max_standby_delay issue soon. I've heard something from Heikki, not from anyone else. Those comments amount to "lets replace max_standby_delay with max_apply_delay". Got a beta idea? -- Simon Riggs www.2ndQuadrant.com
Re: [HACKERS] Keepalive for max_standby_delay
Thanks for the review. On Tue, 2010-06-01 at 13:36 +0300, Heikki Linnakangas wrote: > If we really want to try to salvage max_standby_delay with a meaning > similar to what it has now, I think we should go with the idea some > people bashed around earlier and define the grace period as the > difference between a WAL record becoming available to the standby for > replay, and between replaying it. An approximation of that is to do > "lastIdle = gettimeofday()" in XLogPageRead() whenever it needs to wait > for new WAL to arrive, whether that's via streaming replication or by a > success return code from restore_command, and compare the difference of > that with current timestamp in WaitExceedsMaxStandbyDelay(). That wouldn't cope with a continuous stream of records arriving, unless you also include the second half of the patch. > That's very simple, doesn't require synchronized clocks, and works the > same with file- and stream-based setups. Nor does it provide a mechanism for monitoring of SR. standby_delay is explicitly defined in terms of the gap between two servers, so is a useful real world concept. apply_delay is somewhat less interesting. I'm sure most people would rather have monitoring and therefore the requirement for synchronised-ish clocks, than no monitoring. If you think no monitoring is OK, I don't, but there are other ways, so its not a point to fight about. > This certainly alleviates some of the problems. You still need to ensure > that master and standby have synchronized clocks, and you still get zero > grace time after a long period of inactivity when not using streaming > replication, however. Second issue can be added once we approve the rest of this if you like. > Sending a keep-alive message every 100ms seems overly aggressive to me. It's sent every wal_sender_delay. Why is that a negative? 
-- Simon Riggs www.2ndQuadrant.com
Re: [HACKERS] Keepalive for max_standby_delay
On 27/05/10 20:26, Simon Riggs wrote: On Wed, 2010-05-26 at 16:22 -0700, Josh Berkus wrote: Just this second posted about that, as it turns out. I have a v3 *almost* ready of the keepalive patch. It still makes sense to me after a few days reflection, so is worth discussion and review. In or out, I want this settled within a week. Definitely need some R&R here. Does the keepalive fix all the issues with max_standby_delay? Tom? OK, here's v4. Summary * WALSender adds a timestamp onto the header of every WAL chunk sent. * Each WAL record now has a conceptual "send timestamp" that remains constant while that record is replayed. This is used as the basis from which max_standby_delay is calculated when required during replay. * Send timestamp is calculated as the later of the timestamp of chunk in which WAL record was sent and the latest XLog time. * WALSender sends an empty message as a keepalive when nothing else to send. (No longer a special message type for the keepalive). I think its close, but if there's a gaping hole here somewhere then I'll punt for this release. This certainly alleviates some of the problems. You still need to ensure that master and standby have synchronized clocks, and you still get zero grace time after a long period of inactivity when not using streaming replication, however. Sending a keep-alive message every 100ms seems overly aggressive to me. If we really want to try to salvage max_standby_delay with a meaning similar to what it has now, I think we should go with the idea some people bashed around earlier and define the grace period as the difference between a WAL record becoming available to the standby for replay, and between replaying it. 
An approximation of that is to do "lastIdle = gettimeofday()" in XLogPageRead() whenever it needs to wait for new WAL to arrive, whether that's via streaming replication or by a success return code from restore_command, and compare the difference of that with current timestamp in WaitExceedsMaxStandbyDelay(). That's very simple, doesn't require synchronized clocks, and works the same with file- and stream-based setups. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Keepalive for max_standby_delay
Uh, we have three days before we package 9.0beta2. It would be good if we could decide on the max_standby_delay issue soon. --- Simon Riggs wrote: > On Wed, 2010-05-26 at 16:22 -0700, Josh Berkus wrote: > > > Just this second posted about that, as it turns out. > > > > > > I have a v3 *almost* ready of the keepalive patch. It still makes sense > > > to me after a few days reflection, so is worth discussion and review. In > > > or out, I want this settled within a week. Definitely need some R&R > > > here. > > > > Does the keepalive fix all the issues with max_standby_delay? Tom? > > OK, here's v4. > > Summary > > * WALSender adds a timestamp onto the header of every WAL chunk sent. > > * Each WAL record now has a conceptual "send timestamp" that remains > constant while that record is replayed. This is used as the basis from > which max_standby_delay is calculated when required during replay. > > * Send timestamp is calculated as the later of the timestamp of chunk in > which WAL record was sent and the latest XLog time. > > * WALSender sends an empty message as a keepalive when nothing else to > send. (No longer a special message type for the keepalive). > > I think its close, but if there's a gaping hole here somewhere then I'll > punt for this release. > > -- > Simon Riggs www.2ndQuadrant.com [ Attachment, skipping... ] -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + None of us is going to be here forever. +
Re: [HACKERS] Keepalive for max_standby_delay
On Wed, 2010-05-26 at 16:22 -0700, Josh Berkus wrote: > > Just this second posted about that, as it turns out. > > > > I have a v3 *almost* ready of the keepalive patch. It still makes sense > > to me after a few days reflection, so is worth discussion and review. In > > or out, I want this settled within a week. Definitely need some R&R > > here. > > Does the keepalive fix all the issues with max_standby_delay? Tom? OK, here's v4. Summary * WALSender adds a timestamp onto the header of every WAL chunk sent. * Each WAL record now has a conceptual "send timestamp" that remains constant while that record is replayed. This is used as the basis from which max_standby_delay is calculated when required during replay. * Send timestamp is calculated as the later of the timestamp of the chunk in which the WAL record was sent and the latest XLog time. * WALSender sends an empty message as a keepalive when nothing else to send. (No longer a special message type for the keepalive.) I think it's close, but if there's a gaping hole here somewhere then I'll punt for this release. -- Simon Riggs www.2ndQuadrant.com

[patch attachment: modifies doc/src/sgml/protocol.sgml, src/backend/access/transam/xlog.c, and src/backend/replication/walreceiver.c — adds a Byte8 message timestamp to the header of each CopyData WAL message (whose data portion may now be zero length, serving as a keepalive), and has walreceiver extract the chunk timestamp from the message header]
Re: [HACKERS] Keepalive for max_standby_delay
> Just this second posted about that, as it turns out. > > I have a v3 *almost* ready of the keepalive patch. It still makes sense > to me after a few days reflection, so is worth discussion and review. In > or out, I want this settled within a week. Definitely need some R&R > here. Does the keepalive fix all the issues with max_standby_delay? Tom? -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Re: [HACKERS] Keepalive for max_standby_delay
On Wed, 2010-05-26 at 15:45 -0700, Josh Berkus wrote: > > Committed with chunk size of 128 kB. I hope that's a reasonable > > compromise, in the absence of any performance test data either way. > > So where are we with max_standby_delay? Status check? Just this second posted about that, as it turns out. I have a v3 *almost* ready of the keepalive patch. It still makes sense to me after a few days reflection, so is worth discussion and review. In or out, I want this settled within a week. Definitely need some R&R here. -- Simon Riggs www.2ndQuadrant.com
Re: [HACKERS] Keepalive for max_standby_delay
On Sun, 2010-05-16 at 17:11 +0100, Simon Riggs wrote: > New version, with some other cleanup of wait processing. > > New logic is that when Startup asks for next applychunk of WAL it saves > the lastChunkTimestamp. That is then the base time used by > WaitExceedsMaxStandbyDelay(), except when latestXLogTime is later. > Since multiple receivechunks can arrive from primary before Startup asks > for next applychunk we use the oldest receivechunk timestamp, not the > latest. Doing it this way means the lastChunkTimestamp doesn't change > when new keepalives arrive, so we have a stable and reasonably accurate > recordSendTimestamp for each WAL record. > > The keepalive is sent as the first part of a new message, if any. So > partial chunks of data always have an accurate timestamp, even if that > is slightly older as a result. Doesn't make much difference except with > very large chunks. > > I think that addresses the points raised on this thread and others. Was about to post v3 after your last commit, but just found a minor bug in my v2->v3 changes, even though they were fairly light. Too tired to fix now. The general thinking underlying this patch is still the same though and is worth discussing over next few days. -- Simon Riggs www.2ndQuadrant.com
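The bookkeeping described in the quoted text above can be sketched as a standalone model. This is an invented illustration, not the patch code: several receive chunks may arrive before the startup process asks for its next apply chunk, so the base timestamp handed to the startup process is the oldest unconsumed chunk timestamp. Later keepalives therefore don't move the base time, giving each WAL record a stable send timestamp.

```c
#include <assert.h>

/* Sketch (invented names, not the patch code) of "use the oldest
 * receivechunk timestamp, not the latest": chunks queue up between
 * calls from the startup process, which always takes the oldest
 * pending stamp as its base time for max_standby_delay. */

#define MAX_PENDING 64

static long pending_stamps[MAX_PENDING];
static int  pending_count = 0;

/* walreceiver side: a chunk (possibly an empty keepalive) arrived
 * carrying this send timestamp. */
static void
chunk_received(long send_stamp)
{
    if (pending_count < MAX_PENDING)
        pending_stamps[pending_count++] = send_stamp;
}

/* startup side: consume all pending chunks, returning the oldest
 * stamp (chunks arrive in send order), or the previous base time
 * if nothing new arrived. */
static long
next_apply_chunk_stamp(long prev_stamp)
{
    long oldest = prev_stamp;

    if (pending_count > 0)
    {
        oldest = pending_stamps[0];
        pending_count = 0;
    }
    return oldest;
}
```

Taking the oldest stamp is the conservative choice: it can only shorten, never stretch, the apparent grace period for conflicting queries.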
Re: [HACKERS] Keepalive for max_standby_delay
On Thu, 2010-05-27 at 01:34 +0300, Heikki Linnakangas wrote: > Committed with chunk size of 128 kB. Thanks. I'm sure that's fine. -- Simon Riggs www.2ndQuadrant.com
Re: [HACKERS] Keepalive for max_standby_delay
> Committed with chunk size of 128 kB. I hope that's a reasonable > compromise, in the absence of any performance test data either way. So where are we with max_standby_delay? Status check? -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Re: [HACKERS] Keepalive for max_standby_delay
On 19/05/10 00:37, Simon Riggs wrote: On Tue, 2010-05-18 at 17:25 -0400, Heikki Linnakangas wrote: On 18/05/10 17:17, Simon Riggs wrote: There's no reason that the buffer size we use for XLogRead() should be the same as the send buffer, if you're worried about that. My point is that pq_putmessage contains internal flushes so at the libpq level you gain nothing by big batches. The read() will be buffered anyway with readahead so not sure what the issue is. We'll have to do this for sync rep anyway, so what's the big deal? Just do it now, once. Do we really want 9.1 code to differ here? Do what? What exactly is it that you want instead, then? Read and write smaller messages, so the latency is minimised. Libpq will send in 8192 byte packets, so writing anything larger gains nothing when the WAL data is also chunked at exactly the same size. Committed with chunk size of 128 kB. I hope that's a reasonable compromise, in the absence of any performance test data either way. I'm wary of setting it as low as 8k, as there is some per-message overhead. Some of that could be avoided by rearranging the loops so that the ps display is not updated at every message as you suggested, but I don't feel like doing any extra rearrangements at this point. It would not be hard, but it also certainly wouldn't make the code simpler. I believe in practice 128kB is just as good as 8k from the responsiveness point of view. If a standby is not responding, walsender will be stuck trying to send no matter what the block size is. If it is responding, no matter how slowly, 128kB should get transferred pretty quickly. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Keepalive for max_standby_delay
On Tue, 2010-05-18 at 17:25 -0400, Heikki Linnakangas wrote: > On 18/05/10 17:17, Simon Riggs wrote: > > There's no reason that the buffer size we use for XLogRead() should be > > the same as the send buffer, if you're worried about that. My point is > > that pq_putmessage contains internal flushes so at the libpq level you > > gain nothing by big batches. The read() will be buffered anyway with > > readahead so not sure what the issue is. We'll have to do this for sync > > rep anyway, so what's the big deal? Just do it now, once. Do we really > > want 9.1 code to differ here? > > Do what? What exactly is it that you want instead, then? Read and write smaller messages, so the latency is minimised. Libpq will send in 8192 byte packets, so writing anything larger gains nothing when the WAL data is also chunked at exactly the same size. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On 18/05/10 17:17, Simon Riggs wrote: There's no reason that the buffer size we use for XLogRead() should be the same as the send buffer, if you're worried about that. My point is that pq_putmessage contains internal flushes so at the libpq level you gain nothing by big batches. The read() will be buffered anyway with readahead so not sure what the issue is. We'll have to do this for sync rep anyway, so what's the big deal? Just do it now, once. Do we really want 9.1 code to differ here? Do what? What exactly is it that you want instead, then? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On Tue, 2010-05-18 at 17:08 -0400, Heikki Linnakangas wrote: > On 17/05/10 12:36, Jim Nasby wrote: > > On May 15, 2010, at 12:05 PM, Heikki Linnakangas wrote: > >> What exactly is the user trying to monitor? If it's "how far behind is > >> the standby", the difference between pg_current_xlog_insert_location() > >> in the master and pg_last_xlog_replay_location() in the standby seems > >> more robust and well-defined to me. It's a measure of XLOG location (ie. > >> bytes) instead of time, but time is a complicated concept. > > > > I can tell you that end users *will* want a time-based indication of how > > far behind we are. DBAs will understand "we're this many transactions > > behind", but managers and end users won't. Unless it's unreasonable to > > provide that info, we should do so. > > No doubt about that, the problem is that it's hard to provide a reliable > time-based indication. I think I have one now. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On Tue, 2010-05-18 at 17:06 -0400, Heikki Linnakangas wrote: > On 17/05/10 04:40, Simon Riggs wrote: > > On Sun, 2010-05-16 at 16:53 +0100, Simon Riggs wrote: > >>> > >>> Attached patch rearranges the walsender loops slightly to fix the above. > >>> XLogSend() now only sends up to MAX_SEND_SIZE bytes (== XLOG_SEG_SIZE / > >>> 2) in one round and returns to the main loop after that even if there's > >>> unsent WAL, and the main loop no longer sleeps if there's unsent WAL. > >>> That way the main loop gets to respond to signals quickly, and we also > >>> get to update the shared memory status and PS display more often when > >>> there's a lot of catching up to do. > >>> > >>> Comments > >> > >> 8MB at a time still seems like a large batch to me. > >> > >> libpq is going to send it in smaller chunks anyway, so I don't see the > >> importance of trying to keep the batch too large. It just introduces > >> delay into the sending process. We should be sending chunks that matches > >> libpq better. > > > > More to the point the logic will fail if XLOG_BLCKSZ> PQ_BUFFER_SIZE > > because it will send partial pages. > > I don't see a failure. We rely on not splitting WAL records across > messages, but we're talking about libpq-level CopyData messages, not TCP > messages. OK > > Having MAX_SEND_SIZE> PQ_BUFFER_SIZE is pointless, as libpq currently > > stands. > > Well, it does affect the size of the read() in walsender, and I'm sure > there's some overhead in setting the ps display and the other small > stuff we do once per message. But you're probably right that we could > easily make MAX_SEND_SIZE much smaller with no noticeable affect on > performance, while making walsender more responsive to signals. I'll > decrease it to, say, 512 kB. I'm pretty certain we don't need to set the ps display once per message. ps doesn't need an update 5 times per second on average. 
There's no reason that the buffer size we use for XLogRead() should be the same as the send buffer, if you're worried about that. My point is that pq_putmessage contains internal flushes so at the libpq level you gain nothing by big batches. The read() will be buffered anyway with readahead so not sure what the issue is. We'll have to do this for sync rep anyway, so what's the big deal? Just do it now, once. Do we really want 9.1 code to differ here? -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On 17/05/10 12:36, Jim Nasby wrote: On May 15, 2010, at 12:05 PM, Heikki Linnakangas wrote: What exactly is the user trying to monitor? If it's "how far behind is the standby", the difference between pg_current_xlog_insert_location() in the master and pg_last_xlog_replay_location() in the standby seems more robust and well-defined to me. It's a measure of XLOG location (ie. bytes) instead of time, but time is a complicated concept. I can tell you that end users *will* want a time-based indication of how far behind we are. DBAs will understand "we're this many transactions behind", but managers and end users won't. Unless it's unreasonable to provide that info, we should do so. No doubt about that, the problem is that it's hard to provide a reliable time-based indication. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
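[Editor's note] The byte-based lag Heikki proposes can be computed from the two location values. A sketch assuming the two-part (xlogid, xrecoff) representation of that era, and simplifying by treating each xlogid as covering 2^32 bytes; the real server of the time left the final segment of each logical file unused, so production code would need the exact XLogFileSize arithmetic:

```c
#include <stdint.h>

/* Pre-9.3-style WAL location: a logical file id plus a byte offset. */
typedef struct
{
    uint32_t xlogid;
    uint32_t xrecoff;
} XLogRecPtrCompat;

/* Approximate replication lag in bytes between the master's insert
 * location and the standby's replay location. Assumes master >= standby
 * and the simplified 2^32-bytes-per-xlogid model described above. */
static uint64_t wal_lag_bytes(XLogRecPtrCompat master, XLogRecPtrCompat standby)
{
    uint64_t m = ((uint64_t) master.xlogid << 32) | master.xrecoff;
    uint64_t s = ((uint64_t) standby.xlogid << 32) | standby.xrecoff;
    return m - s;
}
```

This is the "measure of XLOG location (ie. bytes)" in the message above: robust and well-defined, but it says nothing about time, which is exactly the gap the keepalive patch tries to fill.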
Re: [HACKERS] Keepalive for max_standby_delay
On 17/05/10 04:40, Simon Riggs wrote: On Sun, 2010-05-16 at 16:53 +0100, Simon Riggs wrote: Attached patch rearranges the walsender loops slightly to fix the above. XLogSend() now only sends up to MAX_SEND_SIZE bytes (== XLOG_SEG_SIZE / 2) in one round and returns to the main loop after that even if there's unsent WAL, and the main loop no longer sleeps if there's unsent WAL. That way the main loop gets to respond to signals quickly, and we also get to update the shared memory status and PS display more often when there's a lot of catching up to do. Comments 8MB at a time still seems like a large batch to me. libpq is going to send it in smaller chunks anyway, so I don't see the importance of trying to keep the batch too large. It just introduces delay into the sending process. We should be sending chunks that match libpq better. More to the point the logic will fail if XLOG_BLCKSZ > PQ_BUFFER_SIZE because it will send partial pages. I don't see a failure. We rely on not splitting WAL records across messages, but we're talking about libpq-level CopyData messages, not TCP messages. Having MAX_SEND_SIZE > PQ_BUFFER_SIZE is pointless, as libpq currently stands. Well, it does affect the size of the read() in walsender, and I'm sure there's some overhead in setting the ps display and the other small stuff we do once per message. But you're probably right that we could easily make MAX_SEND_SIZE much smaller with no noticeable effect on performance, while making walsender more responsive to signals. I'll decrease it to, say, 512 kB. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On May 15, 2010, at 12:05 PM, Heikki Linnakangas wrote: > What exactly is the user trying to monitor? If it's "how far behind is > the standby", the difference between pg_current_xlog_insert_location() > in the master and pg_last_xlog_replay_location() in the standby seems > more robust and well-defined to me. It's a measure of XLOG location (ie. > bytes) instead of time, but time is a complicated concept. I can tell you that end users *will* want a time-based indication of how far behind we are. DBAs will understand "we're this many transactions behind", but managers and end users won't. Unless it's unreasonable to provide that info, we should do so. -- Jim C. Nasby, Database Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On Sun, 2010-05-16 at 16:53 +0100, Simon Riggs wrote: > > > > Attached patch rearranges the walsender loops slightly to fix the above. > > XLogSend() now only sends up to MAX_SEND_SIZE bytes (== XLOG_SEG_SIZE / > > 2) in one round and returns to the main loop after that even if there's > > unsent WAL, and the main loop no longer sleeps if there's unsent WAL. > > That way the main loop gets to respond to signals quickly, and we also > > get to update the shared memory status and PS display more often when > > there's a lot of catching up to do. > > > > Comments > > 8MB at a time still seems like a large batch to me. > > libpq is going to send it in smaller chunks anyway, so I don't see the > importance of trying to keep the batch too large. It just introduces > delay into the sending process. We should be sending chunks that match > libpq better. More to the point the logic will fail if XLOG_BLCKSZ > PQ_BUFFER_SIZE because it will send partial pages. Having MAX_SEND_SIZE > PQ_BUFFER_SIZE is pointless, as libpq currently stands. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On Mon, 2010-05-17 at 11:51 +0900, Fujii Masao wrote: > Is it OK that this keepalive message cannot be used by HS in file-based > log-shipping? Even in SR, the startup process cannot use the keepalive > until walreceiver has been started up. Yes, I see those things. We already have archive_timeout to handle the keepalive case in file-based mode. When starting up, the delay is high anyway, so accuracy doesn't really matter much - though we do use latestXLogTime in that case. > WalSndKeepAlive() always calls initStringInfo(), which seems to cause > memory-leak. Thanks. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On Mon, May 17, 2010 at 1:11 AM, Simon Riggs wrote: > On Sat, 2010-05-15 at 19:50 +0100, Simon Riggs wrote: >> On Sat, 2010-05-15 at 18:24 +0100, Simon Riggs wrote: >> >> > I will recode using that concept. > >> Startup gets new pointer when it runs out of data to replay. That might >> or might not include an updated keepalive timestamp, since there's no >> exact relationship between chunks sent and chunks received. Startup >> might ask for a new chunk when half a chunk has been received, or when >> multiple chunks have been received. > > New version, with some other cleanup of wait processing. > > New logic is that when Startup asks for next applychunk of WAL it saves > the lastChunkTimestamp. That is then the base time used by > WaitExceedsMaxStandbyDelay(), except when latestXLogTime is later. > Since multiple receivechunks can arrive from primary before Startup asks > for next applychunk we use the oldest receivechunk timestamp, not the > latest. Doing it this way means the lastChunkTimestamp doesn't change > when new keepalives arrive, so we have a stable and reasonably accurate > recordSendTimestamp for each WAL record. > > The keepalive is sent as the first part of a new message, if any. So > partial chunks of data always have an accurate timestamp, even if that > is slightly older as a result. Doesn't make much difference except with > very large chunks. > > I think that addresses the points raised on this thread and others. Is it OK that this keepalive message cannot be used by HS in file-based log-shipping? Even in SR, the startup process cannot use the keepalive until walreceiver has been started up. WalSndKeepAlive() always calls initStringInfo(), which seems to cause memory-leak. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
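[Editor's note] On the leak Fujii reports: initStringInfo() allocates a fresh buffer every time it is called, so calling it once per keepalive in a long-lived process keeps allocating and never frees; the usual pattern is to initialize once and reset the buffer per message. A stand-in illustration of the two patterns using plain malloc (not the server's palloc-based StringInfo; the names here are invented for the sketch):

```c
#include <stdlib.h>

/* Toy stand-in for the server's StringInfoData. */
typedef struct
{
    char  *data;
    size_t len;
    size_t maxlen;
} StrBuf;

/* Allocate a fresh buffer. Calling this repeatedly on the same struct,
 * as WalSndKeepAlive() does with initStringInfo(), drops the previous
 * allocation on the floor -- the leak pattern reported above. */
static void strbuf_init(StrBuf *b)
{
    b->maxlen = 1024;
    b->data = malloc(b->maxlen);
    b->len = 0;
}

/* Reuse the existing allocation: the leak-free per-message pattern. */
static void strbuf_reset(StrBuf *b)
{
    b->len = 0;
}
```

After a single strbuf_init(), each message cycle calls strbuf_reset(), so the data pointer is stable and nothing accumulates.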
Re: [HACKERS] Keepalive for max_standby_delay
On Sun, May 16, 2010 at 6:05 AM, Heikki Linnakangas wrote: > Heikki Linnakangas wrote: >> Simon Riggs wrote: >>> WALSender sleeps even when it might have more WAL to send, it doesn't >>> check it just unconditionally sleeps. At least WALReceiver loops until >>> it has no more to receive. I just can't imagine why that's useful >>> behaviour. >> >> Good catch. That should be fixed. >> >> I also note that walsender doesn't respond to signals, while it's >> sending a large batch. That's analogous to the issue that was addressed >> recently in the archiver process. > > Attached patch rearranges the walsender loops slightly to fix the above. > XLogSend() now only sends up to MAX_SEND_SIZE bytes (== XLOG_SEG_SIZE / > 2) in one round and returns to the main loop after that even if there's > unsent WAL, and the main loop no longer sleeps if there's unsent WAL. > That way the main loop gets to respond to signals quickly, and we also > get to update the shared memory status and PS display more often when > there's a lot of catching up to do. > > Comments The main loop also needs to respond quickly to the 'X' message from the standby? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On Sat, 2010-05-15 at 19:50 +0100, Simon Riggs wrote: > On Sat, 2010-05-15 at 18:24 +0100, Simon Riggs wrote: > > > I will recode using that concept. > Startup gets new pointer when it runs out of data to replay. That might > or might not include an updated keepalive timestamp, since there's no > exact relationship between chunks sent and chunks received. Startup > might ask for a new chunk when half a chunk has been received, or when > multiple chunks have been received. New version, with some other cleanup of wait processing. New logic is that when Startup asks for next applychunk of WAL it saves the lastChunkTimestamp. That is then the base time used by WaitExceedsMaxStandbyDelay(), except when latestXLogTime is later. Since multiple receivechunks can arrive from primary before Startup asks for next applychunk we use the oldest receivechunk timestamp, not the latest. Doing it this way means the lastChunkTimestamp doesn't change when new keepalives arrive, so we have a stable and reasonably accurate recordSendTimestamp for each WAL record. The keepalive is sent as the first part of a new message, if any. So partial chunks of data always have an accurate timestamp, even if that is slightly older as a result. Doesn't make much difference except with very large chunks. I think that addresses the points raised on this thread and others. -- Simon Riggs www.2ndQuadrant.com *** a/doc/src/sgml/protocol.sgml --- b/doc/src/sgml/protocol.sgml *** *** 4232,4247 The commands accepted in walsender mode are: ! ! ! ! ! A single WAL record is never split across two CopyData messages. When a WAL record crosses a WAL page boundary, however, and is therefore already split using continuation records, it can be split at the page boundary. In other words, the first main WAL record and its continuation records can be split across different CopyData messages. --- 4232,4283 ! A single WAL record is never split across two CopyData messages. 
When a WAL record crosses a WAL page boundary, however, and is therefore already split using continuation records, it can be split at the page boundary. In other words, the first main WAL record and its continuation records can be split across different CopyData messages. + + + + + + Keepalive (B) + + + + + + + Byte1('k') + + + + Identifies the message as a keepalive. + + + + + + TimestampTz + + + + The current timestamp on the primary server when the keepalive was sent. + + + + + + +If wal_level is set to hot_standby then a keepalive +is sent once per wal_sender_delay. The keepalive is sent after +WAL data has been sent, if any. + + + + *** a/src/backend/access/transam/xlog.c --- b/src/backend/access/transam/xlog.c *** *** 1938,1944 UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force) UpdateControlFile(); minRecoveryPoint = newMinRecoveryPoint; ! ereport(DEBUG2, (errmsg("updated min recovery point to %X/%X", minRecoveryPoint.xlogid, minRecoveryPoint.xrecoff))); } --- 1938,1944 UpdateControlFile(); minRecoveryPoint = newMinRecoveryPoint; ! ereport(DEBUG3, (errmsg("updated min recovery point to %X/%X", minRecoveryPoint.xlogid, minRecoveryPoint.xrecoff))); } *** *** 9212,9218 retry: * While walreceiver is active, wait for new WAL to arrive * from primary. */ ! receivedUpto = GetWalRcvWriteRecPtr(); if (XLByteLT(*RecPtr, receivedUpto)) { /* --- 9212,9218 * While walreceiver is active, wait for new WAL to arrive * from primary. */ ! 
receivedUpto = GetWalRcvNextChunk(); if (XLByteLT(*RecPtr, receivedUpto)) { /* *** a/src/backend/replication/walreceiver.c --- b/src/backend/replication/walreceiver.c *** *** 407,412 XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len) --- 407,428 XLogWalRcvWrite(buf, len, recptr); break; } + case 'k':/* keepalive */ + { + TimestampTz keepalive; + + if (len != sizeof(TimestampTz)) + ereport(ERROR, + (errcode(ERRCODE_PROTOCOL_VIOLATION), + errmsg_internal("invalid keepalive message received from primary"))); + + memcpy(&keepalive, buf, sizeof(TimestampTz)); + buf += sizeof(TimestampTz); + len -= sizeof(TimestampTz); + + SetWalRcvChunkTimestamp(keepalive); + break; + } default: ereport(ERROR,
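[Editor's note] The walreceiver hunk above validates and decodes the new 'k' message, whose body is a single TimestampTz. The same step in isolation, with TimestampTz modeled as 64-bit microseconds and the ereport() reduced to a return code (a sketch of the shape, not the actual server code):

```c
#include <stdint.h>
#include <string.h>

/* The server's TimestampTz is a 64-bit microsecond count. */
typedef int64_t TimestampTz;

/* Decode a keepalive message body: it must be exactly one TimestampTz.
 * Returns 0 and stores the timestamp on success; returns -1 on a length
 * mismatch, where the patch raises ERRCODE_PROTOCOL_VIOLATION with
 * "invalid keepalive message received from primary". */
static int decode_keepalive(const char *buf, size_t len, TimestampTz *ts)
{
    if (len != sizeof(TimestampTz))
        return -1;
    memcpy(ts, buf, sizeof(TimestampTz));   /* avoid unaligned access */
    return 0;
}
```

The length check matters because the message type byte has already been consumed by the dispatcher; anything other than exactly sizeof(TimestampTz) remaining means the primary and standby disagree about the protocol.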
Re: [HACKERS] Keepalive for max_standby_delay
On Sun, 2010-05-16 at 00:05 +0300, Heikki Linnakangas wrote: > Heikki Linnakangas wrote: > > Simon Riggs wrote: > >> WALSender sleeps even when it might have more WAL to send, it doesn't > >> check it just unconditionally sleeps. At least WALReceiver loops until > >> it has no more to receive. I just can't imagine why that's useful > >> behaviour. > > > > Good catch. That should be fixed. > > > > I also note that walsender doesn't respond to signals, while it's > > sending a large batch. That's analogous to the issue that was addressed > > recently in the archiver process. > > Attached patch rearranges the walsender loops slightly to fix the above. > XLogSend() now only sends up to MAX_SEND_SIZE bytes (== XLOG_SEG_SIZE / > 2) in one round and returns to the main loop after that even if there's > unsent WAL, and the main loop no longer sleeps if there's unsent WAL. > That way the main loop gets to respond to signals quickly, and we also > get to update the shared memory status and PS display more often when > there's a lot of catching up to do. > > Comments 8MB at a time still seems like a large batch to me. libpq is going to send it in smaller chunks anyway, so I don't see the importance of trying to keep the batch too large. It just introduces delay into the sending process. We should be sending chunks that match libpq better. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
Heikki Linnakangas wrote: > Simon Riggs wrote: >> WALSender sleeps even when it might have more WAL to send, it doesn't >> check it just unconditionally sleeps. At least WALReceiver loops until >> it has no more to receive. I just can't imagine why that's useful >> behaviour. > > Good catch. That should be fixed. > > I also note that walsender doesn't respond to signals, while it's > sending a large batch. That's analogous to the issue that was addressed > recently in the archiver process. Attached patch rearranges the walsender loops slightly to fix the above. XLogSend() now only sends up to MAX_SEND_SIZE bytes (== XLOG_SEG_SIZE / 2) in one round and returns to the main loop after that even if there's unsent WAL, and the main loop no longer sleeps if there's unsent WAL. That way the main loop gets to respond to signals quickly, and we also get to update the shared memory status and PS display more often when there's a lot of catching up to do. Comments, have I screwed up anything? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com *** a/src/backend/replication/walsender.c --- b/src/backend/replication/walsender.c *** *** 100,106 static void InitWalSnd(void); static void WalSndHandshake(void); static void WalSndKill(int code, Datum arg); static void XLogRead(char *buf, XLogRecPtr recptr, Size nbytes); ! static bool XLogSend(StringInfo outMsg); static void CheckClosedConnection(void); /* --- 100,106 static void WalSndHandshake(void); static void WalSndKill(int code, Datum arg); static void XLogRead(char *buf, XLogRecPtr recptr, Size nbytes); ! static bool XLogSend(StringInfo outMsg, bool *caughtup); static void CheckClosedConnection(void); /* *** *** 360,365 static int --- 360,366 WalSndLoop(void) { StringInfoData output_message; + bool caughtup = false; initStringInfo(&output_message); *** *** 387,393 WalSndLoop(void) */ if (ready_to_stop) { ! XLogSend(&output_message); shutdown_requested = true; } --- 388,394 */ if (ready_to_stop) { ! 
XLogSend(&output_message, &caughtup); shutdown_requested = true; } *** *** 402,432 WalSndLoop(void) } /* ! * Nap for the configured time or until a message arrives. * * On some platforms, signals won't interrupt the sleep. To ensure we * respond reasonably promptly when someone signals us, break down the * sleep into NAPTIME_PER_CYCLE increments, and check for * interrupts after each nap. */ ! remain = WalSndDelay * 1000L; ! while (remain > 0) { ! if (got_SIGHUP || shutdown_requested || ready_to_stop) ! break; ! /* ! * Check to see whether a message from the standby or an interrupt ! * from other processes has arrived. ! */ ! pg_usleep(remain > NAPTIME_PER_CYCLE ? NAPTIME_PER_CYCLE : remain); ! CheckClosedConnection(); ! remain -= NAPTIME_PER_CYCLE; } - /* Attempt to send the log once every loop */ ! if (!XLogSend(&output_message)) goto eof; } --- 403,434 } /* ! * If we had sent all accumulated WAL in last round, nap for the ! * configured time before retrying. * * On some platforms, signals won't interrupt the sleep. To ensure we * respond reasonably promptly when someone signals us, break down the * sleep into NAPTIME_PER_CYCLE increments, and check for * interrupts after each nap. */ ! if (caughtup) { ! remain = WalSndDelay * 1000L; ! while (remain > 0) ! { ! /* Check for interrupts */ ! if (got_SIGHUP || shutdown_requested || ready_to_stop) ! break; ! /* Sleep and check that the connection is still alive */ ! pg_usleep(remain > NAPTIME_PER_CYCLE ? NAPTIME_PER_CYCLE : remain); ! CheckClosedConnection(); ! remain -= NAPTIME_PER_CYCLE; ! } } /* Attempt to send the log once every loop */ ! if (!XLogSend(&output_message, &caughtup)) goto eof; } *** *** 623,637 XLogRead(char *buf, XLogRecPtr recptr, Size nbytes) } /* ! * Read all WAL that's been written (and flushed) since last cycle, and send ! * it to client. * * Returns true if OK, false if trouble. */ static bool ! 
XLogSend(StringInfo outMsg) { XLogRecPtr SendRqstPtr; char activitymsg[50]; /* use volatile pointer to prevent code rearrangement */ --- 625,644 } /* ! * Read up to MAX_SEND_SIZE bytes of WAL that's been written (and flushed), ! * but not yet sent to the client, and send it. If there is no unsent WAL, ! * *caughtup is set to true and nothing is sent, otherwise *caughtup is set ! * to false. * * Returns true if OK, false if trouble. */ static bool ! XLogSend(StringInfo outMsg, bool *caughtup) { XLogRecPtr SendRqstPtr; + XLogRecPtr startptr; + XLogRecPtr endptr; + Size nbytes; char acti
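[Editor's note] Stripped of the I/O and shared-memory details, the control flow of the patch above is: send at most MAX_SEND_SIZE bytes per round, report via *caughtup whether anything remains, and only nap when caught up, so signals get checked between rounds. A self-contained sketch of that shape (illustrative names, not the real walsender API; MAX_SEND_SIZE here is the 128 kB value eventually committed, not the XLOG_SEG_SIZE / 2 of this draft):

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_SEND_SIZE (128 * 1024)   /* chunk size per round */

/* One round of sending: advance *sentPtr by at most MAX_SEND_SIZE toward
 * writePtr and set *caughtup when nothing remains. Mirrors the shape of
 * the patched XLogSend(), with the read() and pq_putmessage() elided. */
static void send_one_chunk(uint64_t writePtr, uint64_t *sentPtr, bool *caughtup)
{
    uint64_t remaining = writePtr - *sentPtr;
    uint64_t n = remaining < MAX_SEND_SIZE ? remaining : MAX_SEND_SIZE;

    *sentPtr += n;                    /* stand-in for the actual send */
    *caughtup = (*sentPtr == writePtr);
}

/* Rounds needed to drain a backlog. Between rounds the real main loop
 * checks got_SIGHUP / shutdown_requested, so a smaller chunk size means
 * more frequent signal checks while catching up. */
static int rounds_to_drain(uint64_t backlog)
{
    uint64_t sent = 0;
    bool     caughtup = false;
    int      rounds = 0;

    while (!caughtup)
    {
        send_one_chunk(backlog, &sent, &caughtup);
        rounds++;
    }
    return rounds;
}
```

This also shows why the main loop must not sleep while *caughtup is false: sleeping between rounds would reintroduce exactly the burst-then-stall behaviour Simon complains about.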
Re: [HACKERS] Keepalive for max_standby_delay
Simon Riggs wrote: > WALSender sleeps even when it might have more WAL to send, it doesn't > check it just unconditionally sleeps. At least WALReceiver loops until > it has no more to receive. I just can't imagine why that's useful > behaviour. Good catch. That should be fixed. I also note that walsender doesn't respond to signals, while it's sending a large batch. That's analogous to the issue that was addressed recently in the archiver process. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On Sat, 2010-05-15 at 18:24 +0100, Simon Riggs wrote: > I will recode using that concept. There are some behaviours that aren't helpful here: Startup gets new pointer when it runs out of data to replay. That might or might not include an updated keepalive timestamp, since there's no exact relationship between chunks sent and chunks received. Startup might ask for a new chunk when half a chunk has been received, or when multiple chunks have been received. WALSender doesn't chunk up what it sends, though libpq does at a lower level. So there's no way to make a keepalive happen every X bytes without doing this from within libpq. WALSender sleeps even when it might have more WAL to send; it doesn't check, it just unconditionally sleeps. At least WALReceiver loops until it has no more to receive. I just can't imagine why that's useful behaviour. All in all, I think we should be calling this "burst replication", not streaming replication. That does cause an issue in trying to monitor what's going on because there's so much jitter. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On Sat, 2010-05-15 at 20:05 +0300, Heikki Linnakangas wrote: > Simon Riggs wrote: > > On Sat, 2010-05-15 at 19:30 +0300, Heikki Linnakangas wrote: > >> Doesn't feel right to me either. If you want to expose the > >> keepalive-time to queries, it should be a separate field, something like > >> lastMasterKeepaliveTime and a pg_last_master_keepalive() function to > >> read it. > > > > That wouldn't be good because then you couldn't easily monitor the > > delay? You'd have to run two different functions depending on the state > > of replication (for which we would need yet another function). Users > > would just wrap that back up into a single function. > > What exactly is the user trying to monitor? If it's "how far behind is > the standby", the difference between pg_current_xlog_insert_location() > in the master and pg_last_xlog_replay_location() in the standby seems > more robust and well-defined to me. It's a measure of XLOG location (ie. > bytes) instead of time, but time is a complicated concept. Maybe, but it's meaningful to users and that is the point. > Also note that as the patch stands, if you receive a keep-alive from the > master at point X, it doesn't mean that the standby is fully up-to-date. > It's possible that the walsender just finished sending a huge batch of > accumulated WAL, say 1 GB, and it took 1 hour for the batch to be sent. > During that time, a lot more WAL has accumulated, yet walsender sends a > keep-alive with the current timestamp. Not at all. The timestamp for the keepalive is calculated after the pq_flush for the main WAL data. So even if it takes 10 years to send the next blob of WAL data, the timestamp will be current. However, a point you made in an earlier thread is still true here. It sounds like it would be best if the startup process didn't re-access shared memory for the next location until it had fully replayed all the WAL up to the point it last accessed. 
That way the changing value of the shared timestamp would have no effect on the calculated value at any point in time. I will recode using that concept. > In general, the purpose of a keep-alive is to keep the connection alive, > but you're trying to accomplish something else too, and I don't fully > understand what it is. That surprises me. If nothing else, it's in the title of the thread, though since you personally added this to the Hot Standby todo more than 6 months ago I'd hope you of all people would understand the purpose. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
Simon Riggs wrote: > On Sat, 2010-05-15 at 19:30 +0300, Heikki Linnakangas wrote: >> Doesn't feel right to me either. If you want to expose the >> keepalive-time to queries, it should be a separate field, something like >> lastMasterKeepaliveTime and a pg_last_master_keepalive() function to >> read it. > > That wouldn't be good because then you couldn't easily monitor the > delay? You'd have to run two different functions depending on the state > of replication (for which we would need yet another function). Users > would just wrap that back up into a single function. What exactly is the user trying to monitor? If it's "how far behind is the standby", the difference between pg_current_xlog_insert_location() in the master and pg_last_xlog_replay_location() in the standby seems more robust and well-defined to me. It's a measure of XLOG location (ie. bytes) instead of time, but time is a complicated concept. Also note that as the patch stands, if you receive a keep-alive from the master at point X, it doesn't mean that the standby is fully up-to-date. It's possible that the walsender just finished sending a huge batch of accumulated WAL, say 1 GB, and it took 1 hour for the batch to be sent. During that time, a lot more WAL has accumulated, yet walsender sends a keep-alive with the current timestamp. In general, the purpose of a keep-alive is to keep the connection alive, but you're trying to accomplish something else too, and I don't fully understand what it is. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On Sat, 2010-05-15 at 19:30 +0300, Heikki Linnakangas wrote: > Simon Riggs wrote: > > On Sat, 2010-05-15 at 11:45 -0400, Tom Lane wrote: > >> I'm also extremely dubious that it's a good idea to set > >> recoveryLastXTime from this. Using both that and the timestamps from > >> the wal log flies in the face of everything I remember about control > >> theory. We should be doing only one or only the other, I think. > > > > I can change it so that the recoveryLastXTime will not be updated if we > > are using the value from the keepalives. So we have one, or the other. > > Remember that replication can switch backwards and forwards between > > modes, so it seems sensible to have a common timing value whichever mode > > we're in. > > That means that recoveryLastXTime can jump forwards and backwards. That behaviour would be bad, so why not just prevent that from happening? > Doesn't feel right to me either. If you want to expose the > keepalive-time to queries, it should be a separate field, something like > lastMasterKeepaliveTime and a pg_last_master_keepalive() function to > read it. That wouldn't be good because then you couldn't easily monitor the delay? You'd have to run two different functions depending on the state of replication (for which we would need yet another function). Users would just wrap that back up into a single function. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
Simon Riggs wrote:
> On Sat, 2010-05-15 at 11:45 -0400, Tom Lane wrote:
>> I'm also extremely dubious that it's a good idea to set
>> recoveryLastXTime from this. Using both that and the timestamps from
>> the wal log flies in the face of everything I remember about control
>> theory. We should be doing only one or only the other, I think.
>
> I can change it so that the recoveryLastXTime will not be updated if we
> are using the value from the keepalives. So we have one, or the other.
> Remember that replication can switch backwards and forwards between
> modes, so it seems sensible to have a common timing value whichever mode
> we're in.

That means that recoveryLastXTime can jump forwards and backwards.
Doesn't feel right to me either. If you want to expose the
keepalive-time to queries, it should be a separate field, something like
lastMasterKeepaliveTime and a pg_last_master_keepalive() function to
read it.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Keepalive for max_standby_delay
On Sat, 2010-05-15 at 11:45 -0400, Tom Lane wrote:
> Simon Riggs writes:
> > Patch adds a keepalive message to ensure max_standby_delay is useful.
>
> The proposed placement of the keepalive-send is about the worst it could
> possibly be. It needs to be done right before pq_flush to ensure
> minimum transfer delay. Otherwise any attempt to measure clock skew
> using the timestamp will be seriously off. Where you've placed it
> guarantees maximum delay not minimum.

I don't understand. WalSndKeepAlive() contains a pq_flush() immediately
after the timestamp is set. I did it that way for exactly the same
reasons you've said.

Do you mean you only want to see one pq_flush()? I used two so that the
actual data is never delayed by a keepalive. WAL Sender was going to
sleep anyway, so it shouldn't be a problem.

> I'm also extremely dubious that it's a good idea to set
> recoveryLastXTime from this. Using both that and the timestamps from
> the wal log flies in the face of everything I remember about control
> theory. We should be doing only one or only the other, I think.

I can change it so that the recoveryLastXTime will not be updated if we
are using the value from the keepalives. So we have one, or the other.
Remember that replication can switch backwards and forwards between
modes, so it seems sensible to have a common timing value whichever mode
we're in.

--
Simon Riggs           www.2ndQuadrant.com
Re: [HACKERS] Keepalive for max_standby_delay
Simon Riggs writes:
> Patch adds a keepalive message to ensure max_standby_delay is useful.

The proposed placement of the keepalive-send is about the worst it could
possibly be. It needs to be done right before pq_flush to ensure
minimum transfer delay. Otherwise any attempt to measure clock skew
using the timestamp will be seriously off. Where you've placed it
guarantees maximum delay not minimum.

I'm also extremely dubious that it's a good idea to set
recoveryLastXTime from this. Using both that and the timestamps from
the wal log flies in the face of everything I remember about control
theory. We should be doing only one or only the other, I think.

			regards, tom lane
[HACKERS] Keepalive for max_standby_delay
Patch adds a keepalive message to ensure max_standby_delay is useful.

No WAL format changes, no libpq changes. Just an additional message type
for the streaming replication protocol, sent once per main loop in
WALsender. Plus docs.

Comments?

--
Simon Riggs           www.2ndQuadrant.com


diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index c63d003..391d990 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -4232,16 +4232,52 @@ The commands accepted in walsender mode are:
-      </para>
-      </listitem>
-      </varlistentry>
-      </variablelist>
-      </para>
+      A single WAL record is never split across two CopyData messages.
+      When a WAL record crosses a WAL page boundary, however, and is
+      therefore already split using continuation records, it can be
+      split at the page boundary. In other words, the first main WAL
+      record and its continuation records can be split across different
+      CopyData messages.
+      </para>
+      </listitem>
+      </varlistentry>
+      <varlistentry>
+      <term>
+          Keepalive (B)
+      </term>
+      <listitem>
+      <para>
+      <variablelist>
+      <varlistentry>
+      <term>
+          Byte1('k')
+      </term>
+      <listitem>
+      <para>
+          Identifies the message as a keepalive.
+      </para>
+      </listitem>
+      </varlistentry>
+      <varlistentry>
+      <term>
+          TimestampTz
+      </term>
+      <listitem>
+      <para>
+          The current timestamp on the primary server when the
+          keepalive was sent.
+      </para>
+      </listitem>
+      </varlistentry>
+      </variablelist>
+      </para>
+      <para>
+       If <varname>wal_level</varname> is set to <literal>hot_standby</literal>
+       then a keepalive is sent once per <varname>wal_sender_delay</varname>.
+       The keepalive is sent after WAL data has been sent, if any.
+      </para>
+      </listitem>
+      </varlistentry>
+      </variablelist>
+      </para>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 607d57e..ee383af 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5566,6 +5566,17 @@ GetLatestXLogTime(void)
 	return recoveryLastXTime;
 }
 
+void
+SetLatestXLogTime(TimestampTz newLastXTime)
+{
+	/* use volatile pointer to prevent code rearrangement */
+	volatile XLogCtlData *xlogctl = XLogCtl;
+
+	SpinLockAcquire(&xlogctl->info_lck);
+	xlogctl->recoveryLastXTime = newLastXTime;
+	SpinLockRelease(&xlogctl->info_lck);
+}
+
 /*
  * Note that text field supplied is a parameter name and does not require
  * translation
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index bb87a06..8d52c3f 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -407,6 +407,22 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 				XLogWalRcvWrite(buf, len, recptr);
 				break;
 			}
+		case 'k':				/* keepalive */
+			{
+				TimestampTz keepalive;
+
+				if (len != sizeof(TimestampTz))
+					ereport(ERROR,
+							(errcode(ERRCODE_PROTOCOL_VIOLATION),
+							 errmsg_internal("invalid keepalive message received from primary")));
+
+				memcpy(&keepalive, buf, sizeof(TimestampTz));
+				buf += sizeof(TimestampTz);
+				len -= sizeof(TimestampTz);
+
+				SetLatestXLogTime(keepalive);
+				break;
+			}
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 1c04fc3..f2f8750 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -98,6 +98,7 @@ static void WalSndQuickDieHandler(SIGNAL_ARGS);
 static int	WalSndLoop(void);
 static void InitWalSnd(void);
 static void WalSndHandshake(void);
+static bool WalSndKeepAlive(void);
 static void WalSndKill(int code, Datum arg);
 static void XLogRead(char *buf, XLogRecPtr recptr, Size nbytes);
 static bool XLogSend(StringInfo outMsg);
@@ -314,6 +315,30 @@ WalSndHandshake(void)
 	}
 }
 
+static bool
+WalSndKeepAlive(void)
+{
+	StringInfoData outMsg;
+	TimestampTz ts;
+
+	if (!XLogStandbyInfoActive())
+		return true;
+
+	initStringInfo(&outMsg);
+	ts = GetCurrentTimestamp();
+
+	/* format the keepalive message */
+	pq_sendbyte(&outMsg, 'k');
+	pq_sendbytes(&outMsg, (char *) &ts, sizeof(TimestampTz));
+
+	/* send the CopyData message */
+	pq_putmessage('d', outMsg.data, outMsg.len);
+	if (pq_flush())
+		return false;
+
+	return true;
+}
+
 /*
  * Check if the remote end has closed the connection.
  */
@@ -428,6 +453,9 @@ WalSndLoop(void)
 		/* Attempt to send the log once every loop */
 		if (!XLogSend(&output_message))
 			goto eof;
+
+		if (!WalSndKeepAlive())
+			goto eof;
 	}
 
 	/* can't get here because the above loop never exits */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 8ff68a6..2d01670 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -284,6 +284,7 @@ extern void issue_xlog_fsync(int fd, uint32 log, uint32 seg);
 extern bool RecoveryInProgress(void);
 extern bool XLogInsertAllowed(void);
 extern TimestampTz GetLatestXLogTime(void);
+extern void SetLatestXLogTime(TimestampTz newLatestXLogTime);
 extern void UpdateControlFile(void);
 extern uint64 GetSystemIdentifier(void);