Re: [HACKERS] Keepalive for max_standby_delay
Heikki Linnakangas writes:
> On 03/07/10 18:32, Tom Lane wrote:
>> That would not do what you want at all in the case where you're
>> recovering from archive --- XLogReceiptTime would never advance
>> at all for the duration of the recovery.

> Do you mean when using something like pg_standby, which does the waiting
> itself?

No, I'm thinking about recovery starting from an old base backup, or any
situation where you're trying to process a significant number of archived
WAL segments as quickly as possible.

>> It might be useful if you knew that it was a standby-with-log-shipping
>> situation, but we have no way to tell the difference.

> With pg_standby etc. you use standby_mode=off. Same with traditional
> archive recovery. In standby mode, it's on.

Uh, no, that parameter is not what I'm talking about. What I tried to say
is that if you're using log shipping for replication instead of
walsender/walreceiver, you might want to treat data as live even though
the database thinks it is being pulled "from archive". But we'd need a
way for the database to tell the cases apart ... right now it cannot.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On 03/07/10 18:32, Tom Lane wrote:
> Heikki Linnakangas writes:
>> It would seem logical to use the same logic for archive recovery as we
>> do for streaming replication, and only set XLogReceiptTime when you
>> have to wait for a WAL segment to arrive into the archive, ie. when
>> restore_command fails.
>
> That would not do what you want at all in the case where you're
> recovering from archive --- XLogReceiptTime would never advance at all
> for the duration of the recovery.

Do you mean when using something like pg_standby, which does the waiting
itself?

> It might be useful if you knew that it was a standby-with-log-shipping
> situation, but we have no way to tell the difference.

With pg_standby etc. you use standby_mode=off. Same with traditional
archive recovery. In standby mode, it's on.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Keepalive for max_standby_delay
Heikki Linnakangas writes:
> It would seem logical to use the same logic for archive recovery as we
> do for streaming replication, and only set XLogReceiptTime when you have
> to wait for a WAL segment to arrive into the archive, ie. when
> restore_command fails.

That would not do what you want at all in the case where you're
recovering from archive --- XLogReceiptTime would never advance at all
for the duration of the recovery.

It might be useful if you knew that it was a standby-with-log-shipping
situation, but we have no way to tell the difference.

The restore_command might know the difference, though. Is it worth
modifying its API somehow so that the command could tell us whether to
advance XLogReceiptTime? How would we do that?

regards, tom lane
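For context on why the restore_command API would need extending: today the command is an opaque shell invocation that communicates back to the server only through its exit status. A typical recovery.conf entry looks like this (the archive path is illustrative):

```
# recovery.conf on the standby
# %f = requested WAL file name, %p = destination path
restore_command = 'cp /mnt/server/archivedir/%f "%p"'
```

There is no existing channel for the command to report anything richer than success or failure, which is why signalling "advance XLogReceiptTime" (or not) would require an API change along the lines Tom is wondering about.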
Re: [HACKERS] Keepalive for max_standby_delay
On 02/07/10 23:36, Tom Lane wrote:
> Robert Haas writes:
>> I haven't been able to wrap my head around why the delay should be
>> LESS in the archive case than in the streaming case. Can you attempt
>> to hit me with the clue-by-four?
>
> In the archive case, you're presumably trying to catch up, and so it
> makes sense to kill queries faster so you can catch up.
>
> The existing code essentially forces instant kill when reading from
> archive, for any reasonable value of max_standby_delay (because the
> archived timestamps will probably be older than that). That's
> overenthusiastic in my view, but you can get that behavior if you want
> it with this patch by setting max_standby_archive_delay to zero. If you
> don't want it, you can use something larger. If you don't think that
> max_standby_archive_delay should be less than
> max_standby_streaming_delay, you can set them the same, too.

It would seem logical to use the same logic for archive recovery as we do
for streaming replication, and only set XLogReceiptTime when you have to
wait for a WAL segment to arrive into the archive, ie. when
restore_command fails.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Keepalive for max_standby_delay
On Fri, Jul 2, 2010 at 4:52 PM, Tom Lane wrote:
> Robert Haas writes:
>> On Fri, Jul 2, 2010 at 4:36 PM, Tom Lane wrote:
>>> In the archive case, you're presumably trying to catch up, and so it
>>> makes sense to kill queries faster so you can catch up.
>
>> On the flip side, the timeout for the WAL segment is for 16MB of WAL,
>> whereas the timeout for SR is normally going to be for a much smaller
>> chunk (right?). So even with the same value for both, it seems like
>> queries will be killed more aggressively during archive recovery.
>
> True, but I suspect useful settings will be significantly larger than
> those times anyway, so it kind of comes out in the wash. For
> walreceiver the expected time to process each new chunk is less than
> wal_sender_delay (if it's not, you better get a faster slave machine).
> For the archive case, it probably takes more than 200ms to process a
> 16MB segment, but still only a few seconds. So if you have both the
> max-delay parameters set to 30 seconds, the minimum "normal" grace
> periods are 29.8 sec vs maybe 25 sec, not all that different.

I guess I was thinking more in terms of conflicts-per-WAL-record than
actual replay time. If we assume that every 100th WAL record produces a
conflict, SR will pause every once in a while, and then catch up (either
because it canceled some queries or because they finished within the
allowable grace period), whereas archive recovery will be patient for a
bit and then start nuking 'em from orbit. Maybe my mental model is all
wrong though.

Anyway, I don't object to having two separate settings, and maybe we'll
know more about how to tune them after this gets out in the field.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: [HACKERS] Keepalive for max_standby_delay
Robert Haas writes:
> On Fri, Jul 2, 2010 at 4:36 PM, Tom Lane wrote:
>> In the archive case, you're presumably trying to catch up, and so it
>> makes sense to kill queries faster so you can catch up.

> On the flip side, the timeout for the WAL segment is for 16MB of WAL,
> whereas the timeout for SR is normally going to be for a much smaller
> chunk (right?). So even with the same value for both, it seems like
> queries will be killed more aggressively during archive recovery.

True, but I suspect useful settings will be significantly larger than
those times anyway, so it kind of comes out in the wash. For walreceiver
the expected time to process each new chunk is less than wal_sender_delay
(if it's not, you better get a faster slave machine). For the archive
case, it probably takes more than 200ms to process a 16MB segment, but
still only a few seconds. So if you have both the max-delay parameters
set to 30 seconds, the minimum "normal" grace periods are 29.8 sec vs
maybe 25 sec, not all that different.

regards, tom lane
Re: [HACKERS] Keepalive for max_standby_delay
On Fri, Jul 2, 2010 at 4:36 PM, Tom Lane wrote:
> Robert Haas writes:
>> I haven't been able to wrap my head around why the delay should be
>> LESS in the archive case than in the streaming case. Can you attempt
>> to hit me with the clue-by-four?
>
> In the archive case, you're presumably trying to catch up, and so it
> makes sense to kill queries faster so you can catch up.

On the flip side, the timeout for the WAL segment is for 16MB of WAL,
whereas the timeout for SR is normally going to be for a much smaller
chunk (right?). So even with the same value for both, it seems like
queries will be killed more aggressively during archive recovery. Even
so, it seems useful to have both.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: [HACKERS] Keepalive for max_standby_delay
Robert Haas writes:
> I haven't been able to wrap my head around why the delay should be
> LESS in the archive case than in the streaming case. Can you attempt
> to hit me with the clue-by-four?

In the archive case, you're presumably trying to catch up, and so it
makes sense to kill queries faster so you can catch up.

The existing code essentially forces instant kill when reading from
archive, for any reasonable value of max_standby_delay (because the
archived timestamps will probably be older than that). That's
overenthusiastic in my view, but you can get that behavior if you want it
with this patch by setting max_standby_archive_delay to zero. If you
don't want it, you can use something larger. If you don't think that
max_standby_archive_delay should be less than max_standby_streaming_delay,
you can set them the same, too.

regards, tom lane
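The two choices Tom describes would look something like this in postgresql.conf (parameter names are from the patch under discussion; values are illustrative):

```
# Reproduce the old instant-kill behavior while replaying from archive,
# but keep a real grace period under streaming replication:
max_standby_archive_delay = 0
max_standby_streaming_delay = 30s

# Or, if you see no reason to treat the two cases differently,
# simply set them the same:
# max_standby_archive_delay = 30s
# max_standby_streaming_delay = 30s
```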
Re: [HACKERS] Keepalive for max_standby_delay
On Fri, Jul 2, 2010 at 4:11 PM, Tom Lane wrote:
> [ Apologies for the very slow turnaround on this --- I got hit with
> another batch of non-postgres security issues this week. ]
>
> Attached is a draft patch for revising the max_standby_delay behavior
> into something I think is a bit saner. There is some unfinished
> business:
>
> * I haven't touched the documentation yet
>
> * The code in xlog.c needs to be reverted to its former behavior so that
> recoveryLastXTime means what it's supposed to mean, ie just the last
> commit/abort timestamp.
>
> However neither of these affects the testability of the patch.
>
> Basically the way that it works is that the standby delay is computed
> with reference to XLogReceiptTime rather than any timestamp obtained
> from WAL. XLogReceiptTime is set to current time whenever we obtain a
> WAL segment from the archives or when we begin processing a fresh batch
> of WAL from walreceiver. There's a subtlety in the streaming case: we
> don't advance XLogReceiptTime if we are not caught up, that is if the
> startup process is more than one flush cycle behind walreceiver. In the
> normal case we'll advance XLogReceiptTime on each flush cycle, but once
> we start falling behind, it doesn't move so the grace time allotted to
> conflicting queries begins to decrease.
>
> I split max_standby_delay into two GUC variables, as previously
> threatened: max_standby_archive_delay and max_standby_streaming_delay.
> The former applies when processing data read from archive, the latter
> when processing data received from walreceiver. I think this is really
> quite important given the new behavior, because
> max_standby_archive_delay ought to be set with reference to the expected
> time to process one WAL segment, whereas max_standby_streaming_delay
> doesn't depend on that value at all. I'm not sure what good defaults are
> for these values, so I left them at 30 seconds for the moment. I'm
> inclined to think max_standby_archive_delay ought to be quite a bit less
> though.
>
> It might be worth adding a minimum-grace-time limit as we previously
> discussed, but this patch doesn't attempt to do that.
>
> Comments?

I haven't been able to wrap my head around why the delay should be LESS
in the archive case than in the streaming case. Can you attempt to hit me
with the clue-by-four?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: [HACKERS] Keepalive for max_standby_delay
[ Apologies for the very slow turnaround on this --- I got hit with
another batch of non-postgres security issues this week. ]

Attached is a draft patch for revising the max_standby_delay behavior
into something I think is a bit saner. There is some unfinished business:

* I haven't touched the documentation yet

* The code in xlog.c needs to be reverted to its former behavior so that
recoveryLastXTime means what it's supposed to mean, ie just the last
commit/abort timestamp.

However neither of these affects the testability of the patch.

Basically the way that it works is that the standby delay is computed
with reference to XLogReceiptTime rather than any timestamp obtained from
WAL. XLogReceiptTime is set to current time whenever we obtain a WAL
segment from the archives or when we begin processing a fresh batch of
WAL from walreceiver. There's a subtlety in the streaming case: we don't
advance XLogReceiptTime if we are not caught up, that is if the startup
process is more than one flush cycle behind walreceiver. In the normal
case we'll advance XLogReceiptTime on each flush cycle, but once we start
falling behind, it doesn't move so the grace time allotted to conflicting
queries begins to decrease.

I split max_standby_delay into two GUC variables, as previously
threatened: max_standby_archive_delay and max_standby_streaming_delay.
The former applies when processing data read from archive, the latter
when processing data received from walreceiver. I think this is really
quite important given the new behavior, because max_standby_archive_delay
ought to be set with reference to the expected time to process one WAL
segment, whereas max_standby_streaming_delay doesn't depend on that value
at all. I'm not sure what good defaults are for these values, so I left
them at 30 seconds for the moment. I'm inclined to think
max_standby_archive_delay ought to be quite a bit less though.
It might be worth adding a minimum-grace-time limit as we previously
discussed, but this patch doesn't attempt to do that.

Comments?

regards, tom lane

Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.427
diff -c -r1.427 xlog.c
*** src/backend/access/transam/xlog.c	28 Jun 2010 19:46:19 -	1.427
--- src/backend/access/transam/xlog.c	2 Jul 2010 20:01:24 -
***************
*** 72,78 ****
  bool		XLogArchiveMode = false;
  char	   *XLogArchiveCommand = NULL;
  bool		EnableHotStandby = false;
- int		MaxStandbyDelay = 30 * 1000;
  bool		fullPageWrites = true;
  bool		log_checkpoints = false;
  int		sync_method = DEFAULT_SYNC_METHOD;
--- 72,77 ----
***************
*** 487,493 ****
   * Keeps track of which sources we've tried to read the current WAL
   * record from and failed.
   */
! static int	failedSources = 0;
  
  /* Buffer for currently read page (XLOG_BLCKSZ bytes) */
  static char *readBuf = NULL;
--- 487,502 ----
   * Keeps track of which sources we've tried to read the current WAL
   * record from and failed.
   */
! static int	failedSources = 0;	/* OR of XLOG_FROM_* codes */
! 
! /*
!  * These variables track when we last obtained some WAL data to process,
!  * and where we got it from.  (XLogReceiptSource is initially the same as
!  * readSource, but readSource gets reset to zero when we don't have data
!  * to process right now.)
!  */
! static TimestampTz XLogReceiptTime = 0;
! static int	XLogReceiptSource = 0;	/* XLOG_FROM_* code */
  
  /* Buffer for currently read page (XLOG_BLCKSZ bytes) */
  static char *readBuf = NULL;
***************
*** 2626,2632 ****
   * Open a logfile segment for reading (during recovery).
   *
   * If source = XLOG_FROM_ARCHIVE, the segment is retrieved from archive.
!  * If source = XLOG_FROM_PG_XLOG, it's read from pg_xlog.
   */
  static int
  XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
--- 2635,2641 ----
   * Open a logfile segment for reading (during recovery).
   *
   * If source = XLOG_FROM_ARCHIVE, the segment is retrieved from archive.
!  * Otherwise, it's assumed to be already available in pg_xlog.
   */
  static int
  XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
***************
*** 2655,2660 ****
--- 2664,2670 ----
  			break;
  		case XLOG_FROM_PG_XLOG:
+ 		case XLOG_FROM_STREAM:
  			XLogFilePath(path, tli, log, seg);
  			restoredFromArchive = false;
  			break;
***************
*** 2674,2680 ****
--- 2684,2696 ----
  				 xlogfname);
  		set_ps_display(activitymsg, false);
  
+ 		/* Track source of data in assorted state variables */
  		readSource = source;
+ 		XLogReceiptSource = source;
+ 		/* In FROM_STREAM case, caller tracks receipt time, not me */
+ 		if (source != XLOG_FROM_STREAM)
+ 			XLogReceiptTime = GetCurrentTimestamp();
+ 
  		return fd;
  	}
  	if (errno != ENOENT || !notfoundOk)	/* unexpected failure? */
***************
*** 5568,5574 ****
  /*
Re: [HACKERS] Keepalive for max_standby_delay
Simon Riggs wrote:
> On Mon, 2010-06-28 at 10:09 -0700, Josh Berkus wrote:
>>> It will get done. It is not the very first thing on my to-do list.
>>
>> ??? What is then?
>>
>> If it's not the first thing on your priority list, with 9.0 getting
>> later by the day, maybe we should leave it to Robert and Simon, who
>> *do* seem to have it first on *their* list?
>>
>> I swear, when Simon was keeping his branch to himself in August
>> everyone was on his case. It sure seems like Tom is doing exactly the
>> same thing.
>
> Hmmm, yes, looks that way. At that time I was actively working on the
> code, not just locking it to prevent other activity.
>
> The only urgency on my part here was to fulfil my responsibility to the
> project.

Simon, you have a very legitimate concern. I phoned Tom and he is
planning to start working on the max_standby_delay tomorrow. I am unclear
how it is different from your version, but I hope once Tom is done we can
review his work and decide how to proceed. The fact that we allowed Tom
this huge amount of time to submit an alternative patch is unusual and
hopefully rare.

FYI, Tom and I are hoping to work through all the outstanding issues
before we package up 9.0 beta3 on Thursday, July 8.

--
Bruce Momjian  http://momjian.us
EnterpriseDB   http://enterprisedb.com

+ None of us is going to be here forever. +
Re: [HACKERS] Keepalive for max_standby_delay
On Mon, 2010-06-28 at 10:09 -0700, Josh Berkus wrote:
>> It will get done. It is not the very first thing on my to-do list.
>
> ??? What is then?
>
> If it's not the first thing on your priority list, with 9.0 getting
> later by the day, maybe we should leave it to Robert and Simon, who *do*
> seem to have it first on *their* list?
>
> I swear, when Simon was keeping his branch to himself in August everyone
> was on his case. It sure seems like Tom is doing exactly the same thing.

Hmmm, yes, looks that way. At that time I was actively working on the
code, not just locking it to prevent other activity.

The only urgency on my part here was to fulfil my responsibility to the
project.

--
Simon Riggs
www.2ndQuadrant.com
Re: [HACKERS] Keepalive for max_standby_delay
On Mon, Jun 28, 2010 at 2:26 PM, Tom Lane wrote:
> Robert Haas writes:
>> ... It is even more unreasonable to commit to
>> providing a timely patch (twice) and then fail to do so. We are
>> trying to finalize a release here, and you've made it clear you think
>> this code needs revision before then. I respect your opinion, but not
>> your right to make the project release timetable dependent on your own
>> schedule, and not your right to shut other people out of working on
>> the issues you've raised.
>
> Since nobody has put forward a proposed beta3 release date, I don't feel
> that I'm holding anything up. In the meantime, I have many
> responsibilities and am facing Red Hat internal deadlines.

See here, last paragraph:

http://archives.postgresql.org/pgsql-hackers/2010-06/msg01093.php

On a related note, this list is getting pretty darn short:

http://wiki.postgresql.org/wiki/PostgreSQL_9.0_Open_Items

...partly because, in my desire to get another beta out, I have been
devoting a lot of time to clearing it out.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: [HACKERS] Keepalive for max_standby_delay
Robert Haas writes:
> ... It is even more unreasonable to commit to
> providing a timely patch (twice) and then fail to do so. We are
> trying to finalize a release here, and you've made it clear you think
> this code needs revision before then. I respect your opinion, but not
> your right to make the project release timetable dependent on your own
> schedule, and not your right to shut other people out of working on
> the issues you've raised.

Since nobody has put forward a proposed beta3 release date, I don't feel
that I'm holding anything up. In the meantime, I have many
responsibilities and am facing Red Hat internal deadlines.

regards, tom lane
Re: [HACKERS] Keepalive for max_standby_delay
On Mon, Jun 28, 2010 at 10:19 AM, Tom Lane wrote:
> Simon Riggs writes:
>> On Wed, 2010-06-16 at 21:56 -0400, Tom Lane wrote:
>>> Sorry, I've been a bit distracted by other responsibilities (libtiff
>>> security issues for Red Hat, if you must know). I'll get on it
>>> shortly.
>
>> I don't think the PostgreSQL project should wait any longer on this. If
>> it does we risk loss of quality in final release, assuming no slippage.
>
> It will get done. It is not the very first thing on my to-do list.

That's apparent. On June 9th, you wrote "Yes, I'll get with it ..."; on
June 16th, you wrote "I'll get on it shortly." Two weeks later you're
backing off from "shortly" to "eventually". It is unfair to the community
to assert a vigorous opinion of how something should be handled in the
code and then refuse to commit to a time frame for providing a patch. It
is even more unreasonable to commit to providing a timely patch (twice)
and then fail to do so. We are trying to finalize a release here, and
you've made it clear you think this code needs revision before then. I
respect your opinion, but not your right to make the project release
timetable dependent on your own schedule, and not your right to shut
other people out of working on the issues you've raised.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: [HACKERS] Keepalive for max_standby_delay
> It will get done. It is not the very first thing on my to-do list.

??? What is then?

If it's not the first thing on your priority list, with 9.0 getting later
by the day, maybe we should leave it to Robert and Simon, who *do* seem
to have it first on *their* list?

I swear, when Simon was keeping his branch to himself in August everyone
was on his case. It sure seems like Tom is doing exactly the same thing.

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
Re: [HACKERS] Keepalive for max_standby_delay
Simon Riggs writes:
> On Wed, 2010-06-16 at 21:56 -0400, Tom Lane wrote:
>> Sorry, I've been a bit distracted by other responsibilities (libtiff
>> security issues for Red Hat, if you must know). I'll get on it shortly.

> I don't think the PostgreSQL project should wait any longer on this. If
> it does we risk loss of quality in final release, assuming no slippage.

It will get done. It is not the very first thing on my to-do list.

regards, tom lane
Re: [HACKERS] Keepalive for max_standby_delay
On Mon, Jun 28, 2010 at 3:17 AM, Simon Riggs wrote:
> On Wed, 2010-06-16 at 21:56 -0400, Tom Lane wrote:
>> Robert Haas writes:
>>> On Wed, Jun 9, 2010 at 8:01 PM, Tom Lane wrote:
>>>> Yes, I'll get with it ...
>>
>>> Any update on this?
>>
>> Sorry, I've been a bit distracted by other responsibilities (libtiff
>> security issues for Red Hat, if you must know). I'll get on it shortly.
>
> I don't think the PostgreSQL project should wait any longer on this. If
> it does we risk loss of quality in final release, assuming no slippage.

I agree, and had actually been intending to post a similar message in the
next day or two. We've been waiting for this for nearly a month.

> From here, I will rework my patch of 31 May to
> * use arrival time on standby as base for max_standby_delay

Assuming that by this you mean, in the case of SR, the time of receipt of
the current WAL chunk, and in the case of the archive, the time of its
acquisition from the archive, +1.

> * make delay apply to both streaming and file cases

+1.

> * min_standby_grace_period - min grace on every query, default 0

I could go either way on this. +0.5, I guess.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: [HACKERS] Keepalive for max_standby_delay
On Wed, 2010-06-16 at 21:56 -0400, Tom Lane wrote:
> Robert Haas writes:
>> On Wed, Jun 9, 2010 at 8:01 PM, Tom Lane wrote:
>>> Yes, I'll get with it ...
>
>> Any update on this?
>
> Sorry, I've been a bit distracted by other responsibilities (libtiff
> security issues for Red Hat, if you must know). I'll get on it shortly.

I don't think the PostgreSQL project should wait any longer on this. If
it does we risk loss of quality in final release, assuming no slippage.

From here, I will rework my patch of 31 May to
* use arrival time on standby as base for max_standby_delay
* make delay apply to both streaming and file cases
* min_standby_grace_period - min grace on every query, default 0

Decision time, so thoughts please?

--
Simon Riggs
www.2ndQuadrant.com
Re: [HACKERS] Keepalive for max_standby_delay
On Mon, Jun 21, 2010 at 12:20 AM, Ron Mayer wrote:
> Robert Haas wrote:
>> On Wed, Jun 16, 2010 at 9:56 PM, Tom Lane wrote:
>>> Sorry, I've been a bit distracted by other responsibilities (libtiff
>>> security issues for Red Hat, if you must know). I'll get on it
>>> shortly.
>>
>> What? You have other things to do besides hack on PostgreSQL?
>> Shocking! :-)
>
> I suspect you're kidding, but in case some on the list didn't realize,
> Tom's probably as famous (if not moreso) in the image compression
> community as he is in the database community:
>
> http://www.jpeg.org/jpeg/index.html
> "Probably the largest and most important contribution however was the
> work of the Independent JPEG Group (IJG), and Tom Lane in particular."
>
> http://www.w3.org/TR/PNG-Credits.html , http://www.w3.org/TR/PNG/
> "PNG (Portable Network Graphics) Specification
> Version 1.0
> ...
> Contributing Editor
> Tom Lane, t...@sss.pgh.pa.us"
>
> http://www.fileformat.info/format/tiff/egff.htm
> "... by Dr. Tom Lane of the Independent JPEG Group, a member of the
> TIFF Advisory Committee"

Yes, I was joking, hence the smiley. I did know he was involved in the
above, although I confess I didn't know to what degree... or that he had
a doctorate.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: [HACKERS] Keepalive for max_standby_delay
Robert Haas wrote:
> On Wed, Jun 16, 2010 at 9:56 PM, Tom Lane wrote:
>> Sorry, I've been a bit distracted by other responsibilities (libtiff
>> security issues for Red Hat, if you must know). I'll get on it shortly.
>
> What? You have other things to do besides hack on PostgreSQL?
> Shocking! :-)

I suspect you're kidding, but in case some on the list didn't realize,
Tom's probably as famous (if not moreso) in the image compression
community as he is in the database community:

http://www.jpeg.org/jpeg/index.html
"Probably the largest and most important contribution however was the
work of the Independent JPEG Group (IJG), and Tom Lane in particular."

http://www.w3.org/TR/PNG-Credits.html , http://www.w3.org/TR/PNG/
"PNG (Portable Network Graphics) Specification
Version 1.0
...
Contributing Editor
Tom Lane, t...@sss.pgh.pa.us"

http://www.fileformat.info/format/tiff/egff.htm
"... by Dr. Tom Lane of the Independent JPEG Group, a member of the
TIFF Advisory Committee"
Re: [HACKERS] Keepalive for max_standby_delay
On Wed, Jun 16, 2010 at 9:56 PM, Tom Lane wrote:
> Robert Haas writes:
>> On Wed, Jun 9, 2010 at 8:01 PM, Tom Lane wrote:
>>> Yes, I'll get with it ...
>
>> Any update on this?
>
> Sorry, I've been a bit distracted by other responsibilities (libtiff
> security issues for Red Hat, if you must know). I'll get on it shortly.

What? You have other things to do besides hack on PostgreSQL?
Shocking! :-)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: [HACKERS] Keepalive for max_standby_delay
Robert Haas writes:
> On Wed, Jun 9, 2010 at 8:01 PM, Tom Lane wrote:
>> Yes, I'll get with it ...

> Any update on this?

Sorry, I've been a bit distracted by other responsibilities (libtiff
security issues for Red Hat, if you must know). I'll get on it shortly.

regards, tom lane
Re: [HACKERS] Keepalive for max_standby_delay
On Wed, Jun 9, 2010 at 8:01 PM, Tom Lane wrote:
> Simon Riggs writes:
>> On Thu, 2010-06-03 at 19:02 -0400, Tom Lane wrote:
>>> I decided there wasn't time to get anything useful done on it before
>>> the beta2 deadline (which is, more or less, right now). I will take
>>> another look over the next few days.
>
>> We all really need you to fix up max_standby_delay, or, let me do it.
>
> Yes, I'll get with it ...

Tom,

Any update on this?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: [HACKERS] Keepalive for max_standby_delay
On Thu, 2010-06-03 at 19:02 -0400, Tom Lane wrote:
> Simon Riggs writes:
>> On Thu, 2010-06-03 at 18:18 +0100, Simon Riggs wrote:
>>> Are you planning to work on these things now as you said?
>
>> Are you? Or do you want me to?
>
> I decided there wasn't time to get anything useful done on it before the
> beta2 deadline (which is, more or less, right now). I will take another
> look over the next few days.

Esteemed colleague Tom,

We all really need you to fix up max_standby_delay, or, let me do it.

It appears we both agree that more testing would be beneficial. The only
way we are going to get that is by finishing this off and declaring a new
HS-beta.

Please let me know how you wish to proceed. (No answer means I do it.)

--
Simon Riggs
www.2ndQuadrant.com
Re: [HACKERS] Keepalive for max_standby_delay
Simon Riggs writes: > On Thu, 2010-06-03 at 19:02 -0400, Tom Lane wrote: >> I decided there wasn't time to get anything useful done on it before the >> beta2 deadline (which is, more or less, right now). I will take another >> look over the next few days. > We all really need you to fix up max_standby_delay, or, let me do it. Yes, I'll get with it ... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
Simon Riggs writes: > On Thu, 2010-06-03 at 18:18 +0100, Simon Riggs wrote: >> Are you planning to work on these things now as you said? > Are you? Or do you want me to? I decided there wasn't time to get anything useful done on it before the beta2 deadline (which is, more or less, right now). I will take another look over the next few days. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On Thu, 2010-06-03 at 18:18 +0100, Simon Riggs wrote: > Are you planning to work on these things now as you said? Are you? Or do you want me to? -- Simon Riggs www.2ndQuadrant.com
Re: [HACKERS] Keepalive for max_standby_delay
Greg Stark writes: > On Thu, Jun 3, 2010 at 12:11 AM, Tom Lane wrote: >> It is off-base. The receiver does not "request" data, the sender is >> what determines how much WAL is sent when. > Hm, so what happens if the slave blocks, doesn't the sender block when > the kernel buffers fill up? Well, if the slave can't keep up, that's a separate problem. It will not fail to keep up as a result of the transmission mechanism. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On Thu, Jun 3, 2010 at 4:18 PM, Tom Lane wrote: > Greg Stark writes: >> On Thu, Jun 3, 2010 at 12:11 AM, Tom Lane wrote: >>> It is off-base. The receiver does not "request" data, the sender is >>> what determines how much WAL is sent when. > >> Hm, so what happens if the slave blocks, doesn't the sender block when >> the kernel buffers fill up? > > Well, if the slave can't keep up, that's a separate problem. It will > not fail to keep up as a result of the transmission mechanism. No, I mean if the slave is paused due to a conflict. Does it stop reading data from the master or does it buffer it up on disk? If it stops reading it from the master then the effect is the same as if the slave stopped "requesting" data even if there's no actual request. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On Thu, 2010-06-03 at 13:32 -0400, Tom Lane wrote: > Simon Riggs writes: > > On Thu, 2010-06-03 at 12:47 -0400, Tom Lane wrote: > >> But in any case the current behavior is > >> still quite broken as regards reading stale timestamps from WAL. > > > Agreed. That wasn't the objective of this patch or a priority. > > > If you're reading from an archive, you need to set max_standby_delay > > appropriately, current recommendation is -1. > > That's not a fix; the problem is that there *is no* appropriate setting > for max_standby_delay when you're reading from archive. It is unlikely > that the values of the timestamps you're reading should be considered to > have any bearing on what grace period to allow; but nonetheless it's > desirable to be able to have a non-infinite grace time. I'm not saying that's a fix; I agree we should change that also. > Basically what I think is that we want what you called "apply" semantics > always for reading from archive (and I still think the DBA should be > able to set that grace period separately from the one that applies in > SR operation). Paying attention to the master's timestamps is only > reasonable in the SR context. Agreed. > And yes, I want the dependency on WAL timestamps to be gone completely, > not just mostly. I think that driving the delay logic off of SR receipt > time (and/or the timestamp we agreed to add to the SR protocol) is the > appropriate and sufficient way to deal with the problem. Looking at the > embedded timestamps just confuses the feedback loop behavior. I disagree with "sufficient", with good reason. Please look at the code comments and see what will happen if we don't make the correction suggested. If after that you still disagree, then do it your way and we'll wait for comments in the beta; I think there will be some, but I am tired and prone to agreement to allow this to go through. 
-- Simon Riggs www.2ndQuadrant.com
Re: [HACKERS] Keepalive for max_standby_delay
Simon Riggs writes: > On Thu, 2010-06-03 at 12:47 -0400, Tom Lane wrote: >> But in any case the current behavior is >> still quite broken as regards reading stale timestamps from WAL. > Agreed. That wasn't the objective of this patch or a priority. > If you're reading from an archive, you need to set max_standby_delay > appropriately, current recommendation is -1. That's not a fix; the problem is that there *is no* appropriate setting for max_standby_delay when you're reading from archive. It is unlikely that the values of the timestamps you're reading should be considered to have any bearing on what grace period to allow; but nonetheless it's desirable to be able to have a non-infinite grace time. Basically what I think is that we want what you called "apply" semantics always for reading from archive (and I still think the DBA should be able to set that grace period separately from the one that applies in SR operation). Paying attention to the master's timestamps is only reasonable in the SR context. And yes, I want the dependency on WAL timestamps to be gone completely, not just mostly. I think that driving the delay logic off of SR receipt time (and/or the timestamp we agreed to add to the SR protocol) is the appropriate and sufficient way to deal with the problem. Looking at the embedded timestamps just confuses the feedback loop behavior. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On Thu, 2010-06-03 at 12:47 -0400, Tom Lane wrote: > Simon Riggs writes: > > On Wed, 2010-06-02 at 13:14 -0400, Tom Lane wrote: > >> This patch seems to me to be going in fundamentally the wrong direction. > >> It's adding complexity and overhead (far more than is needed), and it's > >> failing utterly to resolve the objections that I raised to start with. > > > Having read your proposal, it seems changing from time-on-sender to > > time-on-receiver is a one line change to the patch. > > > What else are you thinking of removing, if anything? > > Basically, we need to get rid of everything that feeds timestamps from > the WAL content into the kill-delay logic. I understand your wish to do this, though it isn't always accurate to do that in the case where there is a backlog. The patch already does that *mostly* for the streaming case. The only time it does use the WAL content timestamp is when the WAL content timestamp is later than the oldest receive time, so in that case it is used as a correction. (see code comments and comments below also) > >> In particular, Simon seems to be basically refusing to do anything about > >> the complaint that the code fails unless master and standby clocks are > >> in close sync. I do not believe that this is acceptable, and since he > >> won't fix it, I guess I'll have to. > > > Syncing two servers in replication is common practice, as has been > > explained here; I'm still surprised people think otherwise. > > Doesn't affect the complaint in the least: I do not find it acceptable > to have that be *mandated* in order for our code to work sensibly. OK, accepted. > I would be OK with having something approaching what you want as a > non-default optional behavior (with a clearly-documented dependency > on having synchronized clocks). Yes, that's what I'd been thinking also. So that gives us a way forwards. 
standby_delay_base = apply (default) | send determines whether the standby_delay is calculated solely with reference to the standby server (apply) or whether times from the sending server are used (send). Use of send implies that the clocks on primary and standby are synchronised to within a useful accuracy, in which case it is usual to enforce that with time synchronisation software such as ntp. > But in any case the current behavior is > still quite broken as regards reading stale timestamps from WAL. Agreed. That wasn't the objective of this patch or a priority. If you're reading from an archive, you need to set max_standby_delay appropriately, current recommendation is -1. We can address that concern once the main streaming case is covered, or we can add that now. Are you planning to work on these things now as you said? How about I apply my patch now, then you do another set of changes afterwards to add the other items you mentioned, since that is now 2 additional parameters and related code? -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
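Simon's proposed standby_delay_base parameter could be modeled roughly like this. This is an illustrative Python sketch of the proposed semantics only, not PostgreSQL code; the function name and signature are hypothetical:

```python
def standby_delay(now, receipt_time, sender_timestamp, base="apply"):
    """Sketch of the proposed standby_delay_base semantics.

    'apply': measure delay only against the standby's own clock,
             i.e. the time the WAL was received locally.
    'send':  trust the timestamp the sending server attached, which
             only works if primary and standby clocks are kept in
             sync (e.g. with ntp).
    """
    if base == "apply":
        return now - receipt_time
    if base == "send":
        return now - sender_timestamp
    raise ValueError("unknown standby_delay_base: %r" % base)
```

With unsynchronized clocks the 'send' computation silently absorbs the clock skew, which is the crux of the objection to making that the only behavior.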
Re: [HACKERS] Keepalive for max_standby_delay
Simon Riggs writes: > On Wed, 2010-06-02 at 16:00 -0400, Tom Lane wrote: >> the current situation that query grace periods go to zero > Possibly a better way to handle this concern is to make the second > parameter: min_standby_grace_period - the minimum time a query will be > given in which to execute, even if max_standby_delay has been reached or > exceeded. > Would that more directly address your concerns? > min_standby_grace_period (ms) SIGHUP A minimum grace period seems like a good idea to me, but I think it's somewhat orthogonal to the core problem here. I think we all intuitively feel that there should be a way to dial back the grace period when a slave is "far behind" on applying WAL. The problem is first how to figure out what "far behind" means, and second how to adjust the grace period in a way that doesn't have surprising misbehaviors. A minimum grace period would prevent some of the very worst misbehaviors but it's not really addressing the core problem. regards, tom lane
Re: [HACKERS] Keepalive for max_standby_delay
Simon Riggs writes: > On Wed, 2010-06-02 at 13:14 -0400, Tom Lane wrote: >> This patch seems to me to be going in fundamentally the wrong direction. >> It's adding complexity and overhead (far more than is needed), and it's >> failing utterly to resolve the objections that I raised to start with. > Having read your proposal, it seems changing from time-on-sender to > time-on-receiver is a one line change to the patch. > What else are you thinking of removing, if anything? Basically, we need to get rid of everything that feeds timestamps from the WAL content into the kill-delay logic. >> In particular, Simon seems to be basically refusing to do anything about >> the complaint that the code fails unless master and standby clocks are >> in close sync. I do not believe that this is acceptable, and since he >> won't fix it, I guess I'll have to. > Syncing two servers in replication is common practice, as has been > explained here; I'm still surprised people think otherwise. Doesn't affect the complaint in the least: I do not find it acceptable to have that be *mandated* in order for our code to work sensibly. I would be OK with having something approaching what you want as a non-default optional behavior (with a clearly-documented dependency on having synchronized clocks). But in any case the current behavior is still quite broken as regards reading stale timestamps from WAL. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On Thu, Jun 3, 2010 at 4:34 PM, Tom Lane wrote: > The data keeps coming in and getting dumped into the slave's pg_xlog. > walsender/walreceiver are not at all tied to the slave's application > of WAL. In principle we could have the code around max_standby_delay > notice just how far behind it's gotten and adjust the delay tolerance > based on that; but I think designing a feedback loop for that is 9.1 > material. Er, no. In that case my first concern was misguided. I'm happy there's no feedback loop -- my fear was that there was and it would mean the "time received" could be decoupled from the time the wal was generated. But as you describe it then the time received might be slightly delayed from the time the wal was generated but to some constant degree -- not in a way that will be influenced by the log application being blocked on the slave. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
Greg Stark writes: >> Well, if the slave can't keep up, that's a separate problem. It will >> not fail to keep up as a result of the transmission mechanism. > No, I mean if the slave is paused due to a conflict. Does it stop > reading data from the master or does it buffer it up on disk? If it > stops reading it from the master then the effect is the same as if the > slave stopped "requesting" data even if there's no actual request. The data keeps coming in and getting dumped into the slave's pg_xlog. walsender/walreceiver are not at all tied to the slave's application of WAL. In principle we could have the code around max_standby_delay notice just how far behind it's gotten and adjust the delay tolerance based on that; but I think designing a feedback loop for that is 9.1 material. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On Thu, Jun 3, 2010 at 12:11 AM, Tom Lane wrote: > Greg Stark writes: >> I was assuming the walreceiver only requests more wal in relatively >> small chunks and only when replay has caught up and needs more data. I >> haven't actually read this code so if that assumption is wrong then >> I'm off-base. > > It is off-base. The receiver does not "request" data, the sender is > what determines how much WAL is sent when. Hm, so what happens if the slave blocks, doesn't the sender block when the kernel buffers fill up? >> So I think this isn't necessarily such a blue moon event. As I >> understand it, all it would take is a single long-running report and a >> vacuum or HOT cleanup occurring on the master. > > I think this is mostly FUD too. How often do you see vacuum blocked for > an hour now? No, that's not comparable. On the master vacuum will just ignore tuples that are still visible to live transactions. On the slave it doesn't have a choice, it sees the cleanup record and must pause recovery until those transactions finish. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On Thu, Jun 3, 2010 at 5:00 AM, Tom Lane wrote: >> I stand by my suggestion from yesterday: Let's define max_standby_delay >> as the difference between a piece of WAL becoming available in the >> standby, and applying it. > > My proposal is essentially the same as yours plus allowing the DBA to > choose different max delays for the caught-up and not-caught-up cases. > Maybe everybody will end up setting the two delays the same, but I think > we don't have enough experience to decide that for them now. Applying WAL that arrives via SR is not always a sign of being caught up, and conversely, applying WAL restored from the archive is not always a sign of having fallen behind. So isn't it unreasonable to separate the delays in order to control recovery behavior in those cases? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] Keepalive for max_standby_delay
On Thu, 2010-06-03 at 18:39 +0900, Fujii Masao wrote: > What purpose would that serve? Tom has already explained this and it makes sense to me. -- Simon Riggs www.2ndQuadrant.com
Re: [HACKERS] Keepalive for max_standby_delay
On Thu, Jun 3, 2010 at 6:07 PM, Simon Riggs wrote: > On Thu, 2010-06-03 at 17:56 +0900, Fujii Masao wrote: >> On Thu, Jun 3, 2010 at 4:41 AM, Heikki Linnakangas >> wrote: >> > I don't understand why you want to use a different delay when you're >> > restoring from archive vs. when you're streaming (what about existing WAL >> > files found in pg_xlog, BTW?). The source of WAL shouldn't make a >> > difference. >> >> Yes. The pace of a recovery has nothing to do with that of log shipping. >> So to hurry up a recovery when restoring from archive seems to be useless. > > When streaming drops for some reason we revert to scanning the archive > for files. There is clearly two modes of operation. Yes. > So it makes sense > that you might want to set different times for the parameter in each > case. What purpose would that serve? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On Thu, 2010-06-03 at 17:56 +0900, Fujii Masao wrote: > On Thu, Jun 3, 2010 at 4:41 AM, Heikki Linnakangas > wrote: > > I don't understand why you want to use a different delay when you're > > restoring from archive vs. when you're streaming (what about existing WAL > > files found in pg_xlog, BTW?). The source of WAL shouldn't make a > > difference. > > Yes. The pace of a recovery has nothing to do with that of log shipping. > So to hurry up a recovery when restoring from archive seems to be useless. When streaming drops for some reason we revert to scanning the archive for files. There are clearly two modes of operation. So it makes sense that you might want to set different times for the parameter in each case. We might think of those modes as "connected" and "degraded" modes rather than streaming and file shipping. -- Simon Riggs www.2ndQuadrant.com
Re: [HACKERS] Keepalive for max_standby_delay
On Thu, Jun 3, 2010 at 4:41 AM, Heikki Linnakangas wrote: > I don't understand why you want to use a different delay when you're > restoring from archive vs. when you're streaming (what about existing WAL > files found in pg_xlog, BTW?). The source of WAL shouldn't make a > difference. Yes. The pace of recovery has nothing to do with the pace of log shipping, so hurrying up recovery when restoring from archive seems pointless. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] Keepalive for max_standby_delay
On Wed, 2010-06-02 at 16:00 -0400, Tom Lane wrote: > the current situation that query grace periods go to zero Possibly a better way to handle this concern is to make the second parameter: min_standby_grace_period - the minimum time a query will be given in which to execute, even if max_standby_delay has been reached or exceeded. Would that more directly address your concerns? min_standby_grace_period (ms) SIGHUP -- Simon Riggs www.2ndQuadrant.com
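As a rough illustration of how such a min_standby_grace_period floor might interact with max_standby_delay (a hypothetical Python model of the decision, not actual server code):

```python
def should_cancel_query(now, conflict_start, standby_delay,
                        max_standby_delay, min_grace):
    # Cancel a conflicting standby query only if replay has fallen
    # behind its delay budget AND the query has already had its
    # guaranteed minimum grace period since the conflict began.
    over_budget = standby_delay > max_standby_delay
    had_min_grace = (now - conflict_start) >= min_grace
    return over_budget and had_min_grace
```

The floor guarantees a query is never killed instantly, even when the delay budget is already exhausted when the conflict arises.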
Re: [HACKERS] Keepalive for max_standby_delay
* Tom Lane (t...@sss.pgh.pa.us) wrote: > Greg Stark writes: > > So I think this isn't necessarily such a blue moon event. As I > > understand it, all it would take is a single long-running report and a > > vacuum or HOT cleanup occurring on the master. > > I think this is mostly FUD too. How often do you see vacuum blocked for > an hour now? It probably can happen, which is why we need to be able to > kick queries off the locks with max_standby_delay, but it's far from > common. What we're concerned about here is how long a buffer lock on a > single page is held, not how long heavyweight locks are held. The > normal hold times are measured in microseconds. Maybe I'm missing something, but I think Greg's point was that if you have a long-running query running against the standby/slave/whatever, which is holding locks on various relations to implement that report, and then a vacuum or HOT update happens on the master, the long-running report query will get killed off unless you bump max_streaming_delay up pretty high (eg: 60 mins). That being said, I'm not sure that there's really another solution. Yes, in this case, the slave can end up being an hour behind, but that's the trade-off you have to make if you want to run an hour-long query on the slave. The other answer is to make the master not update those tuples, etc, which might be possible by starting a transaction on the master which grabs things enough to prevent the vacuum/hot/etc update from happening. That may be possible manually, but it's not fun and it certainly isn't something we'll have built-in support for in 9.0. Thanks, Stephen
Re: [HACKERS] Keepalive for max_standby_delay
Greg Stark writes: > I was assuming the walreceiver only requests more wal in relatively > small chunks and only when replay has caught up and needs more data. I > haven't actually read this code so if that assumption is wrong then > I'm off-base. It is off-base. The receiver does not "request" data, the sender is what determines how much WAL is sent when. > So I think this isn't necessarily such a blue moon event. As I > understand it, all it would take is a single long-running report and a > vacuum or HOT cleanup occurring on the master. I think this is mostly FUD too. How often do you see vacuum blocked for an hour now? It probably can happen, which is why we need to be able to kick queries off the locks with max_standby_delay, but it's far from common. What we're concerned about here is how long a buffer lock on a single page is held, not how long heavyweight locks are held. The normal hold times are measured in microseconds. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On Wed, Jun 2, 2010 at 8:14 PM, Tom Lane wrote: > Indeed, but nothing we do can prevent that, if the slave is just plain > slower than the master. You have to assume that the slave is capable of > keeping up in the absence of query-caused delays, or you're hosed. I was assuming the walreceiver only requests more WAL in relatively small chunks and only when replay has caught up and needs more data. I haven't actually read this code so if that assumption is wrong then I'm off-base. But if my assumption is right then it's not merely the master running faster than the slave that can cause you to fall arbitrarily far behind. The "receive time" is delayed by how long the slave waited on a previous lock, so if we wait for a few minutes, then proceed and find we need more WAL data we'll read data from the master that it could have generated those few minutes ago. > The sticky point is that once in a blue moon you do have a conflicting > query sitting on a buffer lock for a long time, or even more likely a > series of queries keeping the WAL replay process from obtaining buffer > cleanup lock. So I think this isn't necessarily such a blue moon event. As I understand it, all it would take is a single long-running report and a vacuum or HOT cleanup occurring on the master. If I want to set max_standby_delay to 60min to allow reports to run for up to an hour then any random HOT cleanup on the master will propagate to the slave and cause a WAL stall until the transactions which have that xid in their snapshot end. Even the buffer locks could potentially be blocked for a long time if you happen to run the right kind of query (say a nested loop with the buffer in question on the outside and a large table on the inside). That's a rarer event though; is that what you were thinking of? -- greg
Re: [HACKERS] Keepalive for max_standby_delay
Heikki Linnakangas writes: > The problem with defining max_archive_delay that way is again that you > can fall behind indefinitely. See my response to Greg Stark --- I don't think this is really an issue. It's certainly far less of an issue than the current situation that query grace periods go to zero under not-at-all-extreme conditions. > I don't understand why you want to use a different delay when you're > restoring from archive vs. when you're streaming (what about existing > WAL files found in pg_xlog, BTW?). You're missing the point. I want the DBA to be able to control what happens in those two cases. In the current implementation he has no control over what happens while restoring from archive: it's going to effectively act like max_archive_delay = 0 all the time. If you're of the opinion that's good, you can set the parameter that way and be happy. I'm of the opinion you'll soon find out it isn't so good, because it'll kill standby queries too easily. > I stand by my suggestion from yesterday: Let's define max_standby_delay > as the difference between a piece of WAL becoming available in the > standby, and applying it. My proposal is essentially the same as yours plus allowing the DBA to choose different max delays for the caught-up and not-caught-up cases. Maybe everybody will end up setting the two delays the same, but I think we don't have enough experience to decide that for them now. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On 02/06/10 20:14, Tom Lane wrote:
> For realistic values of max_standby_delay ...

Hang on right there. What do you consider a realistic value for max_standby_delay? Because I'm not sure I have a grip on that myself. 5 seconds? 5 minutes? 5 hours? I can see use cases for all of those...

> What I think might be a realistic compromise is this:
>
> 1. Separate max_standby_delay into two GUCs, say "max_streaming_delay" and "max_archive_delay".
>
> 2. When applying WAL that came across SR, use max_streaming_delay and let the time measurement be current time minus time of receipt of the current WAL send chunk.
>
> 3. When applying WAL that came from archive, use max_archive_delay and let the time measurement be current time minus time of acquisition of the current WAL segment from the archive.
>
> The current code's behavior in the latter case could effectively be modeled by setting max_archive_delay to zero, but that isn't the only plausible setting. More likely DBAs would set max_archive_delay to something smaller than max_streaming_delay, but still positive so as to not kill conflicting queries instantly.

The problem with defining max_archive_delay that way is again that you can fall behind indefinitely. If you set it to 5 minutes, it means that you'll wait a maximum of 5 minutes *per WAL segment*, even if WAL is being generated faster.

I don't understand why you want to use a different delay when you're restoring from archive vs. when you're streaming (what about existing WAL files found in pg_xlog, BTW?). The source of WAL shouldn't make a difference. If it's because you assume that restoring from archive is a sign that you've fallen behind a lot, surely you've exceeded max_standby_delay then and I still don't see a need for a separate GUC.

I stand by my suggestion from yesterday: Let's define max_standby_delay as the difference between a piece of WAL becoming available in the standby, and applying it.
To approximate "piece of WAL becoming available" for SR, we can use the mechanism with send/applyChunks from Simon's latest patch, or go with the simpler scheme of just resetting a "last caughtup timestamp" to current time whenever we have to wait for new WAL to arrive. When restoring from archive, likewise reset "last caughtup timestamp" whenever restore_command returns non-0, i.e. we have to wait for the next WAL file to arrive. That works the same for both SR and file-based log shipping, there's only one knob to set, it's simple to implement, and it doesn't require synchronized clocks. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
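Heikki's "last caughtup timestamp" scheme could be sketched like this (illustrative Python with an injectable clock, not the actual implementation; the class and method names are hypothetical):

```python
import time

class StandbyDelayTracker:
    """Reset the reference point whenever the standby has to wait for
    WAL to arrive -- whether from streaming or from restore_command --
    and measure max_standby_delay against that local timestamp only."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.last_caughtup = clock()

    def waited_for_wal(self):
        # Called whenever no WAL was available and replay had to wait:
        # at that moment the standby is caught up with what has arrived.
        self.last_caughtup = self.clock()

    def standby_delay(self):
        # Value compared against max_standby_delay; it uses only the
        # standby's own clock, so no cross-server sync is needed.
        return self.clock() - self.last_caughtup
```

Because the reference point resets on every wait, the measured delay grows only while WAL is available but unapplied, which is exactly the "available in the standby vs. applied" definition above.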
Re: [HACKERS] Keepalive for max_standby_delay
Greg Stark writes: > On Wed, Jun 2, 2010 at 6:14 PM, Tom Lane wrote: >> I believe that the motivation for treating archived timestamps as live >> is, essentially, to force rapid catchup if a slave falls behind so far >> that it's reading from archive instead of SR. > Huh, I think this is the first mention of this that I've seen. I > always assumed the motivation was just that you wanted to control how > much data loss could occur on failover and how long recovery would > take. I think separating the two delays is an interesting idea but I > don't see how it counts as a simplification. Well, it isn't a simplification: it's bringing it up to the minimum complication level where it'll actually work sanely. The current implementation doesn't work sanely because it confuses stale timestamps read from WAL with real live time. > This also still allows a slave to become arbitrarily far behind the > master. Indeed, but nothing we do can prevent that, if the slave is just plain slower than the master. You have to assume that the slave is capable of keeping up in the absence of query-caused delays, or you're hosed. The real reason this is at issue is the fact that the max_standby_delay kill mechanism applies to certain buffer-level locking operations. On the master we just wait, and it's not a problem, because in practice the conflicting queries almost always release these locks pretty quick. On the slave, though, instant kill as a result of a buffer-level lock conflict would result in a very serious degradation in standby query reliability (while also doing practically nothing for the speed of WAL application, most of the time). This morning on the phone Bruce and I were seriously discussing the idea of ripping the max_standby_delay mechanism out of the buffer-level locking paths, and just let them work like they do on the master, ie, wait forever. 
If we did that then simplifying max_standby_delay to a boolean would be reasonable again (because it really would only come into play for DDL on the master). The sticky point is that once in a blue moon you do have a conflicting query sitting on a buffer lock for a long time, or even more likely a series of queries keeping the WAL replay process from obtaining buffer cleanup lock. So it seems that we have to have max_standby_delay-like logic for those locks, and also that zero grace period before kill isn't a very practical setting. However, there isn't a lot of point in obsessing over exactly how long the grace period ought to be, as long as it's more than a few milliseconds. It *isn't* going to have any real effect on whether the slave can stay caught up. You could make a fairly decent case for just measuring the grace period from when the replay process starts to wait, as I think I proposed a while back. The value of measuring delay from a receipt time is that if you do happen to have a bunch of delays within a short interval you'll become more willing to kill queries --- but I really believe that that is a corner case and will have nothing to do with ordinary performance. > I propose an alternate way out of the problem of syncing two clocks. > Instead of comparing timestamps compare time intervals. So as it reads > xlog records it only ever compares the master timestamps with previous > master timestamps to determine how much time has elapsed on the > master. It compares that time elapsed with the time elapsed on the > slave to determine if it's falling behind. I think this would just add complexity and uncertainty, to deal with something that won't be much of a problem in practice. regards, tom lane
Re: [HACKERS] Keepalive for max_standby_delay
On Wed, Jun 2, 2010 at 2:27 PM, Simon Riggs wrote: > Syncing two servers in replication is common practice, as has been > explained here; I'm still surprised people think otherwise. Measuring > the time between two servers is the very purpose of the patch, so the > synchronisation is not a design flaw, it is its raison d'etre. I think the purpose of the patch should not be to measure the time difference between servers, but rather the replication delay. While I don't necessarily agree with Tom's statement that this is must-fix, I do agree that it would be nicer if we could avoid depending on time sync. Yeah, I keep my servers time synced, too. But, shit happens. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: [HACKERS] Keepalive for max_standby_delay
On Wed, 2010-06-02 at 13:14 -0400, Tom Lane wrote: > This patch seems to me to be going in fundamentally the wrong direction. > It's adding complexity and overhead (far more than is needed), and it's > failing utterly to resolve the objections that I raised to start with. Having read your proposal, it seems changing from time-on-sender to time-on-receiver is a one-line change to the patch. What else are you thinking of removing, if anything? Adding an extra parameter obviously adds more complexity, and is something I now agree with. > In particular, Simon seems to be basically refusing to do anything about > the complaint that the code fails unless master and standby clocks are > in close sync. I do not believe that this is acceptable, and since he > won't fix it, I guess I'll have to. Syncing two servers in replication is common practice, as has been explained here; I'm still surprised people think otherwise. Measuring the time between two servers is the very purpose of the patch, so the synchronisation is not a design flaw, it is its raison d'etre. There's been a few spleens emptied on that topic, not all of them mine, and certainly no consensus on that. So I'm not refusing to do anything that's been agreed... -- Simon Riggs www.2ndQuadrant.com
Re: [HACKERS] Keepalive for max_standby_delay
On Wed, Jun 2, 2010 at 6:14 PM, Tom Lane wrote: > I believe that the motivation for treating archived timestamps as live > is, essentially, to force rapid catchup if a slave falls behind so far > that it's reading from archive instead of SR. Huh, I think this is the first mention of this that I've seen. I always assumed the motivation was just that you wanted to control how much data loss could occur on failover and how long recovery would take. I think separating the two delays is an interesting idea but I don't see how it counts as a simplification. This also still allows a slave to become arbitrarily far behind the master. The master might take a long time to fill up a log file, and if the slave is blocked for a few minutes, then requests the next bit of log file and blocks for a few more minutes, then requests the next log file and blocks for a few more minutes, etc. As long as the slave is reading more slowly than the master is filling those segments it will fall further and further behind but never be more than a few minutes behind receiving the logs. I propose an alternate way out of the problem of syncing two clocks. Instead of comparing timestamps compare time intervals. So as it reads xlog records it only ever compares the master timestamps with previous master timestamps to determine how much time has elapsed on the master. It compares that time elapsed with the time elapsed on the slave to determine if it's falling behind. I'm assuming it periodically catches up and can throw out its accumulated time elapsed and start fresh. If not then the accumulated error might be a problem, but hopefully one we can find a solution to. -- greg
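Greg's interval-comparison idea can be sketched as a tiny standalone model. This is not PostgreSQL source; the `lag_tracker` type and function names are invented for illustration. The point of the design is that a constant clock offset between master and slave cancels out, because each side's elapsed time is measured against its own clock; only clock drift accumulates, which the periodic reset bounds.

```c
#include <assert.h>

/* Sketch (invented names, not PostgreSQL source) of comparing time
 * intervals instead of absolute timestamps: track elapsed time on
 * each side since a common reference point, so a constant clock
 * offset between master and slave cancels out. */

typedef struct
{
    double master_elapsed;  /* seconds of master time covered by replayed WAL */
    double slave_elapsed;   /* seconds of slave wall-clock time spent replaying */
} lag_tracker;

/* Master side advances from timestamp deltas read out of WAL records. */
static void
lag_note_master(lag_tracker *t, double delta)
{
    t->master_elapsed += delta;
}

/* Slave side advances from the local clock. */
static void
lag_note_slave(lag_tracker *t, double delta)
{
    t->slave_elapsed += delta;
}

/* Positive result: the slave took longer than the master to cover the
 * same stretch of WAL, i.e. it is falling behind by this many seconds. */
static double
lag_behind(const lag_tracker *t)
{
    return t->slave_elapsed - t->master_elapsed;
}

/* When fully caught up, start fresh to limit accumulated drift error. */
static void
lag_reset(lag_tracker *t)
{
    t->master_elapsed = t->slave_elapsed = 0.0;
}
```

For example, replaying WAL that represents 60 seconds of master activity in 70 seconds of slave time means the slave has fallen 10 seconds further behind, regardless of what either clock reads in absolute terms.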
Re: [HACKERS] Keepalive for max_standby_delay
On Wed, Jun 2, 2010 at 2:03 PM, Simon Riggs wrote: > On Wed, 2010-06-02 at 13:45 -0400, Tom Lane wrote: >> Stephen Frost writes: >> > * Tom Lane (t...@sss.pgh.pa.us) wrote: >> >> Comments? >> >> > I'm not really a huge fan of adding another GUC, to be honest. I'm more >> > inclined to say we treat 'max_archive_delay' as '0', and turn >> > max_streaming_delay into what you've described. If we fall back so far >> > that we have to go back to reading WALs, then we need to hurry up and >> > catch-up and damn the torpedoes. >> >> If I thought that 0 were a generally acceptable value, I'd still be >> pushing the "simplify it to a boolean" agenda ;-). The problem is that >> that will sometimes kill standby queries even when they are quite short >> and doing nothing objectionable. > > OK, now I understand. I was just thinking the same as Stephen, but now I > agree we need a second parameter. I too was thinking the same as Stephen, but now I also agree we need a second parameter. I think the whole design Tom proposed seems good. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: [HACKERS] Keepalive for max_standby_delay
On Wed, 2010-06-02 at 13:45 -0400, Tom Lane wrote: > Stephen Frost writes: > > * Tom Lane (t...@sss.pgh.pa.us) wrote: > >> Comments? > > > I'm not really a huge fan of adding another GUC, to be honest. I'm more > > inclined to say we treat 'max_archive_delay' as '0', and turn > > max_streaming_delay into what you've described. If we fall back so far > > that we have to go back to reading WALs, then we need to hurry up and > > catch-up and damn the torpedoes. > > If I thought that 0 were a generally acceptable value, I'd still be > pushing the "simplify it to a boolean" agenda ;-). The problem is that > that will sometimes kill standby queries even when they are quite short > and doing nothing objectionable. OK, now I understand. I was just thinking the same as Stephen, but now I agree we need a second parameter. -- Simon Riggs www.2ndQuadrant.com
Re: [HACKERS] Keepalive for max_standby_delay
Stephen Frost writes: > * Tom Lane (t...@sss.pgh.pa.us) wrote: >> Comments? > I'm not really a huge fan of adding another GUC, to be honest. I'm more > inclined to say we treat 'max_archive_delay' as '0', and turn > max_streaming_delay into what you've described. If we fall back so far > that we have to go back to reading WALs, then we need to hurry up and > catch-up and damn the torpedoes. If I thought that 0 were a generally acceptable value, I'd still be pushing the "simplify it to a boolean" agenda ;-). The problem is that that will sometimes kill standby queries even when they are quite short and doing nothing objectionable. > I'd also prefer that we only wait the > delay time once until we're fully caught up again (and have gotten > back around to waiting for new data). The delays will be measured from a receipt instant to current time, which means that the longer it takes to apply a WAL segment or WAL send chunk, the less grace period there will be. (Which is the same as what CVS HEAD does --- I'm just arguing about where we get the start time from.) I believe this does what you suggest and more. regards, tom lane
Re: [HACKERS] Keepalive for max_standby_delay
Tom Lane wrote: I'm still inclined to apply the part of Simon's patch that adds a transmit timestamp to each SR send chunk. That would actually be completely unused by the slave given my proposal above, but I think that it is an important step to take to future-proof the SR protocol against possible changes in the slave-side timing logic. +1. From a radically different perspective, I had to do something similar in the buildfarm years ago to protect us from machines reporting with grossly inaccurate timestamps. This was part of the solution. The client adds its current timestamp setting just before transmitting the data to the server. cheers andrew
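The buildfarm-style check Andrew describes can be sketched in a few lines. This is an invented illustration, not buildfarm or PostgreSQL code: the sender stamps each message just before transmitting it, and the receiver compares that stamp with its own receipt time. The difference is clock skew plus one-way network latency, which is plenty accurate for catching grossly wrong clocks.

```c
#include <assert.h>

/* Sketch (invented names) of skew detection from a transmit timestamp:
 * the receiver subtracts the sender's stamp from its own receipt time.
 * The result is clock skew plus one-way network latency -- good enough
 * to reject grossly inaccurate clocks, though not to measure small skews. */

typedef long long usec_t;       /* microseconds since a shared epoch */

static usec_t
apparent_skew(usec_t sender_stamp, usec_t receipt_time)
{
    return receipt_time - sender_stamp;
}

/* Reject (or just flag) data whose apparent skew is out of bounds. */
static int
stamp_plausible(usec_t sender_stamp, usec_t receipt_time, usec_t max_skew)
{
    usec_t skew = apparent_skew(sender_stamp, receipt_time);

    return -max_skew <= skew && skew <= max_skew;
}
```

Stamping as late as possible before transmission, as Andrew notes, keeps queuing delays on the sender out of the measurement.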
Re: [HACKERS] Keepalive for max_standby_delay
* Tom Lane (t...@sss.pgh.pa.us) wrote: > An important property of this design is that all relevant timestamps > are taken on the slave, so clock skew isn't an issue. I agree that this is important, and I do run NTP on all my servers and even monitor it using Nagios. It's still not a cure-all for time skew issues. > Comments? I'm not really a huge fan of adding another GUC, to be honest. I'm more inclined to say we treat 'max_archive_delay' as '0', and turn max_streaming_delay into what you've described. If we fall back so far that we have to go back to reading WALs, then we need to hurry up and catch-up and damn the torpedoes. I'd also prefer that we only wait the delay time once until we're fully caught up again (and have gotten back around to waiting for new data). Thanks, Stephen
Re: [HACKERS] Keepalive for max_standby_delay
Simon Riggs writes: > OK, here's v4. I've been trying to stay out of this thread, but with beta2 approaching and no resolution in sight, I'm afraid I have to get involved. This patch seems to me to be going in fundamentally the wrong direction. It's adding complexity and overhead (far more than is needed), and it's failing utterly to resolve the objections that I raised to start with. In particular, Simon seems to be basically refusing to do anything about the complaint that the code fails unless master and standby clocks are in close sync. I do not believe that this is acceptable, and since he won't fix it, I guess I'll have to. The other basic problem that I had with the current code, which this does nothing about, is that it believes that timestamps pulled from WAL archive should be treated on the same footing as timestamps of "live" actions. That might work in certain limited scenarios but it's going to be a disaster in general. I believe that the motivation for treating archived timestamps as live is, essentially, to force rapid catchup if a slave falls behind so far that it's reading from archive instead of SR. There are certainly use-cases where that's appropriate (though, again, not all); but even when you do want it it's a pretty inflexible implementation. For realistic values of max_standby_delay the behavior is going to pretty much be "instant kill whenever we're reading from archive". I have backed off my original suggestion that we should reduce max_standby_delay to a boolean: that was based on the idea that delays would only occur in response to DDL on the master, but since vacuum and btree page splits can also trigger delays, it seems clear that a "kill queries immediately" policy isn't really very useful in practice. So we need to make max_standby_delay work rather than just get rid of it. What I think might be a realistic compromise is this: 1. Separate max_standby_delay into two GUCs, say "max_streaming_delay" and "max_archive_delay". 2. 
When applying WAL that came across SR, use max_streaming_delay and let the time measurement be current time minus time of receipt of the current WAL send chunk. 3. When applying WAL that came from archive, use max_archive_delay and let the time measurement be current time minus time of acquisition of the current WAL segment from the archive. The current code's behavior in the latter case could effectively be modeled by setting max_archive_delay to zero, but that isn't the only plausible setting. More likely DBAs would set max_archive_delay to something smaller than max_streaming_delay, but still positive so as to not kill conflicting queries instantly. An important property of this design is that all relevant timestamps are taken on the slave, so clock skew isn't an issue. I'm still inclined to apply the part of Simon's patch that adds a transmit timestamp to each SR send chunk. That would actually be completely unused by the slave given my proposal above, but I think that it is an important step to take to future-proof the SR protocol against possible changes in the slave-side timing logic. I don't however see the value of transmitting "keepalive" records when we'd otherwise not have anything to send. The value of adding timestamps to the SR protocol is really to let the slave determine the current amount of clock skew between it and the master; which is a number that should not change so rapidly that it has to be updated every 100ms even in an idle system. Comments? regards, tom lane
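The two-GUC compromise above can be sketched as a small standalone model. This is not the committed implementation; the names (`wal_source`, `conflict_grace_exceeded`) are invented. The essential points are that the limit is chosen by where the WAL being applied came from, and that both timestamps are taken on the slave, so clock skew never enters into it.

```c
#include <stdbool.h>
#include <assert.h>

/* Sketch (invented names, not the committed implementation) of the
 * two-GUC design: pick the grace-period limit by where the WAL being
 * applied came from, and measure it against a slave-side timestamp --
 * receipt of the SR send chunk, or acquisition of the segment from
 * the archive -- so master/slave clock skew is irrelevant. */

typedef enum { WAL_FROM_STREAM, WAL_FROM_ARCHIVE } wal_source;

static int max_streaming_delay = 30;    /* seconds, -1 = wait forever */
static int max_archive_delay = 5;       /* smaller, to catch up faster */

static bool
conflict_grace_exceeded(wal_source source, long receipt_time, long now)
{
    int limit = (source == WAL_FROM_STREAM) ? max_streaming_delay
                                            : max_archive_delay;

    if (limit < 0)
        return false;           /* wait forever */
    return (now - receipt_time) > limit;
}
```

Setting `max_archive_delay` to zero recovers the then-current behavior ("instant kill whenever we're reading from archive"), while a small positive value keeps catchup fast without killing every conflicting query outright.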
Re: [HACKERS] Keepalive for max_standby_delay
Simon Riggs wrote: > On Mon, 2010-05-31 at 14:40 -0400, Bruce Momjian wrote: > >> Uh, we have three days before we package 9.0beta2. It would be >> good if we could decide on the max_standby_delay issue soon. > > I've heard something from Heikki, not from anyone else. Those > comments amount to "lets replace max_standby_delay with > max_apply_delay". > > Got a beta idea? Given the incessant ticking of the clock, I have a hard time believing we have any real options besides max_standby_delay or a boolean which corresponds to the -1 and 0 settings of max_standby_delay. I think it's pretty clear that there's a use case for the positive values, although there are bound to be some who try it and are surprised by behavior at transition from idle to active. The whole debate seems to boil down to how important a middle ground is versus how damaging the surprise factor is. (I don't really buy the argument that we won't be able to remove it later if we replace it with something better.) I know there were initially some technical problems, too; have those been resolved? -Kevin
Re: [HACKERS] Keepalive for max_standby_delay
On Mon, 2010-05-31 at 14:40 -0400, Bruce Momjian wrote: > Uh, we have three days before we package 9.0beta2. It would be good if > we could decide on the max_standby_delay issue soon. I've heard something from Heikki, not from anyone else. Those comments amount to "lets replace max_standby_delay with max_apply_delay". Got a beta idea? -- Simon Riggs www.2ndQuadrant.com
Re: [HACKERS] Keepalive for max_standby_delay
Thanks for the review. On Tue, 2010-06-01 at 13:36 +0300, Heikki Linnakangas wrote: > If we really want to try to salvage max_standby_delay with a meaning > similar to what it has now, I think we should go with the idea some > people bashed around earlier and define the grace period as the > difference between a WAL record becoming available to the standby for > replay, and between replaying it. An approximation of that is to do > "lastIdle = gettimeofday()" in XLogPageRead() whenever it needs to wait > for new WAL to arrive, whether that's via streaming replication or by a > success return code from restore_command, and compare the difference of > that with current timestamp in WaitExceedsMaxStandbyDelay(). That wouldn't cope with a continuous stream of records arriving, unless you also include the second half of the patch. > That's very simple, doesn't require synchronized clocks, and works the > same with file- and stream-based setups. Nor does it provide a mechanism for monitoring of SR. standby_delay is explicitly defined in terms of the gap between two servers, so is a useful real world concept. apply_delay is somewhat less interesting. I'm sure most people would rather have monitoring and therefore the requirement for synchronised-ish clocks, than no monitoring. If you think no monitoring is OK, I don't, but there are other ways, so its not a point to fight about. > This certainly alleviates some of the problems. You still need to ensure > that master and standby have synchronized clocks, and you still get zero > grace time after a long period of inactivity when not using streaming > replication, however. Second issue can be added once we approve the rest of this if you like. > Sending a keep-alive message every 100ms seems overly aggressive to me. It's sent every wal_sender_delay. Why is that a negative? 
-- Simon Riggs www.2ndQuadrant.com
Re: [HACKERS] Keepalive for max_standby_delay
On 27/05/10 20:26, Simon Riggs wrote: On Wed, 2010-05-26 at 16:22 -0700, Josh Berkus wrote: Just this second posted about that, as it turns out. I have a v3 *almost* ready of the keepalive patch. It still makes sense to me after a few days reflection, so is worth discussion and review. In or out, I want this settled within a week. Definitely need some R&R here. Does the keepalive fix all the issues with max_standby_delay? Tom? OK, here's v4. Summary * WALSender adds a timestamp onto the header of every WAL chunk sent. * Each WAL record now has a conceptual "send timestamp" that remains constant while that record is replayed. This is used as the basis from which max_standby_delay is calculated when required during replay. * Send timestamp is calculated as the later of the timestamp of chunk in which WAL record was sent and the latest XLog time. * WALSender sends an empty message as a keepalive when nothing else to send. (No longer a special message type for the keepalive). I think its close, but if there's a gaping hole here somewhere then I'll punt for this release. This certainly alleviates some of the problems. You still need to ensure that master and standby have synchronized clocks, and you still get zero grace time after a long period of inactivity when not using streaming replication, however. Sending a keep-alive message every 100ms seems overly aggressive to me. If we really want to try to salvage max_standby_delay with a meaning similar to what it has now, I think we should go with the idea some people bashed around earlier and define the grace period as the difference between a WAL record becoming available to the standby for replay, and between replaying it. 
An approximation of that is to do "lastIdle = gettimeofday()" in XLogPageRead() whenever it needs to wait for new WAL to arrive, whether that's via streaming replication or by a success return code from restore_command, and compare the difference of that with current timestamp in WaitExceedsMaxStandbyDelay(). That's very simple, doesn't require synchronized clocks, and works the same with file- and stream-based setups. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Keepalive for max_standby_delay
Uh, we have three days before we package 9.0beta2. It would be good if we could decide on the max_standby_delay issue soon. --- Simon Riggs wrote: > On Wed, 2010-05-26 at 16:22 -0700, Josh Berkus wrote: > > > Just this second posted about that, as it turns out. > > > > > > I have a v3 *almost* ready of the keepalive patch. It still makes sense > > > to me after a few days reflection, so is worth discussion and review. In > > > or out, I want this settled within a week. Definitely need some R&R > > > here. > > > > Does the keepalive fix all the issues with max_standby_delay? Tom? > > OK, here's v4. > > Summary > > * WALSender adds a timestamp onto the header of every WAL chunk sent. > > * Each WAL record now has a conceptual "send timestamp" that remains > constant while that record is replayed. This is used as the basis from > which max_standby_delay is calculated when required during replay. > > * Send timestamp is calculated as the later of the timestamp of chunk in > which WAL record was sent and the latest XLog time. > > * WALSender sends an empty message as a keepalive when nothing else to > send. (No longer a special message type for the keepalive). > > I think its close, but if there's a gaping hole here somewhere then I'll > punt for this release. > > -- > Simon Riggs www.2ndQuadrant.com [ Attachment, skipping... ] -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + None of us is going to be here forever. +
Re: [HACKERS] Keepalive for max_standby_delay
On Wed, 2010-05-26 at 16:22 -0700, Josh Berkus wrote: > > Just this second posted about that, as it turns out. > > > > I have a v3 *almost* ready of the keepalive patch. It still makes sense > > to me after a few days reflection, so is worth discussion and review. In > > or out, I want this settled within a week. Definitely need some R&R > > here. > > Does the keepalive fix all the issues with max_standby_delay? Tom? OK, here's v4. Summary * WALSender adds a timestamp onto the header of every WAL chunk sent. * Each WAL record now has a conceptual "send timestamp" that remains constant while that record is replayed. This is used as the basis from which max_standby_delay is calculated when required during replay. * Send timestamp is calculated as the later of the timestamp of the chunk in which the WAL record was sent and the latest XLog time. * WALSender sends an empty message as a keepalive when nothing else to send. (No longer a special message type for the keepalive.) I think it's close, but if there's a gaping hole here somewhere then I'll punt for this release. -- Simon Riggs www.2ndQuadrant.com

[patch attachment: modifies doc/src/sgml/protocol.sgml, src/backend/access/transam/xlog.c, and src/backend/replication/walreceiver.c — adds a Byte8 message timestamp to the header of each CopyData WAL message (whose data portion may now be zero length, serving as a keepalive), and has walreceiver extract the chunk timestamp from the message header]
Re: [HACKERS] Keepalive for max_standby_delay
> Just this second posted about that, as it turns out. > > I have a v3 *almost* ready of the keepalive patch. It still makes sense > to me after a few days reflection, so is worth discussion and review. In > or out, I want this settled within a week. Definitely need some R&R > here. Does the keepalive fix all the issues with max_standby_delay? Tom? -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Re: [HACKERS] Keepalive for max_standby_delay
On Wed, 2010-05-26 at 15:45 -0700, Josh Berkus wrote: > > Committed with chunk size of 128 kB. I hope that's a reasonable > > compromise, in the absence of any performance test data either way. > > So where are we with max_standby_delay? Status check? Just this second posted about that, as it turns out. I have a v3 *almost* ready of the keepalive patch. It still makes sense to me after a few days reflection, so is worth discussion and review. In or out, I want this settled within a week. Definitely need some R&R here. -- Simon Riggs www.2ndQuadrant.com
Re: [HACKERS] Keepalive for max_standby_delay
On Sun, 2010-05-16 at 17:11 +0100, Simon Riggs wrote: > New version, with some other cleanup of wait processing. > > New logic is that when Startup asks for next applychunk of WAL it saves > the lastChunkTimestamp. That is then the base time used by > WaitExceedsMaxStandbyDelay(), except when latestXLogTime is later. > Since multiple receivechunks can arrive from primary before Startup asks > for next applychunk we use the oldest receivechunk timestamp, not the > latest. Doing it this way means the lastChunkTimestamp doesn't change > when new keepalives arrive, so we have a stable and reasonably accurate > recordSendTimestamp for each WAL record. > > The keepalive is sent as the first part of a new message, if any. So > partial chunks of data always have an accurate timestamp, even if that > is slightly older as a result. Doesn't make much difference except with > very large chunks. > > I think that addresses the points raised on this thread and others. Was about to post v3 after your last commit, but just found a minor bug in my v2->v3 changes, even though they were fairly light. Too tired to fix now. The general thinking underlying this patch is still the same though and is worth discussing over next few days. -- Simon Riggs www.2ndQuadrant.com
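The bookkeeping described in the quoted text above can be sketched as a standalone model. This is an invented illustration, not the patch code: several receive chunks may arrive before the startup process asks for its next apply chunk, so the base timestamp handed to the startup process is the oldest unconsumed chunk timestamp. Later keepalives therefore don't move the base time, giving each WAL record a stable send timestamp.

```c
#include <assert.h>

/* Sketch (invented names, not the patch code) of "use the oldest
 * receivechunk timestamp, not the latest": chunks queue up between
 * calls from the startup process, which always takes the oldest
 * pending stamp as its base time for max_standby_delay. */

#define MAX_PENDING 64

static long pending_stamps[MAX_PENDING];
static int  pending_count = 0;

/* walreceiver side: a chunk (possibly an empty keepalive) arrived
 * carrying this send timestamp. */
static void
chunk_received(long send_stamp)
{
    if (pending_count < MAX_PENDING)
        pending_stamps[pending_count++] = send_stamp;
}

/* startup side: consume all pending chunks, returning the oldest
 * stamp (chunks arrive in send order), or the previous base time
 * if nothing new arrived. */
static long
next_apply_chunk_stamp(long prev_stamp)
{
    long oldest = prev_stamp;

    if (pending_count > 0)
    {
        oldest = pending_stamps[0];
        pending_count = 0;
    }
    return oldest;
}
```

Taking the oldest stamp is the conservative choice: it can only shorten, never stretch, the apparent grace period for conflicting queries.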
Re: [HACKERS] Keepalive for max_standby_delay
On Thu, 2010-05-27 at 01:34 +0300, Heikki Linnakangas wrote: > Committed with chunk size of 128 kB. Thanks. I'm sure that's fine. -- Simon Riggs www.2ndQuadrant.com
Re: [HACKERS] Keepalive for max_standby_delay
> Committed with chunk size of 128 kB. I hope that's a reasonable > compromise, in the absence of any performance test data either way. So where are we with max_standby_delay? Status check? -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Re: [HACKERS] Keepalive for max_standby_delay
On 19/05/10 00:37, Simon Riggs wrote: On Tue, 2010-05-18 at 17:25 -0400, Heikki Linnakangas wrote: On 18/05/10 17:17, Simon Riggs wrote: There's no reason that the buffer size we use for XLogRead() should be the same as the send buffer, if you're worried about that. My point is that pq_putmessage contains internal flushes so at the libpq level you gain nothing by big batches. The read() will be buffered anyway with readahead so not sure what the issue is. We'll have to do this for sync rep anyway, so what's the big deal? Just do it now, once. Do we really want 9.1 code to differ here? Do what? What exactly is it that you want instead, then? Read and write smaller messages, so the latency is minimised. Libpq will send in 8192 byte packets, so writing anything larger gains nothing when the WAL data is also chunked at exactly the same size. Committed with chunk size of 128 kB. I hope that's a reasonable compromise, in the absence of any performance test data either way. I'm wary of setting it as low as 8k, as there is some per-message overhead. Some of that could be avoided by rearranging the loops so that the ps display is not updated at every message as you suggested, but I don't feel like doing any extra rearrangements at this point. It would not be hard, but it also certainly wouldn't make the code simpler. I believe in practice 128kB is just as good as 8k from the responsiveness point of view. If a standby is not responding, walsender will be stuck trying to send no matter what the block size is. If it is responding, no matter how slowly, 128kB should get transferred pretty quickly. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Keepalive for max_standby_delay
On Tue, 2010-05-18 at 17:25 -0400, Heikki Linnakangas wrote: > On 18/05/10 17:17, Simon Riggs wrote: > > There's no reason that the buffer size we use for XLogRead() should be > > the same as the send buffer, if you're worried about that. My point is > > that pq_putmessage contains internal flushes so at the libpq level you > > gain nothing by big batches. The read() will be buffered anyway with > > readahead so not sure what the issue is. We'll have to do this for sync > > rep anyway, so what's the big deal? Just do it now, once. Do we really > > want 9.1 code to differ here? > > Do what? What exactly is it that you want instead, then? Read and write smaller messages, so the latency is minimised. Libpq will send in 8192 byte packets, so writing anything larger gains nothing when the WAL data is also chunked at exactly the same size. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On 18/05/10 17:17, Simon Riggs wrote: There's no reason that the buffer size we use for XLogRead() should be the same as the send buffer, if you're worried about that. My point is that pq_putmessage contains internal flushes so at the libpq level you gain nothing by big batches. The read() will be buffered anyway with readahead so not sure what the issue is. We'll have to do this for sync rep anyway, so what's the big deal? Just do it now, once. Do we really want 9.1 code to differ here? Do what? What exactly is it that you want instead, then? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On Tue, 2010-05-18 at 17:08 -0400, Heikki Linnakangas wrote: > On 17/05/10 12:36, Jim Nasby wrote: > > On May 15, 2010, at 12:05 PM, Heikki Linnakangas wrote: > >> What exactly is the user trying to monitor? If it's "how far behind is > >> the standby", the difference between pg_current_xlog_insert_location() > >> in the master and pg_last_xlog_replay_location() in the standby seems > >> more robust and well-defined to me. It's a measure of XLOG location (ie. > >> bytes) instead of time, but time is a complicated concept. > > > > I can tell you that end users *will* want a time-based indication of how > > far behind we are. DBAs will understand "we're this many transactions > > behind", but managers and end users won't. Unless it's unreasonable to > > provide that info, we should do so. > > No doubt about that, the problem is that it's hard to provide a reliable > time-based indication. I think I have one now. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On Tue, 2010-05-18 at 17:06 -0400, Heikki Linnakangas wrote: > On 17/05/10 04:40, Simon Riggs wrote: > > On Sun, 2010-05-16 at 16:53 +0100, Simon Riggs wrote: > >>> > >>> Attached patch rearranges the walsender loops slightly to fix the above. > >>> XLogSend() now only sends up to MAX_SEND_SIZE bytes (== XLOG_SEG_SIZE / > >>> 2) in one round and returns to the main loop after that even if there's > >>> unsent WAL, and the main loop no longer sleeps if there's unsent WAL. > >>> That way the main loop gets to respond to signals quickly, and we also > >>> get to update the shared memory status and PS display more often when > >>> there's a lot of catching up to do. > >>> > >>> Comments > >> > >> 8MB at a time still seems like a large batch to me. > >> > >> libpq is going to send it in smaller chunks anyway, so I don't see the > >> importance of trying to keep the batch too large. It just introduces > >> delay into the sending process. We should be sending chunks that matches > >> libpq better. > > > > More to the point the logic will fail if XLOG_BLCKSZ> PQ_BUFFER_SIZE > > because it will send partial pages. > > I don't see a failure. We rely on not splitting WAL records across > messages, but we're talking about libpq-level CopyData messages, not TCP > messages. OK > > Having MAX_SEND_SIZE> PQ_BUFFER_SIZE is pointless, as libpq currently > > stands. > > Well, it does affect the size of the read() in walsender, and I'm sure > there's some overhead in setting the ps display and the other small > stuff we do once per message. But you're probably right that we could > easily make MAX_SEND_SIZE much smaller with no noticeable affect on > performance, while making walsender more responsive to signals. I'll > decrease it to, say, 512 kB. I'm pretty certain we don't need to set the ps display once per message. ps doesn't need an update 5 times per second on average. 
There's no reason that the buffer size we use for XLogRead() should be the same as the send buffer, if you're worried about that. My point is that pq_putmessage contains internal flushes so at the libpq level you gain nothing by big batches. The read() will be buffered anyway with readahead so not sure what the issue is. We'll have to do this for sync rep anyway, so what's the big deal? Just do it now, once. Do we really want 9.1 code to differ here? -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On 17/05/10 12:36, Jim Nasby wrote: On May 15, 2010, at 12:05 PM, Heikki Linnakangas wrote: What exactly is the user trying to monitor? If it's "how far behind is the standby", the difference between pg_current_xlog_insert_location() in the master and pg_last_xlog_replay_location() in the standby seems more robust and well-defined to me. It's a measure of XLOG location (ie. bytes) instead of time, but time is a complicated concept. I can tell you that end users *will* want a time-based indication of how far behind we are. DBAs will understand "we're this many transactions behind", but managers and end users won't. Unless it's unreasonable to provide that info, we should do so. No doubt about that, the problem is that it's hard to provide a reliable time-based indication. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
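[Editor's note] The byte-based lag Heikki proposes can be computed from the two location values. A sketch assuming the two-part (xlogid, xrecoff) representation of that era, and simplifying by treating each xlogid as covering 2^32 bytes; the real server of the time left the final segment of each logical file unused, so production code would need the exact XLogFileSize arithmetic:

```c
#include <stdint.h>

/* Pre-9.3-style WAL location: a logical file id plus a byte offset. */
typedef struct
{
    uint32_t xlogid;
    uint32_t xrecoff;
} XLogRecPtrCompat;

/* Approximate replication lag in bytes between the master's insert
 * location and the standby's replay location. Assumes master >= standby
 * and the simplified 2^32-bytes-per-xlogid model described above. */
static uint64_t wal_lag_bytes(XLogRecPtrCompat master, XLogRecPtrCompat standby)
{
    uint64_t m = ((uint64_t) master.xlogid << 32) | master.xrecoff;
    uint64_t s = ((uint64_t) standby.xlogid << 32) | standby.xrecoff;
    return m - s;
}
```

This is the "measure of XLOG location (ie. bytes)" in the message above: robust and well-defined, but it says nothing about time, which is exactly the gap the keepalive patch tries to fill.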
Re: [HACKERS] Keepalive for max_standby_delay
On 17/05/10 04:40, Simon Riggs wrote: On Sun, 2010-05-16 at 16:53 +0100, Simon Riggs wrote: Attached patch rearranges the walsender loops slightly to fix the above. XLogSend() now only sends up to MAX_SEND_SIZE bytes (== XLOG_SEG_SIZE / 2) in one round and returns to the main loop after that even if there's unsent WAL, and the main loop no longer sleeps if there's unsent WAL. That way the main loop gets to respond to signals quickly, and we also get to update the shared memory status and PS display more often when there's a lot of catching up to do. Comments 8MB at a time still seems like a large batch to me. libpq is going to send it in smaller chunks anyway, so I don't see the importance of trying to keep the batch too large. It just introduces delay into the sending process. We should be sending chunks that match libpq better. More to the point the logic will fail if XLOG_BLCKSZ > PQ_BUFFER_SIZE because it will send partial pages. I don't see a failure. We rely on not splitting WAL records across messages, but we're talking about libpq-level CopyData messages, not TCP messages. Having MAX_SEND_SIZE > PQ_BUFFER_SIZE is pointless, as libpq currently stands. Well, it does affect the size of the read() in walsender, and I'm sure there's some overhead in setting the ps display and the other small stuff we do once per message. But you're probably right that we could easily make MAX_SEND_SIZE much smaller with no noticeable effect on performance, while making walsender more responsive to signals. I'll decrease it to, say, 512 kB. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On May 15, 2010, at 12:05 PM, Heikki Linnakangas wrote: > What exactly is the user trying to monitor? If it's "how far behind is > the standby", the difference between pg_current_xlog_insert_location() > in the master and pg_last_xlog_replay_location() in the standby seems > more robust and well-defined to me. It's a measure of XLOG location (ie. > bytes) instead of time, but time is a complicated concept. I can tell you that end users *will* want a time-based indication of how far behind we are. DBAs will understand "we're this many transactions behind", but managers and end users won't. Unless it's unreasonable to provide that info, we should do so. -- Jim C. Nasby, Database Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On Sun, 2010-05-16 at 16:53 +0100, Simon Riggs wrote: > > > > Attached patch rearranges the walsender loops slightly to fix the above. > > XLogSend() now only sends up to MAX_SEND_SIZE bytes (== XLOG_SEG_SIZE / > > 2) in one round and returns to the main loop after that even if there's > > unsent WAL, and the main loop no longer sleeps if there's unsent WAL. > > That way the main loop gets to respond to signals quickly, and we also > > get to update the shared memory status and PS display more often when > > there's a lot of catching up to do. > > > > Comments > > 8MB at a time still seems like a large batch to me. > > libpq is going to send it in smaller chunks anyway, so I don't see the > importance of trying to keep the batch too large. It just introduces > delay into the sending process. We should be sending chunks that match > libpq better. More to the point the logic will fail if XLOG_BLCKSZ > PQ_BUFFER_SIZE because it will send partial pages. Having MAX_SEND_SIZE > PQ_BUFFER_SIZE is pointless, as libpq currently stands. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On Mon, 2010-05-17 at 11:51 +0900, Fujii Masao wrote: > Is it OK that this keepalive message cannot be used by HS in file-based > log-shipping? Even in SR, the startup process cannot use the keepalive > until walreceiver has been started up. Yes, I see those things. We already have archive_timeout to handle the keepalive case in file-based mode. When starting up, the delay is high anyway, so accuracy doesn't really matter much - though we do use latestXLogTime in that case. > WalSndKeepAlive() always calls initStringInfo(), which seems to cause > memory-leak. Thanks. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On Mon, May 17, 2010 at 1:11 AM, Simon Riggs wrote: > On Sat, 2010-05-15 at 19:50 +0100, Simon Riggs wrote: >> On Sat, 2010-05-15 at 18:24 +0100, Simon Riggs wrote: >> >> > I will recode using that concept. > >> Startup gets new pointer when it runs out of data to replay. That might >> or might not include an updated keepalive timestamp, since there's no >> exact relationship between chunks sent and chunks received. Startup >> might ask for a new chunk when half a chunk has been received, or when >> multiple chunks have been received. > > New version, with some other cleanup of wait processing. > > New logic is that when Startup asks for next applychunk of WAL it saves > the lastChunkTimestamp. That is then the base time used by > WaitExceedsMaxStandbyDelay(), except when latestXLogTime is later. > Since multiple receivechunks can arrive from primary before Startup asks > for next applychunk we use the oldest receivechunk timestamp, not the > latest. Doing it this way means the lastChunkTimestamp doesn't change > when new keepalives arrive, so we have a stable and reasonably accurate > recordSendTimestamp for each WAL record. > > The keepalive is sent as the first part of a new message, if any. So > partial chunks of data always have an accurate timestamp, even if that > is slightly older as a result. Doesn't make much difference except with > very large chunks. > > I think that addresses the points raised on this thread and others. Is it OK that this keepalive message cannot be used by HS in file-based log-shipping? Even in SR, the startup process cannot use the keepalive until walreceiver has been started up. WalSndKeepAlive() always calls initStringInfo(), which seems to cause memory-leak. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
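[Editor's note] On the leak Fujii reports: initStringInfo() allocates a fresh buffer every time it is called, so calling it once per keepalive in a long-lived process keeps allocating and never frees; the usual pattern is to initialize once and reset the buffer per message. A stand-in illustration of the two patterns using plain malloc (not the server's palloc-based StringInfo; the names here are invented for the sketch):

```c
#include <stdlib.h>

/* Toy stand-in for the server's StringInfoData. */
typedef struct
{
    char  *data;
    size_t len;
    size_t maxlen;
} StrBuf;

/* Allocate a fresh buffer. Calling this repeatedly on the same struct,
 * as WalSndKeepAlive() does with initStringInfo(), drops the previous
 * allocation on the floor -- the leak pattern reported above. */
static void strbuf_init(StrBuf *b)
{
    b->maxlen = 1024;
    b->data = malloc(b->maxlen);
    b->len = 0;
}

/* Reuse the existing allocation: the leak-free per-message pattern. */
static void strbuf_reset(StrBuf *b)
{
    b->len = 0;
}
```

After a single strbuf_init(), each message cycle calls strbuf_reset(), so the data pointer is stable and nothing accumulates.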
Re: [HACKERS] Keepalive for max_standby_delay
On Sun, May 16, 2010 at 6:05 AM, Heikki Linnakangas wrote: > Heikki Linnakangas wrote: >> Simon Riggs wrote: >>> WALSender sleeps even when it might have more WAL to send, it doesn't >>> check it just unconditionally sleeps. At least WALReceiver loops until >>> it has no more to receive. I just can't imagine why that's useful >>> behaviour. >> >> Good catch. That should be fixed. >> >> I also note that walsender doesn't respond to signals, while it's >> sending a large batch. That's analogous to the issue that was addressed >> recently in the archiver process. > > Attached patch rearranges the walsender loops slightly to fix the above. > XLogSend() now only sends up to MAX_SEND_SIZE bytes (== XLOG_SEG_SIZE / > 2) in one round and returns to the main loop after that even if there's > unsent WAL, and the main loop no longer sleeps if there's unsent WAL. > That way the main loop gets to respond to signals quickly, and we also > get to update the shared memory status and PS display more often when > there's a lot of catching up to do. > > Comments The main loop also needs to respond quickly to the 'X' message from the standby? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On Sat, 2010-05-15 at 19:50 +0100, Simon Riggs wrote: > On Sat, 2010-05-15 at 18:24 +0100, Simon Riggs wrote: > > > I will recode using that concept. > Startup gets new pointer when it runs out of data to replay. That might > or might not include an updated keepalive timestamp, since there's no > exact relationship between chunks sent and chunks received. Startup > might ask for a new chunk when half a chunk has been received, or when > multiple chunks have been received. New version, with some other cleanup of wait processing. New logic is that when Startup asks for next applychunk of WAL it saves the lastChunkTimestamp. That is then the base time used by WaitExceedsMaxStandbyDelay(), except when latestXLogTime is later. Since multiple receivechunks can arrive from primary before Startup asks for next applychunk we use the oldest receivechunk timestamp, not the latest. Doing it this way means the lastChunkTimestamp doesn't change when new keepalives arrive, so we have a stable and reasonably accurate recordSendTimestamp for each WAL record. The keepalive is sent as the first part of a new message, if any. So partial chunks of data always have an accurate timestamp, even if that is slightly older as a result. Doesn't make much difference except with very large chunks. I think that addresses the points raised on this thread and others. -- Simon Riggs www.2ndQuadrant.com *** a/doc/src/sgml/protocol.sgml --- b/doc/src/sgml/protocol.sgml *** *** 4232,4247 The commands accepted in walsender mode are: ! ! ! ! ! A single WAL record is never split across two CopyData messages. When a WAL record crosses a WAL page boundary, however, and is therefore already split using continuation records, it can be split at the page boundary. In other words, the first main WAL record and its continuation records can be split across different CopyData messages. --- 4232,4283 ! A single WAL record is never split across two CopyData messages. 
When a WAL record crosses a WAL page boundary, however, and is therefore already split using continuation records, it can be split at the page boundary. In other words, the first main WAL record and its continuation records can be split across different CopyData messages. + + + + + + Keepalive (B) + + + + + + + Byte1('k') + + + + Identifies the message as a keepalive. + + + + + + TimestampTz + + + + The current timestamp on the primary server when the keepalive was sent. + + + + + + +If wal_level is set to hot_standby then a keepalive +is sent once per wal_sender_delay. The keepalive is sent after +WAL data has been sent, if any. + + + + *** a/src/backend/access/transam/xlog.c --- b/src/backend/access/transam/xlog.c *** *** 1938,1944 UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force) UpdateControlFile(); minRecoveryPoint = newMinRecoveryPoint; ! ereport(DEBUG2, (errmsg("updated min recovery point to %X/%X", minRecoveryPoint.xlogid, minRecoveryPoint.xrecoff))); } --- 1938,1944 UpdateControlFile(); minRecoveryPoint = newMinRecoveryPoint; ! ereport(DEBUG3, (errmsg("updated min recovery point to %X/%X", minRecoveryPoint.xlogid, minRecoveryPoint.xrecoff))); } *** *** 9212,9218 retry: * While walreceiver is active, wait for new WAL to arrive * from primary. */ ! receivedUpto = GetWalRcvWriteRecPtr(); if (XLByteLT(*RecPtr, receivedUpto)) { /* --- 9212,9218 * While walreceiver is active, wait for new WAL to arrive * from primary. */ ! 
receivedUpto = GetWalRcvNextChunk(); if (XLByteLT(*RecPtr, receivedUpto)) { /* *** a/src/backend/replication/walreceiver.c --- b/src/backend/replication/walreceiver.c *** *** 407,412 XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len) --- 407,428 XLogWalRcvWrite(buf, len, recptr); break; } + case 'k':/* keepalive */ + { + TimestampTz keepalive; + + if (len != sizeof(TimestampTz)) + ereport(ERROR, + (errcode(ERRCODE_PROTOCOL_VIOLATION), + errmsg_internal("invalid keepalive message received from primary"))); + + memcpy(&keepalive, buf, sizeof(TimestampTz)); + buf += sizeof(TimestampTz); + len -= sizeof(TimestampTz); + + SetWalRcvChunkTimestamp(keepalive); + break; + } default: ereport(ERROR,
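[Editor's note] The walreceiver hunk above validates and decodes the new 'k' message, whose body is a single TimestampTz. The same step in isolation, with TimestampTz modeled as 64-bit microseconds and the ereport() reduced to a return code (a sketch of the shape, not the actual server code):

```c
#include <stdint.h>
#include <string.h>

/* The server's TimestampTz is a 64-bit microsecond count. */
typedef int64_t TimestampTz;

/* Decode a keepalive message body: it must be exactly one TimestampTz.
 * Returns 0 and stores the timestamp on success; returns -1 on a length
 * mismatch, where the patch raises ERRCODE_PROTOCOL_VIOLATION with
 * "invalid keepalive message received from primary". */
static int decode_keepalive(const char *buf, size_t len, TimestampTz *ts)
{
    if (len != sizeof(TimestampTz))
        return -1;
    memcpy(ts, buf, sizeof(TimestampTz));   /* avoid unaligned access */
    return 0;
}
```

The length check matters because the message type byte has already been consumed by the dispatcher; anything other than exactly sizeof(TimestampTz) remaining means the primary and standby disagree about the protocol.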
Re: [HACKERS] Keepalive for max_standby_delay
On Sun, 2010-05-16 at 00:05 +0300, Heikki Linnakangas wrote: > Heikki Linnakangas wrote: > > Simon Riggs wrote: > >> WALSender sleeps even when it might have more WAL to send, it doesn't > >> check it just unconditionally sleeps. At least WALReceiver loops until > >> it has no more to receive. I just can't imagine why that's useful > >> behaviour. > > > > Good catch. That should be fixed. > > > > I also note that walsender doesn't respond to signals, while it's > > sending a large batch. That's analogous to the issue that was addressed > > recently in the archiver process. > > Attached patch rearranges the walsender loops slightly to fix the above. > XLogSend() now only sends up to MAX_SEND_SIZE bytes (== XLOG_SEG_SIZE / > 2) in one round and returns to the main loop after that even if there's > unsent WAL, and the main loop no longer sleeps if there's unsent WAL. > That way the main loop gets to respond to signals quickly, and we also > get to update the shared memory status and PS display more often when > there's a lot of catching up to do. > > Comments 8MB at a time still seems like a large batch to me. libpq is going to send it in smaller chunks anyway, so I don't see the importance of trying to keep the batch too large. It just introduces delay into the sending process. We should be sending chunks that match libpq better. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
Heikki Linnakangas wrote: > Simon Riggs wrote: >> WALSender sleeps even when it might have more WAL to send, it doesn't >> check it just unconditionally sleeps. At least WALReceiver loops until >> it has no more to receive. I just can't imagine why that's useful >> behaviour. > > Good catch. That should be fixed. > > I also note that walsender doesn't respond to signals, while it's > sending a large batch. That's analogous to the issue that was addressed > recently in the archiver process. Attached patch rearranges the walsender loops slightly to fix the above. XLogSend() now only sends up to MAX_SEND_SIZE bytes (== XLOG_SEG_SIZE / 2) in one round and returns to the main loop after that even if there's unsent WAL, and the main loop no longer sleeps if there's unsent WAL. That way the main loop gets to respond to signals quickly, and we also get to update the shared memory status and PS display more often when there's a lot of catching up to do. Comments, have I screwed up anything? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com *** a/src/backend/replication/walsender.c --- b/src/backend/replication/walsender.c *** *** 100,106 static void InitWalSnd(void); static void WalSndHandshake(void); static void WalSndKill(int code, Datum arg); static void XLogRead(char *buf, XLogRecPtr recptr, Size nbytes); ! static bool XLogSend(StringInfo outMsg); static void CheckClosedConnection(void); /* --- 100,106 static void WalSndHandshake(void); static void WalSndKill(int code, Datum arg); static void XLogRead(char *buf, XLogRecPtr recptr, Size nbytes); ! static bool XLogSend(StringInfo outMsg, bool *caughtup); static void CheckClosedConnection(void); /* *** *** 360,365 static int --- 360,366 WalSndLoop(void) { StringInfoData output_message; + bool caughtup = false; initStringInfo(&output_message); *** *** 387,393 WalSndLoop(void) */ if (ready_to_stop) { ! XLogSend(&output_message); shutdown_requested = true; } --- 388,394 */ if (ready_to_stop) { ! 
XLogSend(&output_message, &caughtup); shutdown_requested = true; } *** *** 402,432 WalSndLoop(void) } /* ! * Nap for the configured time or until a message arrives. * * On some platforms, signals won't interrupt the sleep. To ensure we * respond reasonably promptly when someone signals us, break down the * sleep into NAPTIME_PER_CYCLE increments, and check for * interrupts after each nap. */ ! remain = WalSndDelay * 1000L; ! while (remain > 0) { ! if (got_SIGHUP || shutdown_requested || ready_to_stop) ! break; ! /* ! * Check to see whether a message from the standby or an interrupt ! * from other processes has arrived. ! */ ! pg_usleep(remain > NAPTIME_PER_CYCLE ? NAPTIME_PER_CYCLE : remain); ! CheckClosedConnection(); ! remain -= NAPTIME_PER_CYCLE; } - /* Attempt to send the log once every loop */ ! if (!XLogSend(&output_message)) goto eof; } --- 403,434 } /* ! * If we had sent all accumulated WAL in last round, nap for the ! * configured time before retrying. * * On some platforms, signals won't interrupt the sleep. To ensure we * respond reasonably promptly when someone signals us, break down the * sleep into NAPTIME_PER_CYCLE increments, and check for * interrupts after each nap. */ ! if (caughtup) { ! remain = WalSndDelay * 1000L; ! while (remain > 0) ! { ! /* Check for interrupts */ ! if (got_SIGHUP || shutdown_requested || ready_to_stop) ! break; ! /* Sleep and check that the connection is still alive */ ! pg_usleep(remain > NAPTIME_PER_CYCLE ? NAPTIME_PER_CYCLE : remain); ! CheckClosedConnection(); ! remain -= NAPTIME_PER_CYCLE; ! } } /* Attempt to send the log once every loop */ ! if (!XLogSend(&output_message, &caughtup)) goto eof; } *** *** 623,637 XLogRead(char *buf, XLogRecPtr recptr, Size nbytes) } /* ! * Read all WAL that's been written (and flushed) since last cycle, and send ! * it to client. * * Returns true if OK, false if trouble. */ static bool ! 
XLogSend(StringInfo outMsg) { XLogRecPtr SendRqstPtr; char activitymsg[50]; /* use volatile pointer to prevent code rearrangement */ --- 625,644 } /* ! * Read up to MAX_SEND_SIZE bytes of WAL that's been written (and flushed), ! * but not yet sent to the client, and send it. If there is no unsent WAL, ! * *caughtup is set to true and nothing is sent, otherwise *caughtup is set ! * to false. * * Returns true if OK, false if trouble. */ static bool ! XLogSend(StringInfo outMsg, bool *caughtup) { XLogRecPtr SendRqstPtr; + XLogRecPtr startptr; + XLogRecPtr endptr; + Size nbytes; char acti
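[Editor's note] Stripped of the I/O and shared-memory details, the control flow of the patch above is: send at most MAX_SEND_SIZE bytes per round, report via *caughtup whether anything remains, and only nap when caught up, so signals get checked between rounds. A self-contained sketch of that shape (illustrative names, not the real walsender API; MAX_SEND_SIZE here is the 128 kB value eventually committed, not the XLOG_SEG_SIZE / 2 of this draft):

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_SEND_SIZE (128 * 1024)   /* chunk size per round */

/* One round of sending: advance *sentPtr by at most MAX_SEND_SIZE toward
 * writePtr and set *caughtup when nothing remains. Mirrors the shape of
 * the patched XLogSend(), with the read() and pq_putmessage() elided. */
static void send_one_chunk(uint64_t writePtr, uint64_t *sentPtr, bool *caughtup)
{
    uint64_t remaining = writePtr - *sentPtr;
    uint64_t n = remaining < MAX_SEND_SIZE ? remaining : MAX_SEND_SIZE;

    *sentPtr += n;                    /* stand-in for the actual send */
    *caughtup = (*sentPtr == writePtr);
}

/* Rounds needed to drain a backlog. Between rounds the real main loop
 * checks got_SIGHUP / shutdown_requested, so a smaller chunk size means
 * more frequent signal checks while catching up. */
static int rounds_to_drain(uint64_t backlog)
{
    uint64_t sent = 0;
    bool     caughtup = false;
    int      rounds = 0;

    while (!caughtup)
    {
        send_one_chunk(backlog, &sent, &caughtup);
        rounds++;
    }
    return rounds;
}
```

This also shows why the main loop must not sleep while *caughtup is false: sleeping between rounds would reintroduce exactly the burst-then-stall behaviour Simon complains about.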
Re: [HACKERS] Keepalive for max_standby_delay
Simon Riggs wrote: > WALSender sleeps even when it might have more WAL to send, it doesn't > check it just unconditionally sleeps. At least WALReceiver loops until > it has no more to receive. I just can't imagine why that's useful > behaviour. Good catch. That should be fixed. I also note that walsender doesn't respond to signals, while it's sending a large batch. That's analogous to the issue that was addressed recently in the archiver process. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On Sat, 2010-05-15 at 18:24 +0100, Simon Riggs wrote: > I will recode using that concept. There are some behaviours that aren't helpful here: Startup gets new pointer when it runs out of data to replay. That might or might not include an updated keepalive timestamp, since there's no exact relationship between chunks sent and chunks received. Startup might ask for a new chunk when half a chunk has been received, or when multiple chunks have been received. WALSender doesn't chunk up what it sends, though libpq does at a lower level. So there's no way to make a keepalive happen every X bytes without doing this from within libpq. WALSender sleeps even when it might have more WAL to send; it doesn't check, it just unconditionally sleeps. At least WALReceiver loops until it has no more to receive. I just can't imagine why that's useful behaviour. All in all, I think we should be calling this "burst replication", not streaming replication. That does cause an issue in trying to monitor what's going on because there's so much jitter. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On Sat, 2010-05-15 at 20:05 +0300, Heikki Linnakangas wrote: > Simon Riggs wrote: > > On Sat, 2010-05-15 at 19:30 +0300, Heikki Linnakangas wrote: > >> Doesn't feel right to me either. If you want to expose the > >> keepalive-time to queries, it should be a separate field, something like > >> lastMasterKeepaliveTime and a pg_last_master_keepalive() function to > >> read it. > > > > That wouldn't be good because then you couldn't easily monitor the > > delay? You'd have to run two different functions depending on the state > > of replication (for which we would need yet another function). Users > > would just wrap that back up into a single function. > > What exactly is the user trying to monitor? If it's "how far behind is > the standby", the difference between pg_current_xlog_insert_location() > in the master and pg_last_xlog_replay_location() in the standby seems > more robust and well-defined to me. It's a measure of XLOG location (ie. > bytes) instead of time, but time is a complicated concept. Maybe, but it's meaningful to users and that is the point. > Also note that as the patch stands, if you receive a keep-alive from the > master at point X, it doesn't mean that the standby is fully up-to-date. > It's possible that the walsender just finished sending a huge batch of > accumulated WAL, say 1 GB, and it took 1 hour for the batch to be sent. > During that time, a lot more WAL has accumulated, yet walsender sends a > keep-alive with the current timestamp. Not at all. The timestamp for the keepalive is calculated after the pq_flush for the main WAL data. So even if it takes 10 years to send the next blob of WAL data, the timestamp will be current. However, a point you made in an earlier thread is still true here. It sounds like it would be best if the startup process didn't re-access shared memory for the next location until it had fully replayed all the WAL up to the point it last accessed. 
That way the changing value of the shared timestamp would have no effect on the calculated value at any point in time. I will recode using that concept. > In general, the purpose of a keep-alive is to keep the connection alive, > but you're trying to accomplish something else too, and I don't fully > understand what it is. That surprises me. If nothing else, it's in the title of the thread, though since you personally added this to the Hot Standby todo more than 6 months ago I'd hope you of all people would understand the purpose. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
Simon Riggs wrote: > On Sat, 2010-05-15 at 19:30 +0300, Heikki Linnakangas wrote: >> Doesn't feel right to me either. If you want to expose the >> keepalive-time to queries, it should be a separate field, something like >> lastMasterKeepaliveTime and a pg_last_master_keepalive() function to >> read it. > > That wouldn't be good because then you couldn't easily monitor the > delay? You'd have to run two different functions depending on the state > of replication (for which we would need yet another function). Users > would just wrap that back up into a single function. What exactly is the user trying to monitor? If it's "how far behind is the standby", the difference between pg_current_xlog_insert_location() in the master and pg_last_xlog_replay_location() in the standby seems more robust and well-defined to me. It's a measure of XLOG location (ie. bytes) instead of time, but time is a complicated concept. Also note that as the patch stands, if you receive a keep-alive from the master at point X, it doesn't mean that the standby is fully up-to-date. It's possible that the walsender just finished sending a huge batch of accumulated WAL, say 1 GB, and it took 1 hour for the batch to be sent. During that time, a lot more WAL has accumulated, yet walsender sends a keep-alive with the current timestamp. In general, the purpose of a keep-alive is to keep the connection alive, but you're trying to accomplish something else too, and I don't fully understand what it is. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On Sat, 2010-05-15 at 19:30 +0300, Heikki Linnakangas wrote: > Simon Riggs wrote: > > On Sat, 2010-05-15 at 11:45 -0400, Tom Lane wrote: > >> I'm also extremely dubious that it's a good idea to set > >> recoveryLastXTime from this. Using both that and the timestamps from > >> the wal log flies in the face of everything I remember about control > >> theory. We should be doing only one or only the other, I think. > > > > I can change it so that the recoveryLastXTime will not be updated if we > > are using the value from the keepalives. So we have one, or the other. > > Remember that replication can switch backwards and forwards between > > modes, so it seems sensible to have a common timing value whichever mode > > we're in. > > That means that recoveryLastXTime can jump forwards and backwards. That behaviour would be bad, so why not just prevent that from happening? > Doesn't feel right to me either. If you want to expose the > keepalive-time to queries, it should be a separate field, something like > lastMasterKeepaliveTime and a pg_last_master_keepalive() function to > read it. That wouldn't be good because then you couldn't easily monitor the delay? You'd have to run two different functions depending on the state of replication (for which we would need yet another function). Users would just wrap that back up into a single function. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
Simon Riggs wrote:
> On Sat, 2010-05-15 at 11:45 -0400, Tom Lane wrote:
>> I'm also extremely dubious that it's a good idea to set
>> recoveryLastXTime from this. Using both that and the timestamps from
>> the wal log flies in the face of everything I remember about control
>> theory. We should be doing only one or only the other, I think.
>
> I can change it so that the recoveryLastXTime will not be updated if we
> are using the value from the keepalives. So we have one, or the other.
> Remember that replication can switch backwards and forwards between
> modes, so it seems sensible to have a common timing value whichever mode
> we're in.

That means that recoveryLastXTime can jump forwards and backwards.
Doesn't feel right to me either. If you want to expose the
keepalive-time to queries, it should be a separate field, something like
lastMasterKeepaliveTime and a pg_last_master_keepalive() function to
read it.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Keepalive for max_standby_delay
On Sat, 2010-05-15 at 11:45 -0400, Tom Lane wrote:
> Simon Riggs writes:
> > Patch adds a keepalive message to ensure max_standby_delay is useful.
>
> The proposed placement of the keepalive-send is about the worst it could
> possibly be. It needs to be done right before pq_flush to ensure
> minimum transfer delay. Otherwise any attempt to measure clock skew
> using the timestamp will be seriously off. Where you've placed it
> guarantees maximum delay not minimum.

I don't understand. WalSndKeepAlive() contains a pq_flush() immediately
after the timestamp is set. I did it that way for exactly the same
reasons you've said.

Do you mean you only want to see one pq_flush()? I used two so that the
actual data is never delayed by a keepalive. WAL Sender was going to
sleep anyway, so it shouldn't be a problem.

> I'm also extremely dubious that it's a good idea to set
> recoveryLastXTime from this. Using both that and the timestamps from
> the wal log flies in the face of everything I remember about control
> theory. We should be doing only one or only the other, I think.

I can change it so that the recoveryLastXTime will not be updated if we
are using the value from the keepalives. So we have one, or the other.
Remember that replication can switch backwards and forwards between
modes, so it seems sensible to have a common timing value whichever mode
we're in.

--
Simon Riggs           www.2ndQuadrant.com
Re: [HACKERS] Keepalive for max_standby_delay
Simon Riggs writes:
> Patch adds a keepalive message to ensure max_standby_delay is useful.

The proposed placement of the keepalive-send is about the worst it could
possibly be. It needs to be done right before pq_flush to ensure
minimum transfer delay. Otherwise any attempt to measure clock skew
using the timestamp will be seriously off. Where you've placed it
guarantees maximum delay not minimum.

I'm also extremely dubious that it's a good idea to set
recoveryLastXTime from this. Using both that and the timestamps from
the wal log flies in the face of everything I remember about control
theory. We should be doing only one or only the other, I think.

			regards, tom lane
[HACKERS] Keepalive for max_standby_delay
Patch adds a keepalive message to ensure max_standby_delay is useful.

No WAL format changes, no libpq changes. Just an additional message type
for the streaming replication protocol, sent once per main loop in
WALsender. Plus docs.

Comments?

--
Simon Riggs           www.2ndQuadrant.com


diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index c63d003..391d990 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -4232,16 +4232,52 @@ The commands accepted in walsender mode are:
-      </para>
-      </listitem>
-      </varlistentry>
-      </variablelist>
-      </para>
+      A single WAL record is never split across two CopyData messages.
+      When a WAL record crosses a WAL page boundary, however, and is
+      therefore already split using continuation records, it can be
+      split at the page boundary. In other words, the first main WAL
+      record and its continuation records can be split across different
+      CopyData messages.
+      </para>
+      </listitem>
+      </varlistentry>
+      <varlistentry>
+      <term>
+          Keepalive (B)
+      </term>
+      <listitem>
+      <para>
+      <variablelist>
+      <varlistentry>
+      <term>
+          Byte1('k')
+      </term>
+      <listitem>
+      <para>
+          Identifies the message as a keepalive.
+      </para>
+      </listitem>
+      </varlistentry>
+      <varlistentry>
+      <term>
+          TimestampTz
+      </term>
+      <listitem>
+      <para>
+          The current timestamp on the primary server when the
+          keepalive was sent.
+      </para>
+      </listitem>
+      </varlistentry>
+      </variablelist>
+      </para>
+      <para>
+       If <varname>wal_level</varname> is set to <literal>hot_standby</literal>
+       then a keepalive is sent once per <varname>wal_sender_delay</varname>.
+       The keepalive is sent after WAL data has been sent, if any.
+      </para>
+      </listitem>
+      </varlistentry>
+      </variablelist>
+      </para>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 607d57e..ee383af 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5566,6 +5566,17 @@ GetLatestXLogTime(void)
 	return recoveryLastXTime;
 }
 
+void
+SetLatestXLogTime(TimestampTz newLastXTime)
+{
+	/* use volatile pointer to prevent code rearrangement */
+	volatile XLogCtlData *xlogctl = XLogCtl;
+
+	SpinLockAcquire(&xlogctl->info_lck);
+	xlogctl->recoveryLastXTime = newLastXTime;
+	SpinLockRelease(&xlogctl->info_lck);
+}
+
 /*
  * Note that text field supplied is a parameter name and does not require
  * translation
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index bb87a06..8d52c3f 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -407,6 +407,22 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 				XLogWalRcvWrite(buf, len, recptr);
 				break;
 			}
+		case 'k':				/* keepalive */
+			{
+				TimestampTz keepalive;
+
+				if (len != sizeof(TimestampTz))
+					ereport(ERROR,
+							(errcode(ERRCODE_PROTOCOL_VIOLATION),
+							 errmsg_internal("invalid keepalive message received from primary")));
+
+				memcpy(&keepalive, buf, sizeof(TimestampTz));
+				buf += sizeof(TimestampTz);
+				len -= sizeof(TimestampTz);
+
+				SetLatestXLogTime(keepalive);
+				break;
+			}
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 1c04fc3..f2f8750 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -98,6 +98,7 @@ static void WalSndQuickDieHandler(SIGNAL_ARGS);
 static int	WalSndLoop(void);
 static void InitWalSnd(void);
 static void WalSndHandshake(void);
+static bool WalSndKeepAlive(void);
 static void WalSndKill(int code, Datum arg);
 static void XLogRead(char *buf, XLogRecPtr recptr, Size nbytes);
 static bool XLogSend(StringInfo outMsg);
@@ -314,6 +315,30 @@ WalSndHandshake(void)
 	}
 }
 
+static bool
+WalSndKeepAlive(void)
+{
+	StringInfoData outMsg;
+	TimestampTz ts;
+
+	if (!XLogStandbyInfoActive())
+		return true;
+
+	initStringInfo(&outMsg);
+	ts = GetCurrentTimestamp();
+
+	/* format the keepalive message */
+	pq_sendbyte(&outMsg, 'k');
+	pq_sendbytes(&outMsg, (char *) &ts, sizeof(TimestampTz));
+
+	/* send the CopyData message */
+	pq_putmessage('d', outMsg.data, outMsg.len);
+	if (pq_flush())
+		return false;
+
+	return true;
+}
+
 /*
  * Check if the remote end has closed the connection.
  */
@@ -428,6 +453,9 @@ WalSndLoop(void)
 		/* Attempt to send the log once every loop */
 		if (!XLogSend(&output_message))
 			goto eof;
+
+		if (!WalSndKeepAlive())
+			goto eof;
 	}
 
 	/* can't get here because the above loop never exits */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 8ff68a6..2d01670 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -284,6 +284,7 @@ extern void issue_xlog_fsync(int fd, uint32 log, uint32 seg);
 extern bool RecoveryInProgress(void);
 extern bool XLogInsertAllowed(void);
 extern TimestampTz GetLatestXLogTime(void);
+extern void SetLatestXLogTime(TimestampTz newLatestXLogTime);
 extern void UpdateControlFile(void);
 extern uint64 GetSystemIdentifier(void);