Re: [HACKERS] Measuring replay lag

2017-03-23 Thread Craig Ringer
On 24 March 2017 at 05:39, Thomas Munro wrote: > Fujii-san for the idea of tracking write and flush lag too You mentioned wishing that logical replication would update sent lag as the decoding position. It appears to do just that already; see the references to

Re: [HACKERS] Measuring replay lag

2017-03-23 Thread Thomas Munro
On Thu, Mar 23, 2017 at 10:50 PM, Simon Riggs wrote: >> Second thoughts... I'll just make LagTrackerWrite externally >> available, so a plugin can send anything it wants to the tracker. >> Which means I'm explicitly removing the "logical replication support" >> from this

Re: [HACKERS] Measuring replay lag

2017-03-23 Thread Thomas Munro
On Thu, Mar 23, 2017 at 10:50 PM, Simon Riggs wrote: >> Second thoughts... I'll just make LagTrackerWrite externally >> available, so a plugin can send anything it wants to the tracker. >> Which means I'm explicitly removing the "logical replication support" >> from this

Re: [HACKERS] Measuring replay lag

2017-03-23 Thread Simon Riggs
> Second thoughts... I'll just make LagTrackerWrite externally > available, so a plugin can send anything it wants to the tracker. > Which means I'm explicitly removing the "logical replication support" > from this patch. Done. Here's the patch I'm looking to commit, with some docs and minor

Re: [HACKERS] Measuring replay lag

2017-03-23 Thread Simon Riggs
On 23 March 2017 at 06:42, Simon Riggs wrote: > On 23 March 2017 at 01:02, Thomas Munro wrote: > >> Thanks! Please find attached v7, which includes a note we can point >> at when someone asks why it doesn't show 00:00:00, as requested. > >

Re: [HACKERS] Measuring replay lag

2017-03-23 Thread Simon Riggs
On 23 March 2017 at 01:02, Thomas Munro wrote: > Thanks! Please find attached v7, which includes a note we can point > at when someone asks why it doesn't show 00:00:00, as requested. Thanks. Now that I look harder, the handling for logical lag seems like it would be

Re: [HACKERS] Measuring replay lag

2017-03-22 Thread Thomas Munro
On Wed, Mar 15, 2017 at 8:15 PM, Ian Barwick wrote: >> 2. Recognise when the last reported write/flush/apply LSN from the >> standby == end of WAL on the sending server, and show lag times of >> 00:00:00 in all three columns. I consider this entirely bogus: it's >>

Re: [HACKERS] Measuring replay lag

2017-03-22 Thread Thomas Munro
On Thu, Mar 23, 2017 at 12:12 AM, Simon Riggs wrote: > On 22 March 2017 at 11:03, Thomas Munro wrote: > >> Hah. Apologies for the delay -- I will post a patch with >> documentation as requested within 24 hours. > > Thanks very much. I'll

Re: [HACKERS] Measuring replay lag

2017-03-22 Thread Robert Haas
On Wed, Mar 22, 2017 at 6:57 AM, Simon Riggs wrote: > Not sure whether this is a 6 day lag, or we should show NULL because we > are up to date. OK, that made me laugh. Thanks for putting in the effort on this patch, BTW. -- Robert Haas EnterpriseDB:

Re: [HACKERS] Measuring replay lag

2017-03-22 Thread Simon Riggs
On 22 March 2017 at 11:03, Thomas Munro wrote: > Hah. Apologies for the delay -- I will post a patch with > documentation as requested within 24 hours. Thanks very much. I'll reserve time to commit it tomorrow, all else being good. -- Simon Riggs

Re: [HACKERS] Measuring replay lag

2017-03-22 Thread Thomas Munro
On Wed, Mar 22, 2017 at 11:57 PM, Simon Riggs wrote: >>> I accept your proposal for how we handle these, on condition that you >>> write up some docs that explain the subtle difference between the two, >>> so we can just show people the URL. That needs to explain clearly

Re: [HACKERS] Measuring replay lag

2017-03-22 Thread Simon Riggs
On 21 March 2017 at 17:32, David Steele wrote: > Hi Thomas, > > On 3/15/17 8:38 PM, Simon Riggs wrote: >> >> On 16 March 2017 at 08:02, Thomas Munro >> wrote: >> >>> I agree that these states exist, but we disagree on what 'lag' really >>>

Re: [HACKERS] Measuring replay lag

2017-03-21 Thread David Steele
Hi Thomas, On 3/15/17 8:38 PM, Simon Riggs wrote: On 16 March 2017 at 08:02, Thomas Munro wrote: I agree that these states exist, but we disagree on what 'lag' really means, or, rather, which of several plausible definitions would be the most useful here. My

Re: [HACKERS] Measuring replay lag

2017-03-15 Thread Simon Riggs
On 16 March 2017 at 08:02, Thomas Munro wrote: > I agree that these states exist, but we disagree on what 'lag' really > means, or, rather, which of several plausible definitions would be the > most useful here. > > My proposal is that the *_lag columns should

Re: [HACKERS] Measuring replay lag

2017-03-15 Thread Thomas Munro
On Thu, Mar 16, 2017 at 12:07 PM, Simon Riggs wrote: > There are two ways of knowing the lag: 1) by measurement/sampling, > which is the main way this patch approaches this, 2) by direct > observation that the LSNs match. Both are equally valid ways of > establishing knowledge.
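
The two kinds of knowledge named in the quoted text can be kept side by side. The following is only a minimal Python sketch of that distinction: the function name `replication_status`, its parameters, and the returned keys are hypothetical and do not come from the patch. It also takes no position on what the *_lag columns should display when the LSNs match, which is the point under debate in this thread.

```python
def replication_status(reported_lsn, sender_end_lsn, last_measured_lag):
    """Combine the two sources of knowledge without merging them.

    reported_lsn      -- last position the standby reported (hypothetical input)
    sender_end_lsn    -- end of WAL on the sending server (hypothetical input)
    last_measured_lag -- most recent sampled lag in seconds, or None
    """
    return {
        # Direct observation: the standby has reached the sender's end of WAL.
        "caught_up": reported_lsn >= sender_end_lsn,
        # Measurement/sampling: the last lag actually timed, left untouched.
        "last_measured_lag": last_measured_lag,
    }
```

Keeping the two facts separate leaves the presentation choice (a zero interval, NULL, or the last measured value) to the caller.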

Re: [HACKERS] Measuring replay lag

2017-03-15 Thread Simon Riggs
On 14 March 2017 at 07:39, Thomas Munro wrote: > Hi, > > Please see separate replies to Simon and Craig below. > > On Sun, Mar 5, 2017 at 8:38 PM, Simon Riggs wrote: >> On 1 March 2017 at 10:47, Thomas Munro

Re: [HACKERS] Measuring replay lag

2017-03-15 Thread Simon Riggs
On 14 March 2017 at 07:39, Thomas Munro wrote: > > On Mon, Mar 6, 2017 at 3:22 AM, Craig Ringer wrote: >> On 5 March 2017 at 15:31, Simon Riggs wrote: >>> What we want from this patch is something that works for both,

Re: [HACKERS] Measuring replay lag

2017-03-15 Thread Ian Barwick
Hi Just adding a couple of thoughts on this. On 03/14/2017 08:39 AM, Thomas Munro wrote: > Hi, > > Please see separate replies to Simon and Craig below. > > On Sun, Mar 5, 2017 at 8:38 PM, Simon Riggs wrote: >> On 1 March 2017 at 10:47, Thomas Munro

Re: [HACKERS] Measuring replay lag

2017-03-13 Thread Thomas Munro
Hi, Please see separate replies to Simon and Craig below. On Sun, Mar 5, 2017 at 8:38 PM, Simon Riggs wrote: > On 1 March 2017 at 10:47, Thomas Munro wrote: >> I do see why a new user trying this feature for the first time might >> expect

Re: [HACKERS] Measuring replay lag

2017-03-05 Thread Craig Ringer
On 5 March 2017 at 15:31, Simon Riggs wrote: > On 1 March 2017 at 10:47, Thomas Munro wrote: >> This seems to be problematic. Logical peers report LSN changes for >> all three operations (write, flush, commit) only on commit. I suppose >>

Re: [HACKERS] Measuring replay lag

2017-03-04 Thread Simon Riggs
On 1 March 2017 at 10:47, Thomas Munro wrote: >>> I added a fourth case 'overwhelm.png' which you might find >>> interesting. It's essentially like one 'burst' followed by a 100% idle >>> primary. The primary stops sending new WAL around 50 seconds in and >>> then

Re: [HACKERS] Measuring replay lag

2017-03-04 Thread Simon Riggs
On 1 March 2017 at 10:47, Thomas Munro wrote: > On Fri, Feb 24, 2017 at 9:05 AM, Simon Riggs wrote: >> On 21 February 2017 at 21:38, Thomas Munro >> wrote: >>> However, I think a call like

Re: [HACKERS] Measuring replay lag

2017-03-01 Thread Thomas Munro
On Fri, Feb 24, 2017 at 9:05 AM, Simon Riggs wrote: > On 21 February 2017 at 21:38, Thomas Munro > wrote: >> However, I think a call like LagTrackerWrite(SendRqstPtr, >> GetCurrentTimestamp()) needs to go into XLogSendLogical, to mirror >>

Re: [HACKERS] Measuring replay lag

2017-02-23 Thread Simon Riggs
On 21 February 2017 at 21:38, Thomas Munro wrote: > On Tue, Feb 21, 2017 at 6:21 PM, Simon Riggs wrote: >> And happier again, leading me to move to the next stage of review, >> focusing on the behaviour emerging from the design. >> >> So my

Re: [HACKERS] Measuring replay lag

2017-02-22 Thread Thomas Munro
On Thu, Feb 23, 2017 at 11:52 AM, Thomas Munro wrote: > The overall graph looks pretty similar, but it is more likely to short > hiccups caused by occasional slow WAL fsyncs in walreceiver. See the I meant to write "more likely to *miss* short hiccups". --

Re: [HACKERS] Measuring replay lag

2017-02-22 Thread Thomas Munro
On Tue, Feb 21, 2017 at 6:21 PM, Simon Riggs wrote: > I think we need to show some test results with the graph of lag > over time for these cases: > 1. steady state - pgbench on master, so we can see how that responds > 2. blocked apply on standby - so we can see how

Re: [HACKERS] Measuring replay lag

2017-02-21 Thread Thomas Munro
On Tue, Feb 21, 2017 at 6:21 PM, Simon Riggs wrote: > And happier again, leading me to move to the next stage of review, > focusing on the behaviour emerging from the design. > > So my current understanding is that this doesn't rely upon LSN > arithmetic to measure lag,

Re: [HACKERS] Measuring replay lag

2017-02-21 Thread Simon Riggs
On 17 February 2017 at 07:45, Thomas Munro wrote: > On Fri, Feb 17, 2017 at 12:45 AM, Simon Riggs wrote: >> Feeling happier about this for now at least. > > Thanks! And happier again, leading me to move to the next stage of review, focusing

Re: [HACKERS] Measuring replay lag

2017-02-16 Thread Thomas Munro
On Fri, Feb 17, 2017 at 12:45 AM, Simon Riggs wrote: > Feeling happier about this for now at least. Thanks! > I think we need to document how this works more in README or header > comments. That way I can review it against what it aims to do rather > than what I think it

Re: [HACKERS] Measuring replay lag

2017-02-16 Thread Thomas Munro
On Thu, Feb 16, 2017 at 11:18 PM, Abhijit Menon-Sen wrote: > Hi Thomas. > > At 2017-02-15 00:48:41 +1300, thomas.mu...@enterprisedb.com wrote: >> >> Here is a new version with the buffer on the sender side as requested. > > This looks good. Thanks for the review! >> +

Re: [HACKERS] Measuring replay lag

2017-02-16 Thread Simon Riggs
On 14 February 2017 at 11:48, Thomas Munro wrote: > On Wed, Feb 1, 2017 at 5:21 PM, Michael Paquier > wrote: >> On Sat, Jan 21, 2017 at 10:49 AM, Thomas Munro >> wrote: >>> Ok. I see that there is a new

Re: [HACKERS] Measuring replay lag

2017-02-16 Thread Abhijit Menon-Sen
Hi Thomas. At 2017-02-15 00:48:41 +1300, thomas.mu...@enterprisedb.com wrote: > > Here is a new version with the buffer on the sender side as requested. This looks good. > + write_lag > + interval > + Estimated time taken for recent WAL records to be written on this > + standby

Re: [HACKERS] Measuring replay lag

2017-02-14 Thread Simon Riggs
On 14 February 2017 at 11:48, Thomas Munro wrote: > Here is a new version with the buffer on the sender side as requested. Thanks, I will definitely review in good time to get this in PG10 -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL

Re: [HACKERS] Measuring replay lag

2017-02-14 Thread Thomas Munro
On Wed, Feb 1, 2017 at 5:21 PM, Michael Paquier wrote: > On Sat, Jan 21, 2017 at 10:49 AM, Thomas Munro > wrote: >> Ok. I see that there is a new compelling reason to move the ring >> buffer to the sender side: then I think lag tracking

Re: [HACKERS] Measuring replay lag

2017-01-31 Thread Michael Paquier
On Sat, Jan 21, 2017 at 10:49 AM, Thomas Munro wrote: > Ok. I see that there is a new compelling reason to move the ring > buffer to the sender side: then I think lag tracking will work > automatically for the new logical replication that just landed on > master.
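
For readers following the ring-buffer discussion, here is a minimal Python sketch of what sender-side lag sampling could look like, based only on the descriptions in this thread: the sender remembers (LSN, send time) pairs in a bounded buffer, and when the standby reports a position, the sender looks up how long ago that position was sent. The class name `LagTracker`, the methods `write` and `read`, the default capacity, and the drop-oldest policy are all illustrative assumptions, not the patch's actual code.

```python
from collections import deque


class LagTracker:
    """Hypothetical sender-side buffer of (lsn, send_time) samples."""

    def __init__(self, capacity=8192):
        # Assumption: when the buffer is full, the oldest sample is dropped.
        self.samples = deque(maxlen=capacity)

    def write(self, lsn, now):
        """Record that WAL up to `lsn` was sent at time `now`."""
        # Ignore calls that do not advance the position, so each
        # LSN keeps the time at which it was first sent.
        if self.samples and self.samples[-1][0] >= lsn:
            return
        self.samples.append((lsn, now))

    def read(self, reported_lsn, now):
        """Return the lag for the newest sample at or below `reported_lsn`.

        Consumes every sample the standby has reached. Returns None when
        no buffered sample has been reached yet.
        """
        sent_at = None
        while self.samples and self.samples[0][0] <= reported_lsn:
            _, sent_at = self.samples.popleft()
        if sent_at is None:
            return None
        return now - sent_at
```

Because both timestamps are taken on the sending server, this kind of measurement does not depend on the two machines' clocks agreeing. One tracker instance covers a single reported position; separate write, flush, and replay lags would need one read position each.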

Re: [HACKERS] Measuring replay lag

2017-01-20 Thread Thomas Munro
On Tue, Jan 17, 2017 at 7:45 PM, Fujii Masao wrote: > On Thu, Dec 22, 2016 at 6:14 AM, Thomas Munro > wrote: >> On Thu, Dec 22, 2016 at 2:14 AM, Fujii Masao wrote: >>> I agree that the capability to measure the

Re: [HACKERS] Measuring replay lag

2017-01-16 Thread Fujii Masao
On Thu, Dec 22, 2016 at 6:14 AM, Thomas Munro wrote: > On Thu, Dec 22, 2016 at 2:14 AM, Fujii Masao wrote: >> I agree that the capability to measure the remote_apply lag is very useful. >> Also I want to measure the remote_write and

Re: [HACKERS] Measuring replay lag

2017-01-04 Thread Thomas Munro
On Thu, Jan 5, 2017 at 12:03 AM, Thomas Munro wrote: > So perhaps I should get rid of that replication_lag_sample_interval > GUC and send back apply timestamps frequently, as you were saying. It > would add up to a third more replies. Oops, of course I meant to

Re: [HACKERS] Measuring replay lag

2017-01-04 Thread Thomas Munro
On Wed, Jan 4, 2017 at 8:58 PM, Simon Riggs wrote: > On 3 January 2017 at 23:22, Thomas Munro > wrote: > >>> I don't see why that would be unacceptable. If we do it for >>> remote_apply, why not also do it for other modes? Whatever the >>>

Re: [HACKERS] Measuring replay lag

2017-01-03 Thread Simon Riggs
On 3 January 2017 at 23:22, Thomas Munro wrote: >> I don't see why that would be unacceptable. If we do it for >> remote_apply, why not also do it for other modes? Whatever the >> reasoning was for remote_apply should work for other modes. I should >> add it was

Re: [HACKERS] Measuring replay lag

2017-01-03 Thread Thomas Munro
On Wed, Jan 4, 2017 at 12:22 PM, Thomas Munro wrote: > (replay_lag - (write_lag / 2) may be a cheap proxy > for a lag time that doesn't include the return network leg, and still > doesn't introduce clock difference error) (Upon reflection it's a terrible proxy for

Re: [HACKERS] Measuring replay lag

2017-01-03 Thread Thomas Munro
On Wed, Jan 4, 2017 at 12:22 PM, Thomas Munro wrote: > The patch streams (time-right-now, end-of-wal) to the standby in every > outgoing message, and then sees how long it takes for those timestamps > to be fed back to it. Correction: we already stream

Re: [HACKERS] Measuring replay lag

2017-01-03 Thread Thomas Munro
On Wed, Jan 4, 2017 at 1:06 AM, Simon Riggs wrote: > On 21 December 2016 at 21:14, Thomas Munro > wrote: >> I thought about that too, but I couldn't figure out how to make the >> sampling work. If the primary is choosing (LSN, time) pairs to

Re: [HACKERS] Measuring replay lag

2017-01-03 Thread Simon Riggs
On 21 December 2016 at 21:14, Thomas Munro wrote: > On Thu, Dec 22, 2016 at 2:14 AM, Fujii Masao wrote: >> I agree that the capability to measure the remote_apply lag is very useful. >> Also I want to measure the remote_write and remote_flush

Re: [HACKERS] Measuring replay lag

2017-01-02 Thread Thomas Munro
On Thu, Dec 29, 2016 at 1:28 AM, Thomas Munro wrote: > On Thu, Dec 22, 2016 at 10:14 AM, Thomas Munro > wrote: >> On Thu, Dec 22, 2016 at 2:14 AM, Fujii Masao wrote: >>> I agree that the capability to measure

Re: [HACKERS] Measuring replay lag

2016-12-28 Thread Thomas Munro
On Thu, Dec 22, 2016 at 10:14 AM, Thomas Munro wrote: > On Thu, Dec 22, 2016 at 2:14 AM, Fujii Masao wrote: >> I agree that the capability to measure the remote_apply lag is very useful. >> Also I want to measure the remote_write and

Re: [HACKERS] Measuring replay lag

2016-12-21 Thread Thomas Munro
On Thu, Dec 22, 2016 at 2:14 AM, Fujii Masao wrote: > I agree that the capability to measure the remote_apply lag is very useful. > Also I want to measure the remote_write and remote_flush lags, for example, > in order to diagnose the cause of replication lag. Good idea.

Re: [HACKERS] Measuring replay lag

2016-12-21 Thread Fujii Masao
On Mon, Dec 19, 2016 at 8:13 PM, Thomas Munro wrote: > On Mon, Dec 19, 2016 at 4:03 PM, Peter Eisentraut > wrote: >> On 11/22/16 4:27 AM, Thomas Munro wrote: >>> Thanks very much for testing! New version attached. I will add this

Re: [HACKERS] Measuring replay lag

2016-12-19 Thread Thomas Munro
On Mon, Dec 19, 2016 at 10:46 PM, Simon Riggs wrote: > On 26 October 2016 at 11:34, Thomas Munro > wrote: > >> It works by taking advantage of the { time, end-of-WAL } samples that >> sending servers already include in message headers to

Re: [HACKERS] Measuring replay lag

2016-12-19 Thread Thomas Munro
On Mon, Dec 19, 2016 at 4:03 PM, Peter Eisentraut wrote: > On 11/22/16 4:27 AM, Thomas Munro wrote: >> Thanks very much for testing! New version attached. I will add this >> to the next CF. > > I don't see it there yet. Thanks for the reminder. Added here:

Re: [HACKERS] Measuring replay lag

2016-12-19 Thread Simon Riggs
On 26 October 2016 at 11:34, Thomas Munro wrote: > It works by taking advantage of the { time, end-of-WAL } samples that > sending servers already include in message headers to standbys. That > seems to provide a pretty good proxy for when the WAL was written, if

Re: [HACKERS] Measuring replay lag

2016-12-18 Thread Peter Eisentraut
On 11/22/16 4:27 AM, Thomas Munro wrote: > Thanks very much for testing! New version attached. I will add this > to the next CF. I don't see it there yet. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services --

Re: [HACKERS] Measuring replay lag

2016-11-22 Thread Thomas Munro
On Tue, Nov 8, 2016 at 2:35 PM, Masahiko Sawada wrote:
> replay_lag_sample_interval is 1s by default but I got 1000s by SHOW command.
> postgres(1:36789)=# show replay_lag_sample_interval ;
>  replay_lag_sample_interval
> ----------------------------
>  1000s
> (1 row)

Re: [HACKERS] Measuring replay lag

2016-11-07 Thread Masahiko Sawada
On Wed, Oct 26, 2016 at 7:34 PM, Thomas Munro wrote: > Hi hackers, > > Here is a new version of my patch to add a replay_lag column to the > pg_stat_replication view (originally proposed as part of a larger > patch set for 9.6[1]), like this: Thank you for working