Re: [HACKERS] Timeline following for logical slots

2016-05-04 Thread Alvaro Herrera
Alvaro Herrera wrote:

> Here's a proposed revert patch.  Many minor changes (mostly comment
> additions) that were applied as part of this series are kept intact.
> The test_slot_timeline test module and corresponding recovery test
> script are also reverted.

Pushed.

-- 
Álvaro Herrera   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Timeline following for logical slots

2016-05-03 Thread Alvaro Herrera
I don't like reverting patches, but this patch is making me more and
more uncomfortable.  We have two open items, one of which requires
writing new test code that doesn't exist yet; and we have the
pg_recvlogical changes that were approved post-feature freeze, but that
I now have second thoughts about pushing right away.
Craig has also commented on some followup patch to change this whole
area in 9.7.

Per
https://www.postgresql.org/message-id/20160503165812.GA29604%40alvherre.pgsql
I think the best course of action is to revert the whole feature and
revisit for 9.7.

Here's a proposed revert patch.  Many minor changes (mostly comment
additions) that were applied as part of this series are kept intact.
The test_slot_timeline test module and corresponding recovery test
script are also reverted.

-- 
Álvaro Herrera   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
commit 72604a6c1c3e7952e9e11a554e40117cb974a463
Author: Alvaro Herrera 
AuthorDate: Tue May 3 16:37:01 2016 -0300
CommitDate: Tue May 3 17:18:17 2016 -0300

Revert timeline following in replication slots

This reverts commits f07d18b6e94d, 82c83b337202, 3a3b309041b0, and
24c5f1a103ce.

This feature has shown enough immaturity that it was deemed better to
rip it out before rushing some more fixes at the last minute.  There are
discussions on larger changes in this area for the next release.

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index c5a964a..c3aecc7 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -118,11 +118,6 @@ XLogReaderAllocate(XLogPageReadCB pagereadfunc, void *private_data)
 		return NULL;
 	}
 
-#ifndef FRONTEND
-	/* Will be loaded on first read */
-	state->timelineHistory = NIL;
-#endif
-
 	return state;
 }
 
@@ -142,10 +137,6 @@ XLogReaderFree(XLogReaderState *state)
 	pfree(state->errormsg_buf);
 	if (state->readRecordBuf)
 		pfree(state->readRecordBuf);
-#ifndef FRONTEND
-	if (state->timelineHistory)
-		list_free_deep(state->timelineHistory);
-#endif
 	pfree(state->readBuf);
 	pfree(state);
 }
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index f6ca2b9..cb4563e 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -19,7 +19,6 @@
 
 #include 
 
-#include "access/timeline.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
@@ -660,7 +659,6 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 	/* state maintained across calls */
 	static int	sendFile = -1;
 	static XLogSegNo sendSegNo = 0;
-	static TimeLineID sendTLI = 0;
 	static uint32 sendOff = 0;
 
 	p = buf;
@@ -676,8 +674,7 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 		startoff = recptr % XLogSegSize;
 
 		/* Do we need to switch to a different xlog segment? */
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo) ||
-			sendTLI != tli)
+		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo))
 		{
 			char		path[MAXPGPATH];
 
@@ -704,7 +701,6 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 	path)));
 			}
 			sendOff = 0;
-			sendTLI = tli;
 		}
 
 		/* Need to seek in the file? */
@@ -753,147 +749,6 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
 }
 
 /*
- * Determine XLogReaderState->currTLI and ->currTLIValidUntil;
- * XLogReaderState->EndRecPtr, ->currRecPtr and ThisTimeLineID affect the
- * decision.  This may later be used to determine which xlog segment file to
- * open, etc.
- *
- * We switch to an xlog segment from the new timeline eagerly when on a
- * historical timeline, as soon as we reach the start of the xlog segment
- * containing the timeline switch.  The server copied the segment to the new
- * timeline so all the data up to the switch point is the same, but there's no
- * guarantee the old segment will still exist. It may have been deleted or
- * renamed with a .partial suffix so we can't necessarily keep reading from
- * the old TLI even though tliSwitchPoint says it's OK.
- *
- * Because of this, callers MAY NOT assume that currTLI is the timeline that
- * will be in a page's xlp_tli; the page may begin on an older timeline or we
- * might be reading from historical timeline data on a segment that's been
- * copied to a new timeline.
- */
-static void
-XLogReadDetermineTimeline(XLogReaderState *state)
-{
-	/* Read the history on first time through */
-	if (state->timelineHistory == NIL)
-		state->timelineHistory = readTimeLineHistory(ThisTimeLineID);
-
-	/*
-	 * Are we reading the record immediately following the one we read last
-	 * time?  If not, then don't use the cached timeline info.
-	 */
-	if (state->currRecPtr != state->EndRecPtr)
-	{
-		state->currTLI = 0;
-		

Re: [HACKERS] Timeline following for logical slots

2016-05-03 Thread Alvaro Herrera
Craig Ringer wrote:
 
> With this patch pg_recvlogical takes a new --endpos LSN argument, and will
> exit if either:
> 
> * it receives an XLogData message with dataStart >= endpos; or
> * it receives a keepalive with walEnd >= endpos
> 
> The latter allows it to work without needing a dummy transaction to make it
> see a data message after endpos. If there's nothing to read on the socket
> until a keepalive we know that the server has nothing to send us, and if
> walend has passed endpos we know nothing can have committed before endpos.

Here's a slightly revised version of this patch, for later
consideration.

Changes:
- added -E as short form of --endpos (consistent with -I as --startpos)
- refactored some repetitive code in two auxiliary functions
- allow --endpos to work with --create-slot.
- revert some unrelated changes, such as message additions in verbose
  mode and changes to existing messages
- documentation reworked.

I didn't spot any bugs in Craig's patch, but it needs more testing.

-- 
Álvaro Herrera   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/doc/src/sgml/ref/pg_recvlogical.sgml b/doc/src/sgml/ref/pg_recvlogical.sgml
index 9d0b58b..6f23229 100644
--- a/doc/src/sgml/ref/pg_recvlogical.sgml
+++ b/doc/src/sgml/ref/pg_recvlogical.sgml
@@ -155,6 +155,41 @@ PostgreSQL documentation
  
 
  
+  -E lsn
+  --endpos=lsn
+  
+   
+In --start mode, automatically stop replication
+and exit with normal exit status 0 when receiving reaches the
+specified LSN.  If specified when not in --start
+mode, an error is raised.
+   
+
+   
+Note the following points:
+
+  
+  
+   If there's a record with LSN exactly equal to lsn,
+   the record will not be output.  If you want to receive up to and
+   including a given LSN, specify LSN + 1 as the desired stop point.
+  
+ 
+ 
+  
+   The --endpos option is not aware of transaction
+   boundaries and may truncate output partway through a transaction.
+   Any partially output transaction will not be consumed and will be
+   replayed again when the slot is next read from. Individual messages
+   are never truncated.
+  
+ 
+
+   
+  
+ 
+
+ 
   --if-not-exists
   

diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index 6d12705..5108222 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -37,6 +37,7 @@ static int	noloop = 0;
 static int	standby_message_timeout = 10 * 1000;		/* 10 sec = default */
 static int	fsync_interval = 10 * 1000; /* 10 sec = default */
 static XLogRecPtr startpos = InvalidXLogRecPtr;
+static XLogRecPtr endpos = InvalidXLogRecPtr;
 static bool do_create_slot = false;
 static bool slot_exists_ok = false;
 static bool do_start_slot = false;
@@ -60,6 +61,9 @@ static XLogRecPtr output_fsync_lsn = InvalidXLogRecPtr;
 static void usage(void);
 static void StreamLogicalLog(void);
 static void disconnect_and_exit(int code);
+static bool flushAndSendFeedback(PGconn *conn, TimestampTz *now);
+static void prepareToTerminate(PGconn *conn, XLogRecPtr endpos,
+   bool keepalive, XLogRecPtr lsn);
 
 static void
 usage(void)
@@ -78,6 +82,7 @@ usage(void)
 			 " time between fsyncs to the output file (default: %d)\n"), (fsync_interval / 1000));
 	printf(_("  --if-not-existsdo not error if slot already exists when creating a slot\n"));
 	printf(_("  -I, --startpos=LSN where in an existing slot should the streaming start\n"));
+	printf(_("  -E, --endpos=LSN   exit upon receiving the specified LSN\n"));
 	printf(_("  -n, --no-loop  do not loop on connection lost\n"));
 	printf(_("  -o, --option=NAME[=VALUE]\n"
 			 " pass option NAME with optional value VALUE to the\n"
@@ -278,6 +283,7 @@ StreamLogicalLog(void)
 		int			bytes_written;
 		int64		now;
 		int			hdr_len;
+		XLogRecPtr	cur_record_lsn = InvalidXLogRecPtr;
 
 		if (copybuf != NULL)
 		{
@@ -451,6 +457,7 @@ StreamLogicalLog(void)
 			int			pos;
 			bool		replyRequested;
 			XLogRecPtr	walEnd;
+			bool		endposReached = false;
 
 			/*
 			 * Parse the keepalive message, enclosed in the CopyData message.
@@ -473,18 +480,32 @@ StreamLogicalLog(void)
 			}
 			replyRequested = copybuf[pos];
 
-			/* If the server requested an immediate reply, send one. */
-			if (replyRequested)
+			if (endpos != InvalidXLogRecPtr && walEnd >= endpos)
 			{
-/* fsync data, so we send a recent flush pointer */
-if (!OutputFsync(now))
-	goto error;
+/*
+ * If there's nothing to read on the socket until a keepalive
+ * we know that the server has nothing to send us; and if
+ * walEnd has passed endpos, we know 

Re: [HACKERS] Timeline following for logical slots

2016-05-02 Thread Alvaro Herrera
Andres Freund wrote:

> -   /* oldest LSN that the client has acked receipt for */
> +   /*
> +* Oldest LSN that the client has acked receipt for.  This is used as the
> +* start_lsn point in case the client doesn't specify one, and also as a
> +* safety measure to back off in case the client specifies a start_lsn
> +* that's further in the future than this value.
> +*/
> XLogRecPtr  confirmed_flush;
> 
> This is the wrong way round. confirmed_flush is used if the client's
> start_lsn is further in the *past* than this value.

Bah.  Funnily enough I got this one right in the comment for
CreateDecodingContext.

Thanks for reading through the commit.

-- 
Álvaro Herrera   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Timeline following for logical slots

2016-05-02 Thread Andres Freund
On 2016-05-02 16:12:53 -0300, Alvaro Herrera wrote:
> I pushed a fix to some comments, including the ones being discussed in
> this subthread, which should hopefully close things here.
> 
> I'm now going to go over Craig's pg_recvlogical changes and the proposed
> for that problem.


-   /* oldest LSN that the client has acked receipt for */
+   /*
+* Oldest LSN that the client has acked receipt for.  This is used as the
+* start_lsn point in case the client doesn't specify one, and also as a
+* safety measure to back off in case the client specifies a start_lsn
+* that's further in the future than this value.
+*/
XLogRecPtr  confirmed_flush;

This is the wrong way round. confirmed_flush is used if the client's
start_lsn is further in the *past* than this value.
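
For illustration, the corrected behaviour amounts to clamping the client's
start point to confirmed_flush; a minimal standalone sketch (the function
name and typedefs are invented for this example, this is not the
CreateDecodingContext source):

#include <stdint.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/*
 * Illustrative only: decoding starts at the client-supplied start_lsn,
 * except that an unspecified or too-old start_lsn is replaced by
 * confirmed_flush (we never restart behind what the client already acked).
 */
static XLogRecPtr
effective_start_lsn(XLogRecPtr client_start_lsn, XLogRecPtr confirmed_flush)
{
	if (client_start_lsn == InvalidXLogRecPtr)
		return confirmed_flush;		/* no start point given by the client */
	if (client_start_lsn < confirmed_flush)
		return confirmed_flush;		/* don't back up past the acked position */
	return client_start_lsn;
}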




Re: [HACKERS] Timeline following for logical slots

2016-05-02 Thread Alvaro Herrera
I pushed a fix to some comments, including the ones being discussed in
this subthread, which should hopefully close things here.

I'm now going to go over Craig's pg_recvlogical changes and the proposed
for that problem.

-- 
Álvaro Herrera   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Timeline following for logical slots

2016-05-02 Thread Alvaro Herrera
Robert Haas wrote:
> On Tue, Apr 26, 2016 at 4:37 PM, Alvaro Herrera
>  wrote:

> > I failed to meet this deadline, for which I apologize.  This week is
> > going to be hectic around here, so my new deadline is to get these two
> > patches applied on Friday 29th.  Ok?
> 
> Not that you don't know this already, but you have also failed to meet
> this deadline.  Also, this open item has now been promoted to the much
> sought-after status of "oldest as-yet-unaddressed PostgreSQL 9.6 open
> item".

Yay.  I thought I was going to have Friday completely available, but
that ended up not happening.

I'm working on this exclusively now until I get the item closed.

-- 
Álvaro Herrera   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Timeline following for logical slots

2016-05-02 Thread Robert Haas
On Tue, Apr 26, 2016 at 4:37 PM, Alvaro Herrera
 wrote:
> Alvaro Herrera wrote:
>> Noah Misch wrote:
>>
>> > The above-described topic is currently a PostgreSQL 9.6 open item.  Alvaro,
>> > since you committed the patch believed to have created it, you own this
>> > open item.  If that responsibility lies elsewhere, please let us know whose
>> > responsibility it is to fix this.  Since new open items may be discovered
>> > at any time and I want to plan to have them all fixed well in advance of
>> > the ship date, I will appreciate your efforts toward speedy resolution.
>> > Please present, within 72 hours, a plan to fix the defect within seven
>> > days of this message.  Thanks.
>>
>> I plan to get Craig's patch applied on Monday 25th, giving me some more
>> time for study.
>
> I failed to meet this deadline, for which I apologize.  This week is
> going to be hectic around here, so my new deadline is to get these two
> patches applied on Friday 29th.  Ok?

Not that you don't know this already, but you have also failed to meet
this deadline.  Also, this open item has now been promoted to the much
sought-after status of "oldest as-yet-unaddressed PostgreSQL 9.6 open
item".

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Timeline following for logical slots

2016-04-29 Thread Craig Ringer
On 29 April 2016 at 15:40, Craig Ringer  wrote:


> I don't think pg_recvlogical can do anything about the need for that dummy
> write, since the client has no way to determine the exact LSN of the commit
> record of the xact of interest. It can't rely
> on pg_current_xlog_insert_location() or pg_current_xlog_location() since
> autovacuum or a checkpoint might've written xlog since. Logical streaming
> replication doesn't have a non-blocking mode where it returns immediately
> if it'd have to wait for more xlog so we can't just send off the most
> recent server LSN as the endpoint.
>

(Patch attached. Blah blah explanation blah):

With this patch pg_recvlogical takes a new --endpos LSN argument, and will
exit if either:

* it receives an XLogData message with dataStart >= endpos; or
* it receives a keepalive with walEnd >= endpos

The latter allows it to work without needing a dummy transaction to make it
see a data message after endpos. If there's nothing to read on the socket
until a keepalive we know that the server has nothing to send us, and if
walend has passed endpos we know nothing can have committed before endpos.
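
For illustration, both exit conditions reduce to the same comparison against
endpos; a minimal standalone sketch of that check (the function name and
typedefs here are invented, not taken from the patch):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/*
 * Illustrative only: for an XLogData message pass dataStart, for a
 * keepalive pass walEnd.  A true result means stop: in the data case we
 * exit without writing the record (endpos is an exclusive bound), in the
 * keepalive case nothing committing before endpos can still be pending.
 */
static bool
endpos_reached(XLogRecPtr endpos, XLogRecPtr lsn)
{
	return endpos != InvalidXLogRecPtr && lsn >= endpos;
}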


The way I've written things the endpos is the point where we stop receiving
and exit, so if a record with start lsn >= endpos is received we'll exit
without writing it.

I thought about writing out the record before exiting if the record start
LSN is exactly endpos. That'd be handy in cases where the client knows a
commit's LSN and wants everything up to that commit. But it's easy enough
in this case for the client to set endpos to the commit start lsn + 1, so
it's not like the current behaviour stops you doing anything, and it means
the code can just test endpos and exit. pg_current_xlog_insert_location()
will return at least the lsn of the last commit + 1, so you'll get the
expected behaviour for free there. It does mean we might wait for the next
walsender keepalive or status update before we exit, though, so if someone
feels strongly that endpos should be an inclusive bound I can do that. It's
just a bit uglier in the code.

I can't add a "number of xacts" filter like the SQL interface has because
pg_recvlogical has no idea which records represent a commit, so it's not
possible without changing the protocol. I'm not convinced a "number of
messages" filter is particularly useful. I could add a timeout, but it's
easy enough to do that in a wrapper (like IPC::Run). So I'm sticking with
just the LSN filter for now.

Also because pg_recvlogical isn't aware of transaction boundaries, setting
endpos might result in a partial transaction being output if endpos is
after the end of the last xact wanted and some other xact containing
changes made before endpos commits after endpos but before the next status
update/keepalive is sent. That xact won't be consumed from the server and
will just be sent when the slot is next read from. This won't result in
unpredictable output for testing since there we control what other xacts
run and will generally exit based on walsender status updates/keepalives.

Here's the patch. Docs included. Comments?

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From cc54f99d21de3c573873cce3467237caccfb1a33 Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Fri, 29 Apr 2016 18:18:19 +0800
Subject: [PATCH] Allow a stop LSN to be specified to pg_recvlogical

pg_recvlogical just runs until cancelled or until the upstream
server disconnects. For some purposes, especially testing, it's
useful to have the ability to stop receive at a specified LSN
without having to parse the output and deal with buffering issues,
etc.

Add a --endpos parameter that takes the LSN at which no further
messages should be written and receive should stop.
---
 doc/src/sgml/ref/pg_recvlogical.sgml   |  36 
 src/bin/pg_basebackup/pg_recvlogical.c | 102 +
 2 files changed, 127 insertions(+), 11 deletions(-)

diff --git a/doc/src/sgml/ref/pg_recvlogical.sgml b/doc/src/sgml/ref/pg_recvlogical.sgml
index 9d0b58b..8f1e8f6 100644
--- a/doc/src/sgml/ref/pg_recvlogical.sgml
+++ b/doc/src/sgml/ref/pg_recvlogical.sgml
@@ -155,6 +155,42 @@ PostgreSQL documentation
  
 
  
+  --endpos=lsn
+  
+   
+In --start mode, automatically stop replication and
+exit with normal exit status 0 when receive passes the specified LSN.
+Because of logical decoding's reordering of operations this actually
+means that no change that's part of a transaction that committed before
+endpos is still waiting to be received. There can still be
+pending changes made before endpos that are part of transactions that
+committed after endpos or are not yet committed.
+   
+   
+The endpos option is not aware of transaction boundaries and may
+truncate output partway through a 

Re: [HACKERS] Timeline following for logical slots

2016-04-29 Thread Craig Ringer
On 28 April 2016 at 02:29, Andres Freund  wrote:


>
> I don't object either, I was looking for the feature myself a number of
> times (for tap tests in my case).
>

Exactly what I want it for.


> It requires some amount of thinking about what the limit applies to
> though. "Messages sent by server"? Bytes? TCP messages? xids? Time?
>
>
The main thing I think would be useful is a threshold for message LSNs
after which it disconnects and exits. It doesn't have to be a commit
message; if we receive any message with upstream LSN after the LSN of
interest then there can't be any commits with a later LSN anyway, and that
way it'll be useful if/when streaming decoding is implemented too.

That way a test can capture the xlog insert lsn after doing the work it
wants to replay, do another dummy write to make sure there's something more
to replay, and decode up to that point.

I don't think pg_recvlogical can do anything about the need for that dummy
write, since the client has no way to determine the exact LSN of the commit
record of the xact of interest. It can't rely
on pg_current_xlog_insert_location() or pg_current_xlog_location() since
autovacuum or a checkpoint might've written xlog since. Logical streaming
replication doesn't have a non-blocking mode where it returns immediately
if it'd have to wait for more xlog so we can't just send off the most
recent server LSN as the endpoint.

Ideally I think this should be done server-side with a limit on replay LSN
and a non-blocking option, but that's way too invasive for this stage in
the release cycle. Client-side seems fine enough.

Number of xacts received would also be handy, but I don't know if I'll get
to that.

pg_receivexlog should send confirmation on exit.

(The other thing I've wanted sometimes is a --peek option, but that's again
not really in scope for this.)


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Timeline following for logical slots

2016-04-27 Thread Andres Freund
On 2016-04-27 14:21:53 -0400, Robert Haas wrote:
> > Considering that pg_recvlogical was introduced mostly as a way to test
> > logical decoding features, I think this is a serious oversight and we
> > should patch it.  I suppose we could leave it for 9.7, though I admit I
> > would prefer to introduce it in 9.6.  Now everyone throwing stones at
> > me in 3 ... 2 ...
> 
> If that's a small and relatively contained change, I don't object to
> the idea of you making it, provided it's done pretty soon, and
> definitely before beta1.  If it's going to require substantial
> refactoring or can't be done quickly, then, yes, I object.

I don't object either, I was looking for the feature myself a number of
times (for tap tests in my case).

It requires some amount of thinking about what the limit applies to
though. "Messages sent by server"? Bytes? TCP messages? xids? Time?

- Andres




Re: [HACKERS] Timeline following for logical slots

2016-04-27 Thread Robert Haas
On Tue, Apr 26, 2016 at 4:35 PM, Alvaro Herrera
 wrote:
> Craig Ringer wrote:
>> The reason the new src/test/recovery/ tests don't notice this is that using
>> pg_recvlogical from the TAP tests is currently pretty awkward.
>> pg_recvlogical has no way to specify a stop-point or return when there's no
>> immediately pending data like the SQL interface does. So you have to read
>> from it until you figure it's not going to return anything more, then kill
>> it and look at what it did return and hope you don't lose anything in
>> buffering. I don't much like relying on timeouts as part of normal
>> successful results since they can upset some of the older and slower
>> buildfarm members. I'd rather be able to pass a --stoppos= or a --n-xacts=
>> option, but it was a bit too late to add those.
>
> Considering that pg_recvlogical was introduced mostly as a way to test
> logical decoding features, I think this is a serious oversight and we
> should patch it.  I suppose we could leave it for 9.7, though I admit I
> would prefer to introduce it in 9.6.  Now everyone throwing stones at
> me in 3 ... 2 ...

If that's a small and relatively contained change, I don't object to
the idea of you making it, provided it's done pretty soon, and
definitely before beta1.  If it's going to require substantial
refactoring or can't be done quickly, then, yes, I object.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Timeline following for logical slots

2016-04-26 Thread Noah Misch
On Tue, Apr 26, 2016 at 05:37:40PM -0300, Alvaro Herrera wrote:
> Alvaro Herrera wrote:
> > Noah Misch wrote:
> > > The above-described topic is currently a PostgreSQL 9.6 open item.  Alvaro,
> > > since you committed the patch believed to have created it, you own this
> > > open item.  If that responsibility lies elsewhere, please let us know whose
> > > responsibility it is to fix this.  Since new open items may be discovered
> > > at any time and I want to plan to have them all fixed well in advance of
> > > the ship date, I will appreciate your efforts toward speedy resolution.
> > > Please present, within 72 hours, a plan to fix the defect within seven
> > > days of this message.  Thanks.
> > 
> > I plan to get Craig's patch applied on Monday 25th, giving me some more
> > time for study.
> 
> I failed to meet this deadline, for which I apologize.  This week is
> going to be hectic around here, so my new deadline is to get these two
> patches applied on Friday 29th.  Ok?

Okay; thanks for this notification.




Re: [HACKERS] Timeline following for logical slots

2016-04-26 Thread Alvaro Herrera
Alvaro Herrera wrote:
> Noah Misch wrote:
> 
> > The above-described topic is currently a PostgreSQL 9.6 open item.  Alvaro,
> > since you committed the patch believed to have created it, you own this open
> > item.  If that responsibility lies elsewhere, please let us know whose
> > responsibility it is to fix this.  Since new open items may be discovered at
> > any time and I want to plan to have them all fixed well in advance of the
> > ship date, I will appreciate your efforts toward speedy resolution.  Please
> > present, within 72 hours, a plan to fix the defect within seven days of this
> > message.  Thanks.
> 
> I plan to get Craig's patch applied on Monday 25th, giving me some more
> time for study.

I failed to meet this deadline, for which I apologize.  This week is
going to be hectic around here, so my new deadline is to get these two
patches applied on Friday 29th.  Ok?

-- 
Álvaro Herrera   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Timeline following for logical slots

2016-04-26 Thread Alvaro Herrera
Craig Ringer wrote:

> The reason the new src/test/recovery/ tests don't notice this is that using
> pg_recvlogical from the TAP tests is currently pretty awkward.
> pg_recvlogical has no way to specify a stop-point or return when there's no
> immediately pending data like the SQL interface does. So you have to read
> from it until you figure it's not going to return anything more, then kill
> it and look at what it did return and hope you don't lose anything in
> buffering. I don't much like relying on timeouts as part of normal
> successful results since they can upset some of the older and slower
> buildfarm members. I'd rather be able to pass a --stoppos= or a --n-xacts=
> option, but it was a bit too late to add those.

Considering that pg_recvlogical was introduced mostly as a way to test
logical decoding features, I think this is a serious oversight and we
should patch it.  I suppose we could leave it for 9.7, though I admit I
would prefer to introduce it in 9.6.  Now everyone throwing stones at
me in 3 ... 2 ...

> Per the separate mail I sent to hackers, xlogreader is currently pretty
> much timeline-agnostic. The timeline tracking code needs to know when the
> xlogreader does a random read and do a fresh lookup so its state is part of
> the XLogReaderState struct. But the walsender's xlogreader page read
> callback doesn't use the xlogreader's state, and it can't because it's also
> used for physical replication, which communicates the timeline to read from
> with the page read function via globals.

The globals there are a disaster and I welcome efforts to get rid of
them.

> So for now, I just set the globals before calling the page read
> function:
> 
> +   XLogReadDetermineTimeline(state);
> +   sendTimeLineIsHistoric = state->currTLI == ThisTimeLineID;
> +   sendTimeLine = state->currTLI;
> +   sendTimeLineValidUpto = state->currTLIValidUntil;
> +   sendTimeLineNextTLI = state->nextTLI;

Ok, this is pretty ugly but seems acceptable.

> Per separate mail, I'd quite like to tackle the level of duplication in
> timeline following logic in 9.7 and maybe see if the _three_ separate read
> xlog page functions can be unified at the same time. But that sure isn't
> 9.6 material.

Agreed.

-- 
Álvaro Herrera   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Timeline following for logical slots

2016-04-26 Thread Andres Freund
On 2016-04-26 17:22:49 -0300, Alvaro Herrera wrote:
> > -   /* oldest LSN that might be required by this replication slot */
> > +   /*
> > +* oldest LSN that might be required by this replication slot.
> > +*
> > +* Points to the LSN of the most recent xl_running_xacts record where
> > +* no transaction that commits after confirmed_flush is already in
> > +* progress. At this point all xacts committing after confirmed_flush
> > +* can be read entirely into reorder buffers and all visibility
> > +* information can be reconstructed.
> > +*/
> > XLogRecPtr  restart_lsn;
> 
> I'm unsure about this one.  Why are we speaking of reorder buffers here,
> if this is generically about replication slots?  Is it that slots used
> by physical replication do not *need* a restart_lsn?

It's used in physical slots as well, so I agree.  Also I think the
xl_running_xacts bit is going into way to much detail; it could very
well just be shutdown checkpoint or other similar locations.

> > diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
> > index 5af6751..5f74941 100644
> > --- a/src/backend/replication/logical/logicalfuncs.c
> > +++ b/src/backend/replication/logical/logicalfuncs.c
> > @@ -236,8 +236,10 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
> > PG_TRY();
> > {
> > /*
> > -* Passing InvalidXLogRecPtr here causes decoding to start returning
> > -* commited xacts to the client at the slot's confirmed_flush.
> > +* Passing InvalidXLogRecPtr here causes logical decoding to
> > +* start calling the output plugin for transactions that commit
> > +* at or after the slot's confirmed_flush. This filters commits
> > +* out from the client but they're still decoded.
> >  */
> > ctx = CreateDecodingContext(InvalidXLogRecPtr,
> > options,
> 
> I think it's weird to have the parameter explained in the callsite rather
> than in the comment about CreateDecodingContext.  I think this patch
> needs to apply to logical.c, not logicalfuncs.

I still think the entire comment ought to be removed.


- Andres




Re: [HACKERS] Timeline following for logical slots

2016-04-26 Thread Alvaro Herrera
Craig Ringer wrote:

> - /* oldest LSN that the client has acked receipt for */
> + /*
> +  * oldest LSN that the client has acked receipt for
> +  *
> +  * Decoding will start calling output plugin callbacks for commits
> +  * after this LSN unless a different start point is specified by
> +  * the client.
> +  */
>   XLogRecPtr  confirmed_flush;

How about this wording?

/*
 * Oldest LSN that the client has acked receipt for.  This is used as
 * the start_lsn point in case the client doesn't specify one, and also
 * as a safety measure to back off in case the client specifies a
 * start_lsn that's further in the future than this value.
 */
XLogRecPtr  confirmed_flush;

> - /* oldest LSN that might be required by this replication slot */
> + /*
> +  * oldest LSN that might be required by this replication slot.
> +  *
> +  * Points to the LSN of the most recent xl_running_xacts record where
> +  * no transaction that commits after confirmed_flush is already in
> +  * progress. At this point all xacts committing after confirmed_flush
> +  * can be read entirely into reorder buffers and all visibility
> +  * information can be reconstructed.
> +  */
>   XLogRecPtr  restart_lsn;

I'm unsure about this one.  Why are we speaking of reorder buffers here,
if this is generically about replication slots?  Is it that slots used
by physical replication do not *need* a restart_lsn?  I think the
comment should be worded in a way that's not specific to logical
decoding; or, if the restart_lsn field *is* specific to logical
decoding, then the comment should state as much.


> diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
> index 5af6751..5f74941 100644
> --- a/src/backend/replication/logical/logicalfuncs.c
> +++ b/src/backend/replication/logical/logicalfuncs.c
> @@ -236,8 +236,10 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
>   PG_TRY();
>   {
>   /*
> -  * Passing InvalidXLogRecPtr here causes decoding to start returning
> -  * commited xacts to the client at the slot's confirmed_flush.
> +  * Passing InvalidXLogRecPtr here causes logical decoding to
> +  * start calling the output plugin for transactions that commit
> +  * at or after the slot's confirmed_flush. This filters commits
> +  * out from the client but they're still decoded.
>*/
>   ctx = CreateDecodingContext(InvalidXLogRecPtr,
>   options,

I think it's weird to have the parameter explained in the callsite rather
than in the comment about CreateDecodingContext.  I think this patch
needs to apply to logical.c, not logicalfuncs.

> @@ -262,11 +264,9 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
>   ctx->output_writer_private = p;
>  
>   /*
> -  * We start reading xlog from the restart lsn, even though we won't
> -  * start returning decoded data to the user until the point set up in
> -  * CreateDecodingContext. The restart_lsn is far enough back that we'll
> -  * see the beginning of any xact we're expected to return to the client
> -  * so we can assemble a complete reorder buffer for it.
> +  * Reading and decoding of WAL must start at restart_lsn so that the
> +  * entirety of each of the xacts that commit after confimed_lsn can be
> +  * accumulated into reorder buffers.
>   */
>   startptr = MyReplicationSlot->data.restart_lsn;

This one looks fine to me, except typo s/confimed/confirmed/

-- 
Álvaro Herrera   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Timeline following for logical slots

2016-04-21 Thread Alvaro Herrera
Noah Misch wrote:

> The above-described topic is currently a PostgreSQL 9.6 open item.  Alvaro,
> since you committed the patch believed to have created it, you own this open
> item.  If that responsibility lies elsewhere, please let us know whose
> responsibility it is to fix this.  Since new open items may be discovered at
> any time and I want to plan to have them all fixed well in advance of the ship
> date, I will appreciate your efforts toward speedy resolution.  Please
> present, within 72 hours, a plan to fix the defect within seven days of this
> message.  Thanks.

I plan to get Craig's patch applied on Monday 25th, giving me some more
time for study.
-- 
Álvaro Herrera   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Timeline following for logical slots

2016-04-19 Thread Noah Misch
On Sun, Apr 17, 2016 at 10:01:36PM +0800, Craig Ringer wrote:
> I made an unfortunate oversight in the logical decoding timeline following
> code: it doesn't work for logical decoding from the walsender, because I
> left the glue code required out of the final cut of the patch.

> Subject: [PATCH] Enable logical timeline following in the walsender
> 
> ---
>  src/backend/access/transam/xlogutils.c |  7 +++
>  src/backend/replication/walsender.c| 11 +++
>  src/include/access/xlogreader.h|  3 +++
>  src/include/access/xlogutils.h |  2 ++
>  4 files changed, 15 insertions(+), 8 deletions(-)

[This is a generic notification.]

The above-described topic is currently a PostgreSQL 9.6 open item.  Alvaro,
since you committed the patch believed to have created it, you own this open
item.  If that responsibility lies elsewhere, please let us know whose
responsibility it is to fix this.  Since new open items may be discovered at
any time and I want to plan to have them all fixed well in advance of the ship
date, I will appreciate your efforts toward speedy resolution.  Please
present, within 72 hours, a plan to fix the defect within seven days of this
message.  Thanks.




Re: [HACKERS] Timeline following for logical slots

2016-04-17 Thread Craig Ringer
Hi all

I made an unfortunate oversight in the logical decoding timeline following
code: it doesn't work for logical decoding from the walsender, because I
left the glue code required out of the final cut of the patch.

The reason the new src/test/recovery/ tests don't notice this is that using
pg_recvlogical from the TAP tests is currently pretty awkward.
pg_recvlogical has no way to specify a stop-point or return when there's no
immediately pending data like the SQL interface does. So you have to read
from it until you figure it's not going to return anything more, then kill
it and look at what it did return and hope you don't lose anything in
buffering. I don't much like relying on timeouts as part of normal
successful results since they can upset some of the older and slower
buildfarm members. I'd rather be able to pass a --stoppos= or a --n-xacts=
option, but it was a bit too late to add those.

Per the separate mail I sent to hackers, xlogreader is currently pretty
much timeline-agnostic. The timeline tracking code needs to know when the
xlogreader does a random read and do a fresh lookup so its state is part of
the XLogReaderState struct. But the walsender's xlogreader page read
callback doesn't use the xlogreader's state, and it can't because it's also
used for physical replication, which communicates the timeline to read from
with the page read function via globals. So for now, I just set the globals
before calling the page read function:

+   XLogReadDetermineTimeline(state);
+   sendTimeLineIsHistoric = state->currTLI == ThisTimeLineID;
+   sendTimeLine = state->currTLI;
+   sendTimeLineValidUpto = state->currTLIValidUntil;
+   sendTimeLineNextTLI = state->nextTLI;

Per separate mail, I'd quite like to tackle the level of duplication in
timeline following logic in 9.7 and maybe see if the _three_ separate read
xlog page functions can be unified at the same time. But that sure isn't
9.6 material.
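
For reference, a minimal sketch of the timeline-tracking fields this work
keeps inside XLogReaderState, with invented typedefs so it stands alone
(the real struct has many more members, and the backend-only fields are
guarded by #ifndef FRONTEND there):

#include <stdint.h>

typedef uint64_t XLogRecPtr;
typedef uint32_t TimeLineID;

/*
 * Sketch only.  Field names match the patches in this thread;
 * timelineHistory is really a backend List *.
 */
typedef struct TimelineReadState
{
	void	   *timelineHistory;	/* timeline history, loaded on first read */
	TimeLineID	currTLI;			/* timeline currently being read from */
	XLogRecPtr	currTLIValidUntil;	/* LSN up to which currTLI is known valid */
	TimeLineID	nextTLI;			/* timeline following currTLI, if known */
} TimelineReadState;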
From c56a32b5ace8a48908da366e5f778fa98a125740 Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Thu, 14 Apr 2016 15:43:20 +0800
Subject: [PATCH] Enable logical timeline following in the walsender

---
 src/backend/access/transam/xlogutils.c |  7 +++
 src/backend/replication/walsender.c| 11 +++
 src/include/access/xlogreader.h|  3 +++
 src/include/access/xlogutils.h |  2 ++
 4 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index c3213ac..c3b0e5c 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -773,7 +773,7 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
  * might be reading from historical timeline data on a segment that's been
  * copied to a new timeline.
  */
-static void
+void
 XLogReadDetermineTimeline(XLogReaderState *state)
 {
 	/* Read the history on first time through */
@@ -856,12 +856,11 @@ XLogReadDetermineTimeline(XLogReaderState *state)
 			   state->currTLIValidUntil == InvalidXLogRecPtr)
 		{
 			XLogRecPtr	tliSwitch;
-			TimeLineID	nextTLI;
 
 			CHECK_FOR_INTERRUPTS();
 
 			tliSwitch = tliSwitchPoint(state->currTLI, state->timelineHistory,
-	   &nextTLI);
+	   &state->nextTLI);
 
 			/* round ValidUntil down to start of seg containing the switch */
 			state->currTLIValidUntil =
@@ -875,7 +874,7 @@ XLogReadDetermineTimeline(XLogReaderState *state)
  *
  * If that's the current TLI we'll stop searching.
  */
-state->currTLI = nextTLI;
+state->currTLI = state->nextTLI;
 state->currTLIValidUntil = InvalidXLogRecPtr;
 			}
 		}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e4a0119..495bff2 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -47,6 +47,7 @@
 #include "access/transam.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "access/xlogutils.h"
 
 #include "catalog/pg_type.h"
 #include "commands/dbcommands.h"
@@ -756,6 +757,12 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	XLogRecPtr	flushptr;
 	int			count;
 
+	XLogReadDetermineTimeline(state);
+	sendTimeLineIsHistoric = state->currTLI == ThisTimeLineID;
+	sendTimeLine = state->currTLI;
+	sendTimeLineValidUpto = state->currTLIValidUntil;
+	sendTimeLineNextTLI = state->nextTLI;
+
 	/* make sure we have enough WAL available */
 	flushptr = WalSndWaitForWal(targetPagePtr + reqLen);
 
@@ -984,10 +991,6 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 	pq_endmessage();
 	pq_flush();
 
-	/* setup state for XLogReadPage */
-	sendTimeLineIsHistoric = false;
-	sendTimeLine = ThisTimeLineID;
-
 	/*
 	 * Initialize position to the last ack'ed one, then the xlog records begin
 	 * to be shipped from that position.
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 300747d..bbee552 100644
--- 

Re: [HACKERS] Timeline following for logical slots

2016-04-07 Thread Craig Ringer
On 7 April 2016 at 23:32, Robert Haas  wrote:


> > Yeah. I understand the reasons for that decision. Per an earlier reply I
> > think we can avoid making them WAL-logged so they can be used on standbys
> > and still achieve usable failover support on physical replicas.
>
> I think at one point we may have discussed doing this via additional
> side-channel protocol messages.  Is that what you are thinking about
> now, or something else?
>

Essentially, yes.

The way I'd like to do it in 9.6+1 is:


- Require that the replica(s) use streaming replication with a replication
slot to connect to the master

- Extend the feedback protocol to allow the replica to push its required
catalog_xmin up to the master so it doesn't vacuum away catalog tuples
still needed by a replica. (There's no need for it to push the restart_lsn,
and it's fine for the master to throw away WAL still needed by a replica).

- Track the replica's catalog_xmin on the replica's slot. So it'll be a
slot used for physical replication that also has a catalog_xmin set.

- Allow applications to create their own slots on read-replicas (for apps
that want to decode from a standby).

- For transparent failover sync slot state from master to replica via
writes by a helper to a table on the master that get applied by a helper on
the standby.



Allowing apps to create slots on a replica can be used by aware apps to do
failover but only if they know about and can connect to all the failover
candidate replica(s), and they have to maintain and advance a slot on each.
Potentially fragile. So this is mostly good for supporting decoding from a
standby, rather than failover.

I really want easy, practical failover that doesn't require every app to
re-implement logic to keep track of replicas itself, etc. For that I'd
have a bgworker that runs on the replica make a direct libpq connection
to the master and snapshot the state of its slots plus its xlog insert
position. The worker would wait until the replica's replay reaches/passes
that xlog position before applying the slot state copied from the master to
the replica. (Adding a "GET SLOT STATE" or whatever to the walsender
interface would make this less ugly). This basically emulates what failover
slots did, but lazily: with no hook to capture slot state save we have to
poll the master. With no ability to insert the changes into WAL and run a
custom redo function on the replica we have to manually ensure they're
applied at the right time. Unlike with failover slots it's possible for the
slot on the replica to be behind where it was on the master at the same LSN
- but that's OK because we're protecting against catalog vacuum, per above.

(I'd really like to have a slot flag that lets us disallow replay from such
copied slots on the replica, marking them as usable only if not in
recovery.)

The only part of this that isn't possible on 9.6 is having the replica push
the catalog_xmin up to the master over feedback. But we can emulate that
with a bgworker on the replica that maintains an additional dummy logical
slot on the master. It replays the dummy slot to the lowest confirmed_lsn
of any slot on the replica. Somewhat inefficient, but will work.
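
A rough, self-contained sketch of the lazy slot-sync loop described above;
every helper here is hypothetical (stubbed out only so the snippet
compiles) and merely stands in for the libpq polling of the master and the
backend internals on the replica:

#include <stdint.h>
#include <unistd.h>

typedef uint64_t XLogRecPtr;
typedef uint32_t TransactionId;

typedef struct SlotSnapshot
{
	XLogRecPtr	confirmed_flush;
	XLogRecPtr	restart_lsn;
	TransactionId catalog_xmin;
	XLogRecPtr	master_insert_lsn;	/* master's insert position when sampled */
} SlotSnapshot;

/* Hypothetical helpers, stubbed so the sketch compiles. */
static SlotSnapshot sample_master_slot(void) { SlotSnapshot s = {0, 0, 0, 0}; return s; }
static XLogRecPtr local_replay_lsn(void) { return 0; }
static void apply_slot_snapshot_locally(SlotSnapshot s) { (void) s; }

static void
slot_sync_worker(void)
{
	for (;;)
	{
		/* 1. poll the master: snapshot a slot's state and the insert position */
		SlotSnapshot snap = sample_master_slot();

		/* 2. wait until local replay has reached/passed that insert position */
		while (local_replay_lsn() < snap.master_insert_lsn)
			sleep(1);

		/* 3. only now copy the slot state onto the replica */
		apply_slot_snapshot_locally(snap);

		sleep(10);				/* poll interval */
	}
}

The ordering is the point: the state sampled in step 1 is applied only after
local replay has passed the insert position it was sampled at, which is why
the replica's copy of the slot lagging the master's at the same LSN is
acceptable.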

If it sounds like a bit of a pile of hacks, that's because the failover
support part is. But unlike failover slots it will bring us closer to being
able to do logical decoding from a replica, which is nice. It can be made a
lot less ugly if the help of the walsender can be enlisted to report the
master's slot state, so we don't have to use normal libpq. (The reason I
wouldn't use a bgworker on the master to write it to a table then another
worker to apply changes from that table on the replica is mainly that then
we can't have failover support for ascading replicas, which can't write
WAL).


> Well, we can agree to disagree on this.  I don't think that it's all
> that difficult to figure out how to change your schema in a
> step-by-step way that allows logical replication to keep working while
> the nodes are out of sync, but you don't have to agree and that's
> fine.  I'm not objecting to eventually adding that feature to core.  I
> do think it's a bad idea to be polishing that sort of thing before
> getting some more basic facility into core.
>

That much I agree on - I certainly don't want to block this on DDL
replication.


> While I acknowledge that a logical output plugin has applications
> beyond replication, I think replication is the preeminent one by a
> considerable margin.  Some people want it for other purposes, and we
> should facilitate that.  But that number of people is dwarfed by the
> number who would use a seamless logical replication facility if we had
> one available.  And so that's the thing I think we should focus on
> making work.
>

Agreed.

Really I think we'll want a separate json output plugin for most of those
other uses anyway. Though some of the facilities in pglogical_output will
need to be extracted, added 

Re: [HACKERS] Timeline following for logical slots

2016-04-07 Thread Petr Jelinek

On 07/04/16 17:32, Robert Haas wrote:

Second, I'm not sure whether it was a good design decision to make
logical slots a special kind of object that sit off to the side,
neither configuration (like postgresql.conf) nor WAL-protected data
(like pg_clog and the data files themselves), but it was certainly a
very deliberate decision.  I sort of expected them to be WAL-logged,
but Andres argued (not unconvincingly) that we'd want to have slots on
standbys, and making them WAL-logged would preclude that.


Yeah. I understand the reasons for that decision. Per an earlier reply I
think we can avoid making them WAL-logged so they can be used on standbys
and still achieve usable failover support on physical replicas.


I think at one point we may have discussed doing this via additional
side-channel protocol messages.  Is that what you are thinking about
now, or something else?



I think the most promising idea was to use pull model instead of push 
model for slot updates and using feedback to tell master how far the 
oldest slot on standby is. This has the additional advantage of solving 
the biggest problem with decoding on standby (keeping old enough catalog 
xid and lsn). And also solves your concern of propagating through whole 
cascade as this can be done in more controllable way (from bottom up in 
the replication chain).



For version one, I would cut all of
the stuff that allows data to be sent in any format other than text,
and just use in/outfuncs all the time.


Agreed here, I think doing the binary transfer for base types, etc is 
just optimization, not necessity. This is one of the reasons why I 
wanted to get a bit broader feedback on the protocol - to get some kind of 
consensus about what we want now, what we might want in the future and 
how to handle getting from now to future without too much complexity or 
breakage. For example currently we have flags for every protocol message 
which I am not sure are completely necessary there, OTOH we probably 
don't want to bump protocol version with every new version of Postgres 
either.




I do generally think that logical decoding relies too much on trying
to set up situations where it will never fail, and I've said from the
beginning that it should make more provision to cope with failure
rather than just trying to avoid it.  If logical decoding never
breaks, great.  But the approach I would favor is to set things up so
that it automatically reclones if there is a replication break, and
then as an optimization project, try to eliminate those cases one by
one.



Well that really depends. I've seen way too many cases where people use 
logical replication as a transport mechanism, rather than replication 
where the destination is the same as the source, and in those scenarios there is 
often no way to "reclone" either because the historical data are no 
longer on the source or because the data on the target were already 
updated after they've been replicated. But in general the idea of 
recovering from error rather than being hell bent on preventing it is 
something I am pushing as well. For example it should be easier to look 
at what's in replication queue and remove things from there if needed.


--
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Timeline following for logical slots

2016-04-07 Thread Robert Haas
On Tue, Apr 5, 2016 at 10:34 PM, Craig Ringer  wrote:
>> Right, but right now you probably *aren't* do doing any kind of
>> logical decoding from a master server to a standby, because there's
>> squat in the core distribution that could make use of that capability.
>> So you never get as far as discovering this problem.
>
> OK, I'm starting to understand where this is going.
>
> pglogical and pglogical_output are irrelevant so long as they haven't yet
> landed in core. There is no logical replication worth considering, therefore
> anything that improves logical replication is not worth the effort because
> there's really nothing to improve. Why bother with enhancements like
> timeline following when nothing uses it.
>
> Right?

"Irrelevant" is a strong word, and I wouldn't personally go that far.
But they would be a lot *more* relevant if they landed in core, at
least IMHO.

>> Second, I'm not sure whether it was a good design decision to make
>> logical slots a special kind of object that sit off to the side,
>> neither configuration (like postgresql.conf) nor WAL-protected data
>> (like pg_clog and the data files themselves), but it was certainly a
>> very deliberate decision.  I sort of expected them to be WAL-logged,
>> but Andres argued (not unconvincingly) that we'd want to have slots on
>> standbys, and making them WAL-logged would preclude that.
>
> Yeah. I understand the reasons for that decision. Per an earlier reply I
> think we can avoid making them WAL-logged so they can be used on standbys
> and still achieve usable failover support on physical replicas.

I think at one point we may have discussed doing this via additional
side-channel protocol messages.  Is that what you are thinking about
now, or something else?

>> If you retype a column from text to integer,
>> you probably aren't storing anything in it other than integers, in
>> which case it is not necessarily the case that you are locked into
>> applying that change at a particular point in the change stream.
>
> ... if you're replicating in text format, not send/recv format, yeah. You
> get away with it in that particular case. But that *only* works where you're
> using text format and the types are convertable via text representation.
>
> It's also just one example. Adding a NOT NULL column bites you, hard, for
> example.
>
> Asking users to do this themselves is incredibly fragile and requires much
> more knowledge of PostgreSQL innards than I think is realistic. It's part of
> why Slony-I is difficult too.
>
> Sure, you can avoid having DDL replication if you're really, really careful
> and know exactly what you're doing. I don't think it's a pre-req for getting
> logical replication into core, but I do think it's a pretty strong
> requirement for taking the use of logical replication for HA and failover
> seriously for general use.

Well, we can agree to disagree on this.  I don't think that it's all
that difficult to figure out how to change your schema in a
step-by-step way that allows logical replication to keep working while
the nodes are out of sync, but you don't have to agree and that's
fine.  I'm not objecting to eventually adding that feature to core.  I
do think it's a bad idea to be polishing that sort of thing before
getting some more basic facility into core.

>> To be honest, I was shocked that pglogical and pglogical_output didn't
>> go into this release.  I assumed that you and other folks at
>> 2ndQuadrant were going to make a big push to get that done.  I did
>> take a brief look at one of them - pglogical, I think - a week or two
>> ago but there were unaddressed review comments that had been pending
>> for months and there were a lot of fairly obvious things that needed
>> to be done before it could be seriously considered as a core
>> submission.  Like, for example, rewriting the documentation heavily
>> and making it look like the rest of our docs, and putting it in SGML
>> format.  The code seemed to need quite a bit of cleanup, too.
>
> Yes, I agree, and I think those comments are largely fair. Both Petr and I
> had more limited time to work on them than either of us had expected, so
> timely follow-up was a challenge. We also got them ready for initial
> submission later in the cycle than we'd hoped. That said, there was also
> very little review response or interest for a long time after they were
> published... and by the time there was I think both Petr and I were bogged a
> bit in other project work.
>
> In retrospect I think the separate submission of the output plugin was a
> mistake too, at least when called pglogical_output; in part because of the
> name it was completely ignored as meaningless without the downstream and
> written off as useful for nothing except logical replication. So no momentum
> was built by the time the downstream was published.

While I acknowledge that a logical output plugin has applications
beyond replication, I think replication is the preeminent 

Re: [HACKERS] Timeline following for logical slots

2016-04-06 Thread Craig Ringer
Draft patch for comment fix.
From 304453f67b26219dc94e6aef9dfdc1c1debf12bb Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Wed, 6 Apr 2016 14:57:26 +0800
Subject: [PATCH] Fix incorrect comments introduced in logical decoding
 timeline following

---
 src/backend/replication/logical/logicalfuncs.c | 14 +++---
 src/include/replication/slot.h | 18 --
 2 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index 5af6751..5f74941 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -236,8 +236,10 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 	PG_TRY();
 	{
 		/*
-		 * Passing InvalidXLogRecPtr here causes decoding to start returning
-		 * commited xacts to the client at the slot's confirmed_flush.
+		 * Passing InvalidXLogRecPtr here causes logical decoding to
+		 * start calling the output plugin for transactions that commit
+		 * at or after the slot's confirmed_flush. This filters commits
+		 * out from the client but they're still decoded.
 		 */
 		ctx = CreateDecodingContext(InvalidXLogRecPtr,
 	options,
@@ -262,11 +264,9 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 		ctx->output_writer_private = p;
 
 		/*
-		 * We start reading xlog from the restart lsn, even though we won't
-		 * start returning decoded data to the user until the point set up in
-		 * CreateDecodingContext. The restart_lsn is far enough back that we'll
-		 * see the beginning of any xact we're expected to return to the client
-		 * so we can assemble a complete reorder buffer for it.
+		 * Reading and decoding of WAL must start at restart_lsn so that the
+		 * entirety of each of the xacts that commit after confimed_lsn can be
+		 * accumulated into reorder buffers.
 		 */
 		startptr = MyReplicationSlot->data.restart_lsn;
 
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index efcce5f..50c8a33 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -63,10 +63,24 @@ typedef struct ReplicationSlotPersistentData
 	 */
 	TransactionId catalog_xmin;
 
-	/* oldest LSN that might be required by this replication slot */
+	/*
+	 * oldest LSN that might be required by this replication slot.
+	 *
+	 * Points to the LSN of the most recent xl_running_xacts record where
+	 * no transaction that commits after confirmed_flush is already in
+	 * progress. At this point all xacts committing after confirmed_flush
+	 * can be read entirely into reorder buffers and all visibility
+	 * information can be reconstructed.
+	 */
 	XLogRecPtr	restart_lsn;
 
-	/* oldest LSN that the client has acked receipt for */
+	/*
+	 * oldest LSN that the client has acked receipt for
+	 *
+	 * Decoding will start calling output plugin callbacks for commits
+	 * after this LSN unless a different start point is specified by
+	 * the client.
+	 */
 	XLogRecPtr	confirmed_flush;
 
 	/* plugin name */
-- 
2.1.0
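
For readers skimming the patch, the distinction the corrected comments draw
between confirmed_flush and restart_lsn can be seen from plain SQL with the
contrib test_decoding plugin (a minimal sketch, assuming wal_level = logical
and a spare replication slot; all names are made up):

    -- create a throwaway slot using the contrib test_decoding plugin
    SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding');

    CREATE TABLE demo(id int);
    INSERT INTO demo VALUES (1);

    -- consuming the stream advances confirmed_flush past this commit ...
    SELECT data FROM pg_logical_slot_get_changes('demo_slot', NULL, NULL);

    -- ... so a second call returns no rows, even though restart_lsn may
    -- still point much further back in WAL (decoding restarts from there).
    SELECT data FROM pg_logical_slot_get_changes('demo_slot', NULL, NULL);

    SELECT slot_name, restart_lsn, confirmed_flush_lsn
    FROM pg_replication_slots WHERE slot_name = 'demo_slot';

    SELECT pg_drop_replication_slot('demo_slot');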




Re: [HACKERS] Timeline following for logical slots

2016-04-06 Thread Craig Ringer
I tried to address Andres's entirely valid complaints about that comment I
added in this patch.

I was going to add a description to restart_lsn in slot.h instead, but it's
proving harder to accurately describe restart_lsn than I'd like.

It's a point at which no transactions that commit after confirmed_lsn are
yet in progress. Easy enough.

But it's also a point at which visibility information can be reconstructed.
It's updated lazily by LogicalIncreaseRestartDecodingForSlot when the
client sends confirmation. Looking at SnapBuildProcessRunningXacts that can
be the most recent serialized snapshot from the snapshot builder (if
there's no xact running at confirmed_lsn) or the restart_decoding_lsn of
the oldest xact in progress at the confirmed_lsn. If I've read it correctly
the xact's restart_decoding_lsn is the ReorderBuffer's restart_decoding_lsn
at the time the xact started, which is in turn the last serialized snapshot
at that time.


So I'd describe restart_lsn as something like:

"
The oldest LSN that might be required by this replication slot.

Points to the LSN of the most recent xl_running_xacts record where no
transaction that commits after confirmed_flush is already in progress. At
this point all xacts committing after confirmed_flush can be read entirely
into reorder buffers and all visibility information can be reconstructed.
"

Is that accurate?

I want to get rid of the incorrect comment and fix this up.
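
For what it's worth, the gap being described here is easy to observe on a
running 9.6 system (a sketch; pg_xlog_location_diff was renamed
pg_wal_lsn_diff in later releases):

    SELECT slot_name,
           restart_lsn,
           confirmed_flush_lsn,
           pg_xlog_location_diff(confirmed_flush_lsn, restart_lsn) AS reorder_window_bytes
    FROM pg_replication_slots
    WHERE slot_type = 'logical';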


Re: [HACKERS] Timeline following for logical slots

2016-04-05 Thread Petr Jelinek

On 06/04/16 03:29, Robert Haas wrote:



I don't understand why it seems to be considered OK for logical slots to
just vanish on failover. The only other things I can think of where that's
considered OK are unlogged tables (because that's the point and we have
failover-safe ones too) and the old hash indexes nobody's quite willing to
remove yet.


First, it wasn't until 9.3 that physical standbys could follow
timeline switches, but that doesn't mean that streaming replication
was useless in 9.0 - 9.2, or that warm standby was useless in earlier
versions.  Logical decoding isn't useless without that capability
either.  Would it be nice if we did have that capability?  Of course.



Except that in 9.0 - 9.2 there was a working workaround for that; there is 
no such thing for logical decoding.



Second, I'm not sure whether it was a good design decision to make
logical slots a special kind of object that sit off to the side,
neither configuration (like postgresql.conf) nor WAL-protected data
(like pg_clog and the data files themselves), but it was certainly a
very deliberate decision.  I sort of expected them to be WAL-logged,
but Andres argued (not unconvincingly) that we'd want to have slots on
standbys, and making them WAL-logged would preclude that.


I do think it was a good design decision. We just need to make them 
failoverable a bit differently, and the failover slots patch IMHO isn't the 
right way either, as I said in another reply in this thread.





Review and test responses have been pretty underwhelming for pglogical, and
quite a bit seem to have boiled down to "this should live as an extension,
we don't need it in core". It often feels like we can't win: if we seek to
get it into core we're told it's not wanted/needed, but if we try to focus
on solving issues in core to make it work better and let it live as an
extension we're told we shouldn't bother until it's in core.


To be honest, I was shocked that pglogical and pglogical_output didn't
go into this release.  I assumed that you and other folks at
2ndQuadrant were going to make a big push to get that done.  I did
take a brief look at one of them - pglogical, I think - a week or two
ago but there were unaddressed review comments that had been pending
for months and there were a lot of fairly obvious things that needed
to be done before it could be seriously considered as a core
submission.  Like, for example, rewriting the documentation heavily
and making it look like the rest of our docs, and putting it in SGML
format.  The code seemed to need quite a bit of cleanup, too.  Now,
logical replication is a sufficiently important feature that if the
only way it's going to get into core is if I work on it myself, or get
other people at EnterpriseDB to do so, then I'll try to make that
happen.  But I was assuming that that was your/2ndQuadrant's patch,
that you were going to get it in shape, and that me poking my nose
into it wasn't going to be particularly welcome.  Maybe I've misread
the whole dynamic here.


I guess you did; I, and I think Craig as well, hoped for some feedback on 
the general ideas in terms of protocol, node setup (I mean catalogs) and 
general architecture from the community. That didn't really happen. And 
without any of that happening I didn't feel confident trying to get it 
right within the last month of the dev cycle, especially given the size of 
the patch and the fact that we also had other patches we worked on which had 
a realistically higher chance of getting in. Not sure how Craig feels 
about it. Converting documentation, renaming some params in function 
names, etc. (those unaddressed comments) seemed secondary to me.


(As a side note, I was also 2 weeks without a proper working laptop around 
FOSDEM time, which had an effect on my responses to -hackers about the 
topic, especially to Steve Singer who did a good job of reviewing the 
usability at the time; but even if I had had it, it would not have saved the patch.)


In general I think a project of this size requires more attention from a 
committer to help shepherd it, and neither Craig nor I is one. I am 
glad that Andres said he plans to give some time in the next cycle to 
logical replication, because that should be a big help.




That being said, if we get a logical replication system into core that
doesn't do DDL, doesn't do multi-master, doesn't know squat about
sequences, and rolls over and dies if a timeline switch happens, I
would consider that a huge step forward and I think a lot of other
people would, too.


I agree with the exception of working HA. I would consider it very sad 
if we got logical replication in core without having any provision for 
continuity of service. Doing that is relatively trivial in comparison to 
the logical replication itself however.


--
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Timeline following for logical slots

2016-04-05 Thread Craig Ringer
On 6 April 2016 at 09:29, Robert Haas  wrote:


> > Right now if you're doing any kind of logical decoding from a master server
> > that fails over to a standby the client just dies. The slot vanishes. You're
> > stuffed. Gee, I hope you didn't need all that nice consistent ordering,
> > because you're going to have to start from scratch and somehow reconcile the
> > data in the new master with what you've already received ... and haven't.
>
> Right, but right now you probably *aren't* doing any kind of
> logical decoding from a master server to a standby, because there's
> squat in the core distribution that could make use of that capability.
> So you never get as far as discovering this problem.
>

OK, I'm starting to understand where this is going.

pglogical and pglogical_output are irrelevant so long as they haven't yet
landed in core. There is no logical replication worth considering,
therefore anything that improves logical replication is not worth the
effort because there's really nothing to improve. Why bother with
enhancements like timeline following when nothing uses it.

Right?

That's pretty much what I'm hearing from a number of others too. Obviously
I don't wholly agree, but I do see where you're coming from. Especially
since pglogical is in a sort of twilight zone where it's neither a public,
open project with a proper website, public SCM, etc that encourages
external contribution, nor is it strictly and only a submission to core.

It's not like, say, PostGIS, in that it's always been aimed at core (or
stated to be) and is something that can be realistically considered for
core. It doesn't have much of a life as an independent extension, and at
least from the "outside" it doesn't look like there's much energy going
into giving it one, right? Yet it's also had limited time and effort put
into getting it in core in this release cycle.

While I don't like that, I have to say I understand it.


> First, it wasn't until 9.3 that physical standbys could follow
> timeline switches, but that doesn't mean that streaming replication
> was useless in 9.0 - 9.2


I don't think there's much of a parallel there. It's closer to being like
having _cascading_ replication without the ability to follow timeline
switches via WAL archive replay or streaming.


> Second, I'm not sure whether it was a good design decision to make
> logical slots a special kind of object that sit off to the side,
> neither configuration (like postgresql.conf) nor WAL-protected data
> (like pg_clog and the data files themselves), but it was certainly a
> very deliberate decision.  I sort of expected them to be WAL-logged,
> but Andres argued (not unconvincingly) that we'd want to have slots on
> standbys, and making them WAL-logged would preclude that.


Yeah. I understand the reasons for that decision. Per an earlier reply I
think we can avoid making them WAL-logged so they can be used on standbys
and still achieve usable failover support on physical replicas.


> It's more like unlogged tables, where a deliberate
> design decision to lose data was made in order to meet some other
> goal.
>

Yeah ... but it's like unlogged tables without having the option of logged
tables.

Nice while it lasts but you're in bad trouble when it goes wrong.

(The parallel isn't perfect of course, since unlogged tables aren't crash
safe whereas slots are; they just aren't replicated.)


> If you retype a column from text to integer,
> you probably aren't storing anything in it other than integers, in
> which case it is not necessarily the case that you are locked into
> applying that change at a particular point in the change stream.


... if you're replicating in text format, not send/recv format, yeah. You
get away with it in that particular case. But that *only* works where
you're using text format and the types are convertable via text
representation.

It's also just one example. Adding a NOT NULL column bites you, hard, for
example.

Asking users to do this themselves is incredibly fragile and requires much
more knowledge of PostgreSQL innards than I think is realistic. It's part
of why Slony-I is difficult too.

Sure, you can avoid having DDL replication if you're really, really careful
and know exactly what you're doing. I don't think it's a pre-req for
getting logical replication into core, but I do think it's a pretty strong
requirement for taking the use of logical replication for HA and failover
seriously for general use.

> To be honest, I was shocked that pglogical and pglogical_output didn't
> go into this release.  I assumed that you and other folks at
> 2ndQuadrant were going to make a big push to get that done.  I did
> take a brief look at one of them - pglogical, I think - a week or two
> ago but there were unaddressed review comments that had been pending
> for months and there were a lot of fairly obvious things that needed
> to be done before it could be seriously considered as a core
> submission. 

Re: [HACKERS] Timeline following for logical slots

2016-04-05 Thread Robert Haas
On Tue, Apr 5, 2016 at 3:51 AM, Craig Ringer  wrote:
> First, I'd like to remind everyone that logical decoding is useful for more
> than replication. You can consume the change stream for audit
> logging/archival, to feed into a different DBMS, etc etc. This is not just
> about replicating from one PostgreSQL to another, though to be sure that'll
> be the main use in the near term.
>
> The Zalando guys at least are already using it for other things, and
> interest in the json support suggests they're not alone.

I have not forgotten any of that, nor do I consider it unimportant.

> Right now if you're doing any kind of logical decoding from a master server
> that fails over to a standby the client just dies. The slot vanishes. You're
> stuffed. Gee, I hope you didn't need all that nice consistent ordering,
> because you're going to have to start from scratch and somehow reconcile the
> data in the new master with what you've already received ... and haven't.

Right, but right now you probably *aren't* doing any kind of
logical decoding from a master server to a standby, because there's
squat in the core distribution that could make use of that capability.
So you never get as far as discovering this problem.

> I don't understand why it seems to be considered OK for logical slots to
> just vanish on failover. The only other things I can think of where that's
> considered OK are unlogged tables (because that's the point and we have
> failover-safe ones too) and the old hash indexes nobody's quite willing to
> remove yet.

First, it wasn't until 9.3 that physical standbys could follow
timeline switches, but that doesn't mean that streaming replication
was useless in 9.0 - 9.2, or that warm standby was useless in earlier
versions.  Logical decoding isn't useless without that capability
either.  Would it be nice if we did have that capability?  Of course.

Second, I'm not sure whether it was a good design decision to make
logical slots a special kind of object that sit off to the side,
neither configuration (like postgresql.conf) nor WAL-protected data
(like pg_clog and the data files themselves), but it was certainly a
very deliberate decision.  I sort of expected them to be WAL-logged,
but Andres argued (not unconvincingly) that we'd want to have slots on
standbys, and making them WAL-logged would preclude that.  So I don't
really think that this is much like hash indexes, which just never got
properly finished.  It's more like unlogged tables, where a deliberate
design decision to lose data was made in order to meet some other
goal.

>> realistically, there are a lot of people who simply don't change their
>> schema all that often, and who could (and might even prefer to) manage
>> that process in other ways - e.g. change nodes one by one while they
>> are off-line, then bring them on-line.
>
> Yeah, it's a bit more complex than that. Schema changes *must* be made at a
> specific point in replay. You can't generally just ALTER TABLE ... DROP
> COLUMN on the master then do the same thing on the replica. The replica
> probably still has un-replayed changes from the master that have the
> now-dropped column in their change stream, but now it can't apply them to
> the new table structure on the downstream. This particular case can be
> worked around, but column type changes, addition of non-null columns etc
> cannot.

There are certainly problem cases, but I'm not sure they really arise
that much in practice.  If you retype a column from text to integer,
you probably aren't storing anything in it other than integers, in
which case it is not necessarily the case that you are locked into
applying that change at a particular point in the change stream.  If
you are storing non-integers in a text column and relying on a USING
clause to make them look like integers during the conversion, then,
yes, that has to be done at a precise point in the change stream.  But
that's a pretty strange thing to do, and your application is most
likely going to get confused anyway, so you are probably taking a
maintenance window for the changeover anyway - in which case, there's
not really a big problem.  You can run the same change at the same
time on both servers while nothing else is happening.
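
For instance, the benign case described here amounts to running the same
statement on both nodes during a quiet moment (an illustrative sketch; the
table and column names are made up):

    ALTER TABLE orders ALTER COLUMN qty TYPE integer USING qty::integer;

As long as qty only ever held integer-looking text, older text-format changes
still apply cleanly to the new integer column, which is why the exact position
in the change stream doesn't matter in this particular case.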

> Also, a growing portion of the world uses generated schemas and migrations
> and can't easily pipe that through some arbitrary function we provide. They
> may not be too keen on stopping writes to their entire database as an
> alternative either (and we don't provide a true read-only mode for them to
> use if they did want to). I know it's fashionable around here to just write
> them off as idiots for using ORMs and so on, but they're rather widespread
> idiots who're just as likely to be interested in geographically distributed
> selective replication, change stream extraction, etc. This has been a
> persistent problem with people who want to use BDR, too.

It's not my intent to write anyone off as an idiot.


Re: [HACKERS] Timeline following for logical slots

2016-04-05 Thread Craig Ringer
On 5 April 2016 at 14:09, Andres Freund  wrote:

> On 2016-04-05 05:53:53 +0200, Petr Jelinek wrote:
> > On 04/04/16 17:15, Andres Freund wrote:
> > >
> > >>* Robust sequence decoding and replication. If you were following the later
> > >>parts of that discussion you will've seen how fun that's going to be, but
> > >>it's the simplest of all of the problems.
> > >
> > >Unconvinced. People used londiste and slony for years without that, and
> > >it's not even remotely at the top of the list of problems with either.
> > >
> >
> > Londiste and Slony also support physical failover unlike logical decoding
> > which is the main point of this discussion, let's not forget that.
>
> Sure. But that level of failover isn't all that hard to implement.
>
>
Well, they get it mostly for free: because they work at the SQL level and
we have working physical failover, they don't really notice the change.

Just like I've been trying to make possible for logical decoding and
logical replication.

> > If we got failover slots into 9.6 it would be better but that does not look
> > realistic at this point. I don't think that current design for failover slots
> > is best possible - I think failover slots should be created on replica and
> > send their status up to the master which would then take them into account
> > when calculating oldest needed catalog xmin and lsn (simple way of doing that
> > would be to add this to feedback protocol and let physical slot to keep the
> > xmin/lsn as well)
>
> Yes, that's not too far away from what I'm thinking of. If we do it
> right that also solves the important problems for decoding on a standby.
>

I'm pretty happy with this approach too.

It puts a bit more burden on the client, but I think that can be solved
separately and later with a helper running on the replica.

I've never liked how failover slots can't work with a cascading replica
when we get support for that. By contrast, this approach would actually
help solve one of the problems needed to get replay from a replica working.

It's a pity it's a no-hoper for 9.6 though.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Timeline following for logical slots

2016-04-05 Thread Oleksii Kliukin

> On 05 Apr 2016, at 09:51, Craig Ringer  wrote:
> 
> On 5 April 2016 at 04:00, Robert Haas  wrote:
> 
> In general, I think we'd be a lot better off if we got some kind of
> logical replication into core first and then worked on lifting these
> types of limitations afterwards.
> 
> First, I'd like to remind everyone that logical decoding is useful for more 
> than replication. You can consume the change stream for audit 
> logging/archival, to feed into a different DBMS, etc etc. This is not just 
> about replicating from one PostgreSQL to another, though to be sure that'll 
> be the main use in the near term.
> 
> The Zalando guys at least are already using it for other things, and interest 
> in the json support suggests they're not alone.

We are actually interested in both the streaming part and the logical 
replication provided at the moment by BDR. The reason we cannot use BDR 
efficiently is that there is no way to provide HA for one of the BDR nodes 
using physical replication, meaning that we have to run twice the nodes, with 
every node communicating with every other. Since the only use case we have is 
to run things in multiple data-centers with latencies of 100ms and above, 
running without physical HA limits us to only a couple of nodes, with a 
manual repair mechanism.

The other use case we have is streaming data from many, sometimes rather big 
(TBs) databases to consumers interested in storing subsets of that data in 
order to run analytical queries on them. It’s hard to imagine a robust system 
like this being built around a feature that doesn't support failover between 
the master and a physical replica, forcing us to stream a set of selected 
tables again at best, and the complete database at worst (did I mention the 
terabytes already?), and, potentially, to do some very funny tricks to merge 
the newly streamed data with what is already there. 


> 
> Right now if you're doing any kind of logical decoding from a master server 
> that fails over to a standby the client just dies. The slot vanishes. You're 
> stuffed. Gee, I hope you didn't need all that nice consistent ordering, 
> because you're going to have to start from scratch and somehow reconcile the 
> data in the new master with what you've already received ... and haven’t.

+1

> 
> We could certainly require clients to jump through all sorts of extra hoops 
> to make sure they can follow replay over physical failover. Or we could say 
> that you shouldn't actually expect to use logical decoding in real world 
> environments where HA is a necessity until we get around to supporting 
> realistic, usable logical-rep based failover in a few years.

And in less time than a few years, your typical big organization might decide 
to move on to some other data storage solution that promises HA right now, at 
the expense of SQL, strong consistency, and transactions. It would be a bad 
choice, but one that developers (especially those looking to build 
“scalable micro services” with only a couple of CRUD queries) will be willing 
to make.


> Or we could make it "just work" for the physical failover systems everyone 
> already uses and relies on, just like sequences, indexes, and everything else 
> in PostgreSQL that's expected to survive failover.
> 

--
Oleksii



Re: [HACKERS] Timeline following for logical slots

2016-04-05 Thread Andres Freund
On 2016-04-05 15:51:00 +0800, Craig Ringer wrote:
> Review and test responses have been pretty underwhelming for pglogical, and
> quite a bit seem to have boiled down to "this should live as an extension,
> we don't need it in core". It often feels like we can't win: if we seek to
> get it into core we're told it's not wanted/needed, but if we try to focus
> on solving issues in core to make it work better and let it live as an
> extension we're told we shouldn't bother until it's in core.

I think partially that's because it's hard to see the goal from those
threads. Leading the intro email with "after applying use these three
steps to replicate a database" or something might help.

I also want to add that so far, to my knowledge, the feedback hasn't
fully been addressed. It's a bit hard to see progress at that pace.


> Do you want to get a logical replication system into core that doesn't work
> properly with lots of the other features in PostgreSQL? That's historically
> not how we've done things here, and sometimes massive amounts of work have
> been required to make new feature X work with obscure/awkward existing
> feature Y.

I think that's a strawman. We have done actual iterative development
where the basic feature came at an early stage a lot of times. Most
impressively FDWs.

And even if we decide that feature X has to be supported, having an
otherwise close-to-committable patch series, goes a *LONG* way to
motivate people.


> Still, I don't really want to block work on making logical decoding more
> real-world usable on inclusion of a logical replication system for
> PostgreSQL, especially one that'll be lucky to get in for 9.7 at the
> earliest.

My impression is that unless you *NOW* press very hard to get it into
core, there's no way to get it into 9.7. Unless you start aggressively
at some point, it'll never get in.

Greetings,

Andres Freund




Re: [HACKERS] Timeline following for logical slots

2016-04-05 Thread Craig Ringer
On 5 April 2016 at 14:18, Peter Geoghegan  wrote:


>
> I rather agree that an in-core system that solved some of the basic
> problems would be a huge step forward, and would motivate people to
> work on the harder problems. It's surprising to me that we don't seem
> to be much closer to that than we were when 9.4 was released.
>
>
I think we are, actually. For one thing a lot of groundwork has gone in,
even if it doesn't seem especially highly visible.

Also, pglogical_output might've been able to get into 9.6 if we'd got it
submittable a bit earlier, if it'd had useful review in the first few weeks
(months?), and if I'd had more time to react more responsively to review
comments by the time it did get review.

Even there, there are a bunch of places (like the relation
cache/invalidations) that should really be moved "up" into logical decoding
itself. Which, while architecturally superior, means yet more groundwork
patches to write and get committed before a new revision of
pglogical_output can go into 9.7.

I thought that getting pglogical_output into 9.6 might help us move closer
to that goal, but folks seemed to think it was useless without the
downstream client. I disagree, both on the principle of iterative
development and because it's plenty useful for other things by itself,
but didn't make any progress.

I don't think it helps that there's a real community split between people
who want to see the actual logical rep engine live out-of-tree as an
extension and those who think it's not interesting until it's in core.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Timeline following for logical slots

2016-04-05 Thread Craig Ringer
On 5 April 2016 at 04:00, Robert Haas  wrote:

>
> In general, I think we'd be a lot better off if we got some kind of
> logical replication into core first and then worked on lifting these
> types of limitations afterwards.


First, I'd like to remind everyone that logical decoding is useful for more
than replication. You can consume the change stream for audit
logging/archival, to feed into a different DBMS, etc etc. This is not just
about replicating from one PostgreSQL to another, though to be sure that'll
be the main use in the near term.

The Zalando guys at least are already using it for other things, and
interest in the json support suggests they're not alone.

Right now if you're doing any kind of logical decoding from a master
server that fails over to a standby the client just dies. The slot
vanishes. You're stuffed. Gee, I hope you didn't need all that nice
consistent ordering, because you're going to have to start from scratch and
somehow reconcile the data in the new master with what you've already
received ... and haven't.

We could certainly require clients to jump through all sorts of extra hoops
to make sure they can follow replay over physical failover. Or we could say
that you shouldn't actually expect to use logical decoding in real world
environments where HA is a necessity until we get around to supporting
realistic, usable logical-rep based failover in a few years. Or we could
make it "just work" for the physical failover systems everyone already uses
and relies on, just like sequences, indexes, and everything else in
PostgreSQL that's expected to survive failover.

Would you tell someone to use unlogged tables and then do failover with
Londiste? 'cos that's not too far from what's being talked about here.
Though frankly, Londiste is more capable than the replication currently
delivered with pglogical anyway, and it can follow physical failover too.

I don't understand why it seems to be considered OK for logical slots to
just vanish on failover. The only other things I can think of where that's
considered OK are unlogged tables (because that's the point and we have
failover-safe ones too) and the old hash indexes nobody's quite willing to
remove yet.

> If I had to pick an order in which
> to do the things you list, I'd focus first on the one you list second:
> being able to stream and begin applying transactions before they've
> committed is a really big deal for large transactions, and lots of
> people have some large transactions.


I guess. We can manage DDL externally, however ugly that may be, but we
can't do much about big xacts, and all but the very purest OLTP systems do
some batch work sometimes.

It's still 9.7 material at very, very best, and things like parallel apply
then stack on top of it.


> DDL replication is nice, but
> realistically, there are a lot of people who simply don't change their
> schema all that often, and who could (and might even prefer to) manage
> that process in other ways - e.g. change nodes one by one while they
> are off-line, then bring them on-line.
>

Yeah, it's a bit more complex than that. Schema changes *must* be made at a
specific point in replay. You can't generally just ALTER TABLE ... DROP
COLUMN on the master then do the same thing on the replica. The replica
probably still has un-replayed changes from the master that have the
now-dropped column in their change stream, but now it can't apply them to
the new table structure on the downstream. This particular case can be
worked around, but column type changes, addition of non-null columns etc
cannot.

You can only do DDL safely by either:

* Freezing all writes to replicated tables on the master and waiting until
all logical slots are replayed up to date, then applying the DDL on each
node; or

* Passing the desired DDL to a function that inserts it into a special
replicated table on the master then executes it on the master. The replica
monitors the replicated table for inserts and executes the same DDL on the
replica as it replays the change stream. The insert serves as a barrier
between old and new table structures.

pglogical supports either approach, but both have major foot-guns attached.
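
For concreteness, the second approach boils down to something like the
following sketch; the queue table and function are hypothetical, not an
existing pglogical API:

    CREATE TABLE ddl_queue (
        id   bigserial PRIMARY KEY,
        ddl  text NOT NULL
    );

    CREATE OR REPLACE FUNCTION queue_and_execute_ddl(cmd text) RETURNS void
    LANGUAGE plpgsql AS $$
    BEGIN
        -- The INSERT is decoded in commit order, so the downstream sees it
        -- at exactly the point in the change stream where the DDL took
        -- effect here.
        INSERT INTO ddl_queue (ddl) VALUES (cmd);
        EXECUTE cmd;
    END;
    $$;

    -- On the master; the apply worker on the replica executes the queued DDL
    -- when it replays this insert, before applying any later row changes.
    SELECT queue_and_execute_ddl('ALTER TABLE demo ADD COLUMN note text');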

Also, a growing portion of the world uses generated schemas and migrations
and can't easily pipe that through some arbitrary function we provide. They
may not be too keen on stopping writes to their entire database as an
alternative either (and we don't provide a true read-only mode for them to
use if they did want to). I know it's fashionable around here to just write
them off as idiots for using ORMs and so on, but they're rather widespread
idiots who're just as likely to be interested in geographically distributed
selective replication, change stream extraction, etc. This has been a
persistent problem with people who want to use BDR, too.

There are workarounds available, though, and we can't fix everything at
once. So despite the problems I agree that DDL replication 

Re: [HACKERS] Timeline following for logical slots

2016-04-05 Thread Peter Geoghegan
On Mon, Apr 4, 2016 at 1:00 PM, Robert Haas  wrote:
> I don't necessarily disagree with your statement that we'd need all of
> this stuff to make logical replication a substitute for physical
> replication as far as failover is concerned.  But I don't think that's
> the most important goal, and even to the extent that it is the goal, I
> don't think we need to meet every need before we can consider
> ourselves to have met some needs.  I don't think that we need every
> physical replication feature plus some before logical replication can
> start to be useful to PostgreSQL users generally.  We do, however,
> need the functionality to be accessible to people who are using only
> the PostgreSQL core distribution.  The thing that is going to get
> people excited about making logical replication better is getting to a
> point where they can use it at all - and that is not going to be true
> as long as you can't use it without having to download something from
> an external website.

I rather agree that an in-core system that solved some of the basic
problems would be a huge step forward, and would motivate people to
work on the harder problems. It's surprising to me that we don't seem
to be much closer to that than we were when 9.4 was released.

-- 
Peter Geoghegan




Re: [HACKERS] Timeline following for logical slots

2016-04-05 Thread Andres Freund
On 2016-04-05 05:53:53 +0200, Petr Jelinek wrote:
> On 04/04/16 17:15, Andres Freund wrote:
> >
> >>* Robust sequence decoding and replication. If you were following the later
> >>parts of that discussion you will've seen how fun that's going to be, but
> >>it's the simplest of all of the problems.
> >
> >Unconvinced. People used londiste and slony for years without that, and
> >it's not even remotely at the top of the list of problems with either.
> >
> 
> Londiste and Slony also support physical failover unlike logical decoding
> which is the main point of this discussion, let's not forget that.

Sure. But that level of failover isn't all that hard to implement.

I just want to reiterate: I'm not against failover slots per-se, or
against timeline following[1], or something like that. I think it's just
getting the priorities backwards. Until there's basic features
available, you're going to have a hard time fighting for more advanced
features.

[1]: That said, I don't agree with the approach chosen so far, but
that's just a matter of discussion.


> >
> >>* Robust, seamless DDL replication, so things don't just break randomly.
> >>This makes the other points above look nice and simple by comparison.
> >
> >We're talking about a system which involves logical decoding. Whether
> >you have failover via physical rep or not, doesn't have anything to do
> >with this point.

> It is if you are insisting on using logical rep as a solution for failover.

How on earth does this follow? If your replicas don't do DDL propagation
today, then doing so between the primary and the new primary surely isn't a
prerequisite either.  I agree DDL rep is pretty crucial, but that isn't the
argument here.


> I also don't buy your argument that it's unsafe to use timeline following on
> logical decoding on replica.

I'm not saying that generally. I'm saying that you can't realistically
do it safely with just the timeline following patch, as committed.


> You can always keep master from moving too far
> ahead by other means (even if you just use dummy slot which is only used for
> this purpose, yes ugly I know).

Yes, that somewhat works. And I think this is actually kinda the design
"failover" slots should take instead.


> If we got failover slots into 9.6 it would
> be better but that does not look realistic at this point. I don't think that
> current design for failover slots is best possible - I think failover slots
> should be created on replica and send their status up to the master which
> would then take them into account when calculating oldest needed catalog
> xmin and lsn (simple way of doing that would be to add this to feedback
> protocol and let physical slot to keep the xmin/lsn as well)

Yes, that's not too far away from what I'm thinking of. If we do it
right that also solves the important problems for decoding on a standby.


> but that does not mean timeline following isn't a good thing on its own
> (not to mention that iterative development is a thing).

Yes, I'm not saying that. The reason it scares me is that it's not a
particularly carefully reviewed patch, combined with the intended usage as
described by Craig.

We most definitely need timeline following, independent of failover
slots. Most importantly for decoding on a standby.

Greetings,

Andres Freund




Re: [HACKERS] Timeline following for logical slots

2016-04-04 Thread Petr Jelinek

On 04/04/16 17:15, Andres Freund wrote:



* Robust sequence decoding and replication. If you were following the later
parts of that discussion you will've seen how fun that's going to be, but
it's the simplest of all of the problems.


Unconvinced. People used londiste and slony for years without that, and
it's not even remotely at the top of the list of problems with either.



Londiste and Slony also support physical failover, unlike logical 
decoding, which is the main point of this discussion; let's not forget that.





* Robust, seamless DDL replication, so things don't just break randomly.
This makes the other points above look nice and simple by comparison.


We're talking about a system which involves logical decoding. Whether
you have failover via physical rep or not, doesn't have anything to do
with this point.



It is if you are insisting on using logical rep as a solution for failover.




I don't think there's any realistic way we're going to get there for
logical rep in 9.6+n for n<2 unless a whole lot more people get on board
and work on it. Even then.


I think the primary problem here is that you're focusing on things that
just are not very interesting for the majority of users, and which thus
won't get very enthusiastic help.  The way to make progress is to get
something basic in, and then iterate from there.  Instead you're
focussing on the fringes; which nobody cares about, because the basics
aren't there.


Craig is focusing on solving failover for logical slots, which is a very 
damn basic issue with logical decoding right now, no matter whether we have 
logical replication in core or not. I personally don't think it's OK by 
any means not to be able to continue logical decoding across a failover event 
two versions after logical decoding was introduced.


I also don't buy your argument that it's unsafe to use timeline 
following for logical decoding on a replica. You can always keep the master 
from moving too far ahead by other means (even if you just use a dummy 
slot which exists only for this purpose - yes, ugly, I know). If we got 
failover slots into 9.6 it would be better, but that does not look 
realistic at this point. I don't think the current design for failover 
slots is the best possible - I think failover slots should be created on 
the replica and send their status up to the master, which would then take 
them into account when calculating the oldest needed catalog xmin and LSN 
(a simple way of doing that would be to add this to the feedback protocol 
and let the physical slot keep the xmin/LSN as well) - but that does not 
mean timeline following isn't a good thing on its own (not to mention that 
iterative development is a thing).
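
For concreteness, the "dummy slot" workaround above is nothing more than this
on the master (a sketch; the slot name is invented, and the slot must
eventually be advanced or dropped, or the master will retain WAL and catalog
rows indefinitely):

    -- pins catalog_xmin and restart_lsn on the master on behalf of the replica
    SELECT pg_create_logical_replication_slot('standby_guard', 'test_decoding');

    -- when no longer needed:
    SELECT pg_drop_replication_slot('standby_guard');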


--
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Timeline following for logical slots

2016-04-04 Thread Petr Jelinek

On 04/04/16 22:00, Robert Haas wrote:

On Mon, Apr 4, 2016 at 10:59 AM, Craig Ringer  wrote:

To allow logical rep and failover to be a reasonable substitute for physical
rep and failover IMO *need*:

* Robust sequence decoding and replication. If you were following the later
parts of that discussion you will've seen how fun that's going to be, but
it's the simplest of all of the problems.

* Logical decoding and sending of in-progress xacts, so the logical client
can already be most of the way through receiving a big xact when it commits.
Without this we have a huge lag spike whenever a big xact happens, since we
must first finish decoding it in to a reorder buffer and can only then
*begin* to send it to the client. During which time no later xacts may be
decoded or replayed to the client. If you're running that rare thing, the
perfect pure OLTP system, you won't care... but good luck finding one in the
real world.

* Either parallel apply on the client side or at least buffering of
in-progress xacts on the client side so they can be safely flushed to disk
and confirmed, allowing receive to continue while replay is done on the
client. Otherwise sync rep is completely impractical... and there's no
shortage of systems out there that can't afford to lose any transactions. Or
at least have some crucial transactions they can't lose.

* Robust, seamless DDL replication, so things don't just break randomly.
This makes the other points above look nice and simple by comparison.
Logical decoding of 2PC xacts with DDL would help here, as would the ability
to transparently convert an xact into a prepare-xact on client commit and
hold the client waiting while we replicate it, confirm the successful
prepare on the replica, then commit prepared on the upstream.

* oh, and some way to handle replication of shared catalog changes like
pg_authid, so the above DDL replication doesn't just randomly break if it
happens to refer to a global object that doesn't exist on the downstream.


In general, I think we'd be a lot better off if we got some kind of
logical replication into core first and then worked on lifting these
types of limitations afterwards.  If I had to pick an order in which
to do the things you list, I'd focus first on the one you list second:
being able to stream and begin applying transactions before they've
committed is a really big deal for large transactions, and lots of
people have some large transactions.  DDL replication is nice, but
realistically, there are a lot of people who simply don't change their
schema all that often, and who could (and might even prefer to) manage
that process in other ways - e.g. change nodes one by one while they
are off-line, then bring them on-line.

I don't necessarily disagree with your statement that we'd need all of
this stuff to make logical replication a substitute for physical
replication as far as failover is concerned.  But I don't think that's
the most important goal, and even to the extent that it is the goal, I
don't think we need to meet every need before we can consider
ourselves to have met some needs.  I don't think that we need every
physical replication feature plus some before logical replication can
start to be useful to PostgreSQL users generally.  We do, however,
need the functionality to be accessible to people who are using only
the PostgreSQL core distribution.  The thing that is going to get
people excited about making logical replication better is getting to a
point where they can use it at all - and that is not going to be true
as long as you can't use it without having to download something from
an external website.



I agree with you, and I think Craig does as well. IMHO he merely pointed 
this out to show that at the moment we need physical replication as an HA 
solution for logical decoding, because currently there is no way to have 
reasonable continuation of service when the original master dies. And 
waiting until we have logical replication that's on par with physical for 
failover will take a long time, as it needs the above.


--
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Timeline following for logical slots

2016-04-04 Thread Robert Haas
On Mon, Apr 4, 2016 at 10:59 AM, Craig Ringer  wrote:
> To allow logical rep and failover to be a reasonable substitute for physical
> rep and failover IMO *need*:
>
> * Robust sequence decoding and replication. If you were following the later
> parts of that discussion you will've seen how fun that's going to be, but
> it's the simplest of all of the problems.
>
> * Logical decoding and sending of in-progress xacts, so the logical client
> can already be most of the way through receiving a big xact when it commits.
> Without this we have a huge lag spike whenever a big xact happens, since we
> must first finish decoding it in to a reorder buffer and can only then
> *begin* to send it to the client. During which time no later xacts may be
> decoded or replayed to the client. If you're running that rare thing, the
> perfect pure OLTP system, you won't care... but good luck finding one in the
> real world.
>
> * Either parallel apply on the client side or at least buffering of
> in-progress xacts on the client side so they can be safely flushed to disk
> and confirmed, allowing receive to continue while replay is done on the
> client. Otherwise sync rep is completely impractical... and there's no
> shortage of systems out there that can't afford to lose any transactions. Or
> at least have some crucial transactions they can't lose.
>
> * Robust, seamless DDL replication, so things don't just break randomly.
> This makes the other points above look nice and simple by comparison.
> Logical decoding of 2PC xacts with DDL would help here, as would the ability
> to transparently convert an xact into a prepare-xact on client commit and
> hold the client waiting while we replicate it, confirm the successful
> prepare on the replica, then commit prepared on the upstream.
>
> * oh, and some way to handle replication of shared catalog changes like
> pg_authid, so the above DDL replication doesn't just randomly break if it
> happens to refer to a global object that doesn't exist on the downstream.

In general, I think we'd be a lot better off if we got some kind of
logical replication into core first and then worked on lifting these
types of limitations afterwards.  If I had to pick an order in which
to do the things you list, I'd focus first on the one you list second:
being able to stream and begin applying transactions before they've
committed is a really big deal for large transactions, and lots of
people have some large transactions.  DDL replication is nice, but
realistically, there are a lot of people who simply don't change their
schema all that often, and who could (and might even prefer to) manage
that process in other ways - e.g. change nodes one by one while they
are off-line, then bring them on-line.

I don't necessarily disagree with your statement that we'd need all of
this stuff to make logical replication a substitute for physical
replication as far as failover is concerned.  But I don't think that's
the most important goal, and even to the extent that it is the goal, I
don't think we need to meet every need before we can consider
ourselves to have met some needs.  I don't think that we need every
physical replication feature plus some before logical replication can
start to be useful to PostgreSQL users generally.  We do, however,
need the functionality to be accessible to people who are using only
the PostgreSQL core distribution.  The thing that is going to get
people excited about making logical replication better is getting to a
point where they can use it at all - and that is not going to be true
as long as you can't use it without having to download something from
an external website.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Timeline following for logical slots

2016-04-04 Thread Andres Freund
On 2016-04-04 22:59:41 +0800, Craig Ringer wrote:
> Assuming that here you mean separate slots on different machines
> replicating via physical rep:

No,  I don't.

> We don't currently allow the creation of a logical slot on a standby. Nor
> replay from it, even to advance it without receiving the decoded
> changes.

Yes. I know.


> > I think the right way to do this is to focus on failover for logical
> > rep, with separate slots. The whole idea of integrating this physical
> > rep imo makes this a *lot* more complex than necessary. Not all that
> > many people are going to want to physical rep and logical rep.

> If you're saying we should focus on failover between nodes that're
> themselves connected using logical replication rather than physical
> replication, I really have to strongly disagree.
> 
> TL;DR for book-length below: We're a long, long way from being able to
> deliver even vaguely decent logical rep based failover.

I don't buy that.


> * Robust sequence decoding and replication. If you were following the later
> parts of that discussion you will've seen how fun that's going to be, but
> it's the simplest of all of the problems.

Unconvinced. People used londiste and slony for years without that, and
it's not even remotely at the top of the list of problems with either.


> * Logical decoding and sending of in-progress xacts, so the logical client
> can already be most of the way through receiving a big xact when it
> commits. Without this we have a huge lag spike whenever a big xact happens,
> since we must first finish decoding it in to a reorder buffer and can only
> then *begin* to send it to the client. During which time no later xacts may
> be decoded or replayed to the client. If you're running that rare thing,
> the perfect pure OLTP system, you won't care... but good luck finding one
> in the real world.

So? If you're using logical rep, you'll always have that.


> * Either parallel apply on the client side or at least buffering of
> in-progress xacts on the client side so they can be safely flushed to disk
> and confirmed, allowing receive to continue while replay is done on the
> client. Otherwise sync rep is completely impractical... and there's no
> shortage of systems out there that can't afford to lose any transactions.
> Or at least have some crucial transactions they can't lose.

That's more or less trivial to implement. In an extension.


> * Robust, seamless DDL replication, so things don't just break randomly.
> This makes the other points above look nice and simple by comparison.

We're talking about a system which involves logical decoding. Whether
you have failover via physical rep or not, doesn't have anything to do
with this point.


> Physical rep *works*. Robustly. Reliably. With decent performance. It's
> proven. It supports sync rep. I'm confident telling people to use it.

Sure? And nothing prevents you from using it.


> I don't think there's any realistic way we're going to get there for
> logical rep in 9.6+n for n<2 unless a whole lot more people get on board
> and work on it. Even then.

I think the primary problem here is that you're focusing on things that
just are not very interesting for the majority of users, and which thus
won't get very enthusiastic help.  The way to make progress is to get
something basic in, and then iterate from there.  Instead you're
focussing on the fringes; which nobody cares about, because the basics
aren't there.

FWIW, I plan to aggressively work on in-core (9.7) logical rep starting
in a few weeks. If we can coordinate on that end, I'm very happy, if not
then not.

> Right now we can deliver logical failover for DBs that:
> (b) don't do DDL, ever, or only do some limited DDL via direct admin
> commands where they can call some kind of helper function to queue and
> apply the DDL;
> (c) don't do big transactions or don't care about unbounded lag;
> (d) don't need synchronous replication or don't care about possibly large
> delays before commit is confirmed;
> (e) only manage role creation (among other things) via very strict
> processes that can be absolutely guaranteed to run on all nodes

And just about nothing of that has to do with any of the recent patches
sent towards logical rep. I agree about the problems, but you should
attack them, instead of building way too complicated workarounds which
barely anybody will find interesting.


> ... which in my view isn't a great many databases.

Meh^2. Slony and londiste have these problems. And the reason they're
not used isn't primarily those, but that they're external and waaayyy too
complicated to use.


> Physical rep has *none* of those problems. (Sure, it has others, but we're
> used to them).

So?


-Andres




Re: [HACKERS] Timeline following for logical slots

2016-04-04 Thread Craig Ringer
On 4 April 2016 at 18:01, Andres Freund  wrote:


> > The only way I can think of to do that really reliably right now, without
> > full failover slots, is to use the newly committed pluggable WAL
> mechanism
> > and add a hook to SaveSlotToPath() so slot info can be captured, injected
> > in WAL, and replayed on the replica.
>
> I personally think the primary answer is to use separate slots on
> different machines. Failover slots can be an extension to that at some
> point, but I think they're a secondary goal.
>

Assuming that here you mean separate slots on different machines
replicating via physical rep:

We don't currently allow the creation of a logical slot on a standby. Nor
replay from it, even to advance it without receiving the decoded changes.
Both would be required for that to work, as well as extensions to the hot
standby feedback mechanism to allow a standby to ask the master to pin its
catalog_xmin if slots on the standby were further behind than that of the
master.

I was chatting about that with Petr earlier. What we came up with was to
require the standby to connect to the master using a replication slot that,
while remaining a physical replication slot, has a catalog_xmin set and
updated by the replica using extended standby progress messages. The
catalog_xmin the replica pushes up to the master would simply be the
min(catalog_xmin) of all slots on the replica,
i.e. procArray->replication_slot_catalog_xmin. The same applies to the slot
xmin, if defined for any slot on the replica.
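
For illustration, a minimal standalone sketch of that min(catalog_xmin)
computation (the type and function names here are simplified stand-ins, not
the actual PostgreSQL code; real code would also use wraparound-aware xid
comparisons):

#include <stdint.h>
#include <stddef.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

typedef struct LocalSlot
{
    int             in_use;
    TransactionId   catalog_xmin;   /* InvalidTransactionId if not set */
} LocalSlot;

/*
 * Oldest catalog_xmin needed by any in-use slot on the standby; this is
 * what the standby would report upstream in its feedback message.
 */
static TransactionId
feedback_catalog_xmin(const LocalSlot *slots, size_t nslots)
{
    TransactionId result = InvalidTransactionId;
    size_t      i;

    for (i = 0; i < nslots; i++)
    {
        if (!slots[i].in_use || slots[i].catalog_xmin == InvalidTransactionId)
            continue;
        if (result == InvalidTransactionId || slots[i].catalog_xmin < result)
            result = slots[i].catalog_xmin;
    }
    return result;
}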

That makes sure that the catalog_xmin required for the standby's slots is
preserved even if the standby isn't currently replaying from the master.

Handily this approach would give us cascading, support for intermediate
servers, and the option of only having failover slots on some replicas not
others. All things that were raised as concerns with failover slots.

However, clients would then have to know about the replica(s) of the master
that were failover candidates and would have to send feedback to advance
the client's slots on those nodes, not just the master. They'd have to be
able to connect to the replicas too. Unless we added some mechanism for the
master to lazily relay those feedback messages to replicas, anyway. Not a
major roadblock, just a bit fiddlier for clients.

Consistency shouldn't be a problem so long as the slot created on the
replica reaches SNAPBUILD_CONSISTENT (or there's enough pending WAL for it
to do so) before failover is required.

I think it'd be a somewhat reasonable alternative to failover slots and
it'd make it much more practical to decode from a replica. Which would be
great. It'd be fiddlier for clients, but probably worth it to get rid of
the limitations failover slots impose.



> > It'd also be necessary to move
> > CheckPointReplicationSlots() out of CheckPointGuts()  to the start of a
> > checkpoint/restartpoint when WAL writing is still permitted, like the
> > failover slots patch does.
>
> Ugh. That makes me rather wary.
>

Your comments say it's called in CheckPointGuts for convenience... and
really there doesn't seem to be anything that makes a slot checkpoint
especially tied to a "real" checkpoint.


> > Basically, failover slots as a plugin using a hook, without the
> > additions to base backup commands and the backup label.
>
> I'm going to be *VERY* hard to convince that adding a hook inside
> checkpointing code is acceptable.
>

Yeah... it's in ReplicationSlotSave, but it's still a slot checkpoint even
if (per above) it ceases to also be in the middle of a full system
checkpoint.


> > I'd really hate 9.6 to go out with - still - no way to use logical
> decoding
> > in a basic, bog-standard HA/failover environment. It overwhelmingly
> limits
> > their utility and it's becoming a major drag on practical use of the
> > feature. That's a difficulty given that the failover slots patch isn't
> > especially trivial and you've shown that lazy sync of slot state is not
> > sufficient.
>
> I think the right way to do this is to focus on failover for logical
> rep, with separate slots. The whole idea of integrating this physical
> rep imo makes this a *lot* more complex than necessary. Not all that
> many people are going to want to mix physical rep and logical rep.
>

If you're saying we should focus on failover between nodes that're
themselves connected using logical replication rather than physical
replication, I really have to strongly disagree.

TL;DR for book-length below: We're a long, long way from being able to
deliver even vaguely decent logical rep based failover. Without that or
support for logical decoding to survive physical failover we've got a great
tool in logical decoding that can't be used effectively with most
real-world systems.


I originally thought logical rep based failover was the way forward too and
that mixing physical and logical rep didn't make sense.

The problem is that we're a very, very long way from there, whereas we can

Re: [HACKERS] Timeline following for logical slots

2016-04-04 Thread Andres Freund
On 2016-04-04 17:50:02 +0800, Craig Ringer wrote:
> To rephrase per my understanding: The client only specifies the point it
> wants to start seeing decoded commits. Decoding starts from the slot's
> restart_lsn, and that's the point from which the accumulation of reorder
> buffer contents begins, the snapshot building process begins, and where
> accumulation of relcache invalidation information begins. At restart_lsn no
> xact that is to be emitted to the client may yet be in progress. Decoding,
s/yet/already/
> whether or not the xacts will be fed to the output plugin callbacks,
> requires access to the system catalogs. Therefore catalog_xmin reported by
> the slot must be >= the real effective catalog_xmin of the heap and valid
> at the restart_lsn, not just the confirmed flush point or the point the
> client specifies to resume fetching changes from.

Hm. Maybe I'm misunderstanding you here, but doesn't it have to be <=?

> On the original copy of the slot on the pre-failover master the restart_lsn
> would've been further ahead, as would the catalog_xmin. So catalog rows
> have been purged.
+may


> So it's necessary to ensure that the slot's restart_lsn and catalog_xmin
> are advanced in a timely, consistent manner on the replica's copy of the
> slot at a point where no vacuum changes to the catalog that could remove
> needed tuples have been replayed.

Right.


> The only way I can think of to do that really reliably right now, without
> full failover slots, is to use the newly committed pluggable WAL mechanism
> and add a hook to SaveSlotToPath() so slot info can be captured, injected
> in WAL, and replayed on the replica.

I personally think the primary answer is to use separate slots on
different machines. Failover slots can be an extension to that at some
point, but I think they're a secondary goal.


> It'd also be necessary to move
> CheckPointReplicationSlots() out of CheckPointGuts()  to the start of a
> checkpoint/restartpoint when WAL writing is still permitted, like the
> failover slots patch does.

Ugh. That makes me rather wary.


> Basically, failover slots as a plugin using a hook, without the
> additions to base backup commands and the backup label.

I'm going to be *VERY* hard to convince that adding a hook inside
checkpointing code is acceptable.


> I'd really hate 9.6 to go out with - still - no way to use logical decoding
> in a basic, bog-standard HA/failover environment. It overwhelmingly limits
> their utility and it's becoming a major drag on practical use of the
> feature. That's a difficulty given that the failover slots patch isn't
> especially trivial and you've shown that lazy sync of slot state is not
> sufficient.

I think the right way to do this is to focus on failover for logical
rep, with separate slots. The whole idea of integrating this physical
rep imo makes this a *lot* more complex than necessary. Not all that
many people are going to want to mix physical rep and logical rep.


> The restart_lsn from the newer copy of the slot is, as you said, a point we
> know we can reconstruct visibility info.

We can on the master. There's absolutely no guarantee that the
associated serialized snapshot is present on the standby.


Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Timeline following for logical slots

2016-04-04 Thread Craig Ringer
On 1 April 2016 at 14:46, Andres Freund  wrote:


> > However: It's safe for the slot state on the replica to be somewhat
> behind
> > the local replay from the master, so the slot state on the replica is
> older
> > than what it would've been at an equivalent time on the master... so long
> > as it's not so far behind that the replica has replayed vacuum activity
> > that removes rows still required by the slot's declared catalog_xmin.
> Even
> > then it should actually be OK in practice because the failing-over client
> > will always have replayed past that point on the master (otherwise
> > catalog_xmin couldn't have advanced on the master), so it'll always ask
> to
> > start replay at a point at or after the lsn where the catalog_xmin was
> > advanced to its most recent value on the old master before failover. It's
> > only safe for walsender based decoding though, since there's no way to
> > specify the startpoint for sql-level decoding.
>
> This whole logic falls entirely flat on its face by the fact that even
> if you specify a startpoint, we read older WAL and process the
> records. The startpoint filters every transaction that commits earlier
> than the "startpoint". But with a long-running transaction you still
> will have old changes being processed, which need the corresponding
> catalog state.
>

Right. Having read over those draft docs I see what you mean.

TL;DR: the only way I can see to do this now is full-on failover slots, or
something very similar that uses pluggable WAL + a slot save hook + moving
CheckPointReplicationSlots to the start of a checkpoint while WAL is still
writeable.



To rephrase per my understanding: The client only specifies the point it
wants to start seeing decoded commits. Decoding starts from the slot's
restart_lsn, and that's the point from which the accumulation of reorder
buffer contents begins, the snapshot building process begins, and where
accumulation of relcache invalidation information begins. At restart_lsn no
xact that is to be emitted to the client may yet be in progress. Decoding,
whether or not the xacts will be fed to the output plugin callbacks,
requires access to the system catalogs. Therefore catalog_xmin reported by
the slot must be >= the real effective catalog_xmin of the heap and valid
at the restart_lsn, not just the confirmed flush point or the point the
client specifies to resume fetching changes from.

On the original copy of the slot on the pre-failover master the restart_lsn
would've been further ahead, as would the catalog_xmin. So catalog rows
have been purged. When we decode from the old restart_lsn on the replica
after failover we're not just doing excess work, we'd be accumulating
probably-incorrect invalidation information for the xacts that we're
decoding (but not sending to the client). Or just failing to find entries
we expect to find in the caches and dying.

If the restart_lsn was advanced on the master there must be another safe
decoding restart point after the one the old copy of the slot points to.
But we don't know where, and we don't have any way to know that we *are* in
such a post-failover situation so we can't just scan forward looking for it
without looking up invalidation info for xacts we know we'll be throwing
away.

So it's necessary to ensure that the slot's restart_lsn and catalog_xmin
are advanced in a timely, consistent manner on the replica's copy of the
slot at a point where no vacuum changes to the catalog that could remove
needed tuples have been replayed.

The only way I can think of to do that really reliably right now, without
full failover slots, is to use the newly committed pluggable WAL mechanism
and add a hook to SaveSlotToPath() so slot info can be captured, injected
in WAL, and replayed on the replica.  It'd also be necessary to move
CheckPointReplicationSlots() out of CheckPointGuts()  to the start of a
checkpoint/restartpoint when WAL writing is still permitted, like the
failover slots patch does. Basically, failover slots as a plugin using a
hook, without the additions to base backup commands and the backup label.
I'm not convinced that's at all better than the full failover slots, and
I'm not sure I can make a solid argument for the general purpose utility of
the hook it'd need, but at least it'd avoid committing core to that
approach.
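
To make the shape of that idea concrete, here is a rough, purely hypothetical
sketch (none of these names exist in PostgreSQL; SlotOnDiskState, SlotSaveHook
and the hook point are invented for illustration only) of what such a hook
could look like:

#include <stdint.h>
#include <stddef.h>

typedef struct SlotOnDiskState
{
    char        name[64];
    uint64_t    restart_lsn;
    uint64_t    confirmed_flush;
    uint32_t    catalog_xmin;
} SlotOnDiskState;

/* Hook an extension could install to capture slot state as it is persisted. */
typedef void (*SlotSaveHook) (const SlotOnDiskState *state);

static SlotSaveHook slot_save_hook = NULL;

/*
 * Hypothetical hook point, called after the slot file has been written and
 * fsynced; the extension's hook would serialize the state into WAL so a
 * replica could replay it into its own copy of the slot.
 */
static void
call_slot_save_hook(const SlotOnDiskState *state)
{
    if (slot_save_hook != NULL)
        slot_save_hook(state);
}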

I'd really hate 9.6 to go out with - still - no way to use logical decoding
in a basic, bog-standard HA/failover environment. It overwhelmingly limits
their utility and it's becoming a major drag on practical use of the
feature. That's a difficulty given that the failover slots patch isn't
especially trivial and you've shown that lazy sync of slot state is not
sufficient.

> It's also OK for the slot state to be *ahead* of the replica's replay
> > position, i.e. from the future. restart_lsn and catalog_xmin will be
> higher
> > than they were on the master at the same point in time, but that doesn't
> > matter at all, since the client 

Re: [HACKERS] Timeline following for logical slots

2016-04-04 Thread Craig Ringer
On 4 April 2016 at 14:43, Andres Freund  wrote:


> > OK, makes sense. And still resume decoding from restart_lsn, despite
> having
> > that visibility information stashed, because we also have to rebuild the
> > information on invalidations for running xacts, which is not stored
> > persistently anywhere as decoding progresses. So for now at least it's an
> > optimisation to store the visibility info, since we still have to go back
> > and decode for invalidations anyway. Right?
>
> Not really no. The important point isn't invalidation or anything. It's
> that we don't have the content & metadata of the transactions
> themselves.  Yes, we also do re-read invalidations, but that's just a
> side effect.
>

Right, because the reorder buffer contents aren't persistent so decoding
must restart at an lsn at or before the start of the oldest uncommitted
xact at the time it wants to start returning committed changes to the
client.
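
As a rough standalone sketch of that constraint (simplified types and invented
names, not the actual reorderbuffer code):

#include <stdint.h>
#include <stddef.h>

typedef uint64_t XLogRecPtr;

typedef struct InProgressXact
{
    XLogRecPtr  first_lsn;      /* LSN of this xact's first decoded change */
} InProgressXact;

/*
 * Given a point from which we'd like to start returning committed changes,
 * the newest LSN that is actually safe to restart decoding from is no later
 * than the first change of the oldest transaction still in progress there.
 */
static XLogRecPtr
safe_restart_point(const InProgressXact *xacts, size_t nxacts,
                   XLogRecPtr candidate)
{
    size_t      i;

    for (i = 0; i < nxacts; i++)
    {
        if (xacts[i].first_lsn < candidate)
            candidate = xacts[i].first_lsn;
    }
    return candidate;
}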

I've been looking too hard for the details and somehow managed to
completely miss the blindingly obvious right in front of me. I knew that
the confirmed lsn was the threshold of confirmed commit LSNs, I knew about
the reorder buffer needing to accumulate changes, I knew it wasn't
persistent, etc. Yet I can't pretend I didn't overlook that when looking
over the xlogreader vs decoding startup, per those comments.

I don't think a succinct comment there will be useful given that it'd
really just do a better job of explaining what restart_lsn is and why we
start decoding there. So I guess a more detailed comment on the slot struct
(not on the field, at the start) would be good. Or that README.
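
For example, a first draft of such a comment (my wording, summarizing this
thread, so treat it as a sketch rather than authoritative text) might read:

/*
 * A logical slot tracks several positions with distinct meanings:
 *
 * restart_lsn      - where decoding physically restarts reading WAL.  It
 *                    must be old enough that every transaction still in
 *                    progress at confirmed_flush can be reassembled in the
 *                    reorder buffer, which is not persistent.
 *
 * confirmed_flush  - the client's confirmed position; only transactions
 *                    committing at or after this point are sent to the
 *                    output plugin, even though older WAL is still decoded.
 *
 * catalog_xmin     - the oldest catalog tuples that must be kept around so
 *                    that WAL from restart_lsn onwards can still be decoded.
 */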

Thanks for your patience.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Timeline following for logical slots

2016-04-04 Thread Andres Freund
On 2016-04-04 14:36:29 +0800, Craig Ringer wrote:
> On 4 April 2016 at 14:30, Andres Freund  wrote:
> 
> > On 2016-04-04 14:24:52 +0800, Craig Ringer wrote:
> > > I don't feel like I've grasped this properly yet. I think it's referring
> > to
> > > the pg_logical/snapshots/ serialization, the use of which allows us to
> > > avoid doing extra work in SnapBuildFindSnapshot(...), but doesn't seem to
> > > be crucial for correct function. After all, decoding still restarts at
> > the
> > > restart_lsn and feeds relevant xact info into the snapshot builder,
> > > accumulates invalidation information, etc.
> >
> > restart_lsn is set to the last known point where a) all changes for
> > ongoing transactions are available b) we can re-build visibility
> > information when we start reading from there.
> >
> > We essentially can only start determining visibility information when
> > processing an xl_running_xacts record. Building visibility
> > information means that there has to be an xl_running_xacts record to start
> > from. To build full visibility information we also have to wait until we
> > have seen all in-progress transactions finish. So we dump visibility
> > information every now and then, so we can re-use the information we'd
> > already assembled.
> >
> 
> OK, makes sense. And still resume decoding from restart_lsn, despite having
> that visibility information stashed, because we also have to rebuild the
> information on invalidations for running xacts, which is not stored
> persistently anywhere as decoding progresses. So for now at least it's an
> optimisation to store the visibility info, since we still have to go back
> and decode for invalidations anyway. Right?

Not really no. The important point isn't invalidation or anything. It's
that we don't have the content & metadata of the transactions
themselves.  Yes, we also do re-read invalidations, but that's just a
side effect.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Timeline following for logical slots

2016-04-04 Thread Andres Freund
On 2016-04-04 14:18:53 +0800, Craig Ringer wrote:
> I want to go over the rest of your reply in more detail, but regarding this
> and the original comment, I'm going by:
> 
> if (start_lsn == InvalidXLogRecPtr)
> {
> /* continue from last position */
> start_lsn = slot->data.confirmed_flush;
> }
> 
> ctx = StartupDecodingContext(output_plugin_options,
>  start_lsn, InvalidTransactionId,
>  read_page, prepare_write, do_write);
> 
> ...  in CreateDecodingContext(), which in turn does ...
> 
> ctx->snapshot_builder =
> AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn);
> 
> ... in StartupDecodingContext(...).
> 
> The snapshot builder controls when it returns
> SNAPBUILD_CONSISTENT, and therefore when we start returning decoded commits to
> the client. Right?

> We pass InvalidXLogRecPtr as the start_lsn
> in pg_logical_slot_get_changes_guts(...).
> 
> So, when I wrote that:
> 
> /*
>  * We start reading xlog from the restart lsn, even though in
>  * CreateDecodingContext we set the snapshot builder up using the
>  * slot's confirmed_flush. This means we might read xlog we don't
>  * actually decode rows from, but the snapshot builder might need it
>  * to get to a consistent point. The point we start returning data
> to
>  * *users* at is the confirmed_flush lsn set up in the decoding
>  * context.
>  */
> 
> I don't see what's wrong there at all.

They're independent. What the snapshot builder gets as the start
position isn't the same as where we start reading WAL.


> We do "start reading xlog from the slot's restart_lsn". That's what gets
> passed to set up the xlogreader. That's where we read the first xlog
> records from.
> 
> In CreateDecodingContext we _do_ "set the snapshot builder up using the
> slot's confirmed_flush" [unless overridden by explicit argument to
> CreateDecodingContext, which isn't given here].

Yes. But again, that hasn't really got anything to do with where we read
WAL from. We always have to read older WAL than where confirmed_flush is
set to, in order to assemble all the changes from transactions that are
in-progress at the point of confirmed_flush.


> This is presumably what you're taking issue with:
> 
>  * This means we might read xlog we don't
>  * actually decode rows from, but the snapshot builder might need it
>  * to get to a consistent point.
> 
> and would be better worded as something like:
> 
> This means we will read and decode xlog from before the point we actually
> start returning decoded commits to the client, but the snapshot builder may
> need this additional xlog to get to a consistent point.

> right?

No. The snapshot builder really hasn't gotten that much to do with
things. The fundamental thing we have to do is re-assemble all
in-progress transactions (as in reorderbuffer.c, not snapbuild.c).


> I think
> 
> The point we start returning data to
>  * *users* at is the confirmed_flush lsn set up in the decoding
>  * context.
> 
> is still right.
> 
> What I was proposing instead is, in pg_logical_slot_get_changes_guts,
> changing:
> 
> ctx = CreateDecodingContext(InvalidXLogRecPtr,
> options,
> logical_read_local_xlog_page,
> LogicalOutputPrepareWrite,
> LogicalOutputWrite);
> 
> so that instead of InvalidXLogRecPtr we explicitly pass
> 
>MyReplicationSlot->data.confirmed_flush;
> 
> thus making it way easier to see what's going on without having to chase
> way deeper to figure out why data isn't returned from the restart_lsn.
> Where, y'know, you'd expect decoding to restart from... and where it does
> restart from, just not where it resumes sending output to the client from.

I just don't buy this. It's absolutely fundamental to understand that
the commit LSN isn't the point where all the data of transactions is
stored.


Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Timeline following for logical slots

2016-04-04 Thread Craig Ringer
On 4 April 2016 at 14:30, Andres Freund  wrote:

> On 2016-04-04 14:24:52 +0800, Craig Ringer wrote:
> > I don't feel like I've grasped this properly yet. I think it's referring
> to
> > the pg_logical/snapshots/ serialization, the use of which allows us to
> > avoid doing extra work in SnapBuildFindSnapshot(...), but doesn't seem to
> > be crucial for correct function. After all, decoding still restarts at
> the
> > restart_lsn and feeds relevant xact info into the snapshot builder,
> > accumulates invalidation information, etc.
>
> restart_lsn is set to the last known point where a) all changes for
> ongoing transactions are available b) we can re-build visibility
> information when we start reading from there.
>
> We essentially can only start determining visibility information when
> processing an xl_running_xacts record. Building visibility
> information means that there has to be an xl_running_xacts record to start
> from. To build full visibility information we also have to wait until we
> have seen all in-progress transactions finish. So we dump visibility
> information every now and then, so we can re-use the information we'd
> already assembled.
>

OK, makes sense. And still resume decoding from restart_lsn, despite having
that visibility information stashed, because we also have to rebuild the
information on invalidations for running xacts, which is not stored
persistently anywhere as decoding progresses. So for now at least it's an
optimisation to store the visibility info, since we still have to go back
and decode for invalidations anyway. Right?



-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Timeline following for logical slots

2016-04-04 Thread Andres Freund
On 2016-04-04 14:24:52 +0800, Craig Ringer wrote:
> I don't feel like I've grasped this properly yet. I think it's referring to
> the pg_logical/snapshots/ serialization, the use of which allows us to
> avoid doing extra work in SnapBuildFindSnapshot(...), but doesn't seem to
> be crucial for correct function. After all, decoding still restarts at the
> restart_lsn and feeds relevant xact info into the snapshot builder,
> accumulates invalidation information, etc.

restart_lsn is set to the last known point where a) all changes for
ongoing transactions are available b) we can re-build visibility
information when we start reading from there.

We essentially can only start determining visibility information when
processing an xl_running_xacts record. Building visibility
information means that there has to be an xl_running_xacts record to start
from. To build full visibility information we also have to wait until we
have seen all in-progress transactions finish. So we dump visibility
information every now and then, so we can re-use the information we'd
already assembled.
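
A minimal standalone sketch of that mechanism (invented names, simplified
types; the real code lives in snapbuild.c and serializes into
pg_logical/snapshots/):

#include <stdint.h>
#include <stddef.h>

typedef uint64_t XLogRecPtr;

typedef struct SnapshotDump
{
    XLogRecPtr  lsn;        /* LSN of the xl_running_xacts record it covers */
    /* ... serialized visibility information would follow ... */
} SnapshotDump;

/*
 * On restart, pick the newest dump taken at or before the point we want to
 * start reading from; if there is none, we have to wait until all
 * transactions that were in progress have finished before we reach a
 * consistent snapshot.
 */
static const SnapshotDump *
find_usable_dump(const SnapshotDump *dumps, size_t ndumps, XLogRecPtr start)
{
    const SnapshotDump *best = NULL;
    size_t      i;

    for (i = 0; i < ndumps; i++)
    {
        if (dumps[i].lsn <= start && (best == NULL || dumps[i].lsn > best->lsn))
            best = &dumps[i];
    }
    return best;
}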

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Timeline following for logical slots

2016-04-04 Thread Craig Ringer
On 1 April 2016 at 14:52, Andres Freund  wrote:

> Hi,
>
> On 2016-04-01 08:46:01 +0200, Andres Freund wrote:
> > That's a fundamental misunderstanding on your part (perhaps created by
> > imprecise docs).
>
> > > Speaking of which, did you see the proposed README I sent for
> > > src/backend/replication/logical ?
> >
> > I skimmed it. But given we have a CF full of patches, some submitted
> > over a year ago, it seems unfair to spend time on a patch submitted a
> > few days ago.
>
> For that purpos
>
> WRT design readme, it might be interesting to look at 0009 in
>
> http://archives.postgresql.org/message-id/20140127162006.GA25670%40awork2.anarazel.de
>
> That's not up2date obviously, but it still might help.
>

Thanks, I've been reading it and the posts it references.

Most of it was familiar by this point, but would've been a good reference
earlier on. The snapshot builder docs in README.SNAPBUILD.txt are handy and
help glue a few separate pieces together better for me, and the
invalidations section was brief but informative.

The very last point looks interesting, but only really alludes to what's
going on:


+== Restartable Decoding ==
+
+As we want to generate a consistent stream of changes we need to have the
+ability to start from a previously decoded location without waiting
possibly
+very long to reach consistency. For that reason we dump the current
visibility
+information to disk whenever we read an xl_running_xacts record.

I don't feel like I've grasped this properly yet. I think it's referring to
the pg_logical/snapshots/ serialization, the use of which allows us to
avoid doing extra work in SnapBuildFindSnapshot(...), but doesn't seem to
be crucial for correct function. After all, decoding still restarts at the
restart_lsn and feeds relevant xact info into the snapshot builder,
accumulates invalidation information, etc.


After 9.6 I'd like to go through that, update it, and get it in as a README
for logical decoding. It would've done me a lot of good when getting up to
speed.





-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Timeline following for logical slots

2016-04-04 Thread Craig Ringer
On 1 April 2016 at 14:46, Andres Freund  wrote:


>
> > You're thinking from the perspective of someone who knows this code
> > intimately.
>
> Maybe. But that's not addressed by adding half-true comments to places
> they only belong to peripherally. And not to all the relevant places,
> since you've not added the same comment to StartLogicalReplication().
>

I want to go over the rest of your reply in more detail, but regarding this
and the original comment, I'm going by:

if (start_lsn == InvalidXLogRecPtr)
{
/* continue from last position */
start_lsn = slot->data.confirmed_flush;
}

ctx = StartupDecodingContext(output_plugin_options,
 start_lsn, InvalidTransactionId,
 read_page, prepare_write, do_write);

...  in CreateDecodingContext(), which in turn does ...

ctx->snapshot_builder =
AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn);

... in StartupDecodingContext(...).

The snapshot builder controls when it returns
SNAPBUILD_CONSISTENT, and therefore when we start returning decoded commits to
the client. Right?

We pass InvalidXLogRecPtr as the start_lsn
in pg_logical_slot_get_changes_guts(...).

So, when I wrote that:

/*
 * We start reading xlog from the restart lsn, even though in
 * CreateDecodingContext we set the snapshot builder up using the
 * slot's confirmed_flush. This means we might read xlog we don't
 * actually decode rows from, but the snapshot builder might need it
 * to get to a consistent point. The point we start returning data
to
 * *users* at is the confirmed_flush lsn set up in the decoding
 * context.
 */

I don't see what's wrong there at all.

We do "start reading xlog from the slot's restart_lsn". That's what gets
passed to set up the xlogreader. That's where we read the first xlog
records from.

In CreateDecodingContext we _do_ "set the snapshot builder up using the
slot's confirmed_flush" [unless overridden by explicit argument to
CreateDecodingContext, which isn't given here].

This is presumably what you're taking issue with:

 * This means we might read xlog we don't
 * actually decode rows from, but the snapshot builder might need it
 * to get to a consistent point.

and would be better worded as something like:

This means we will read and decode xlog from before the point we actually
start returning decoded commits to the client, but the snapshot builder may
need this additional xlog to get to a consistent point.

right?

I think

The point we start returning data to
 * *users* at is the confirmed_flush lsn set up in the decoding
 * context.

is still right.

What I was proposing instead is, in pg_logical_slot_get_changes_guts,
changing:

ctx = CreateDecodingContext(InvalidXLogRecPtr,
options,
logical_read_local_xlog_page,
LogicalOutputPrepareWrite,
LogicalOutputWrite);

so that instead of InvalidXLogRecPtr we explicitly pass

   MyReplicationSlot->data.confirmed_flush;

thus making it way easier to see what's going on without having to chase
way deeper to figure out why data isn't returned from the restart_lsn.
Where, y'know, you'd expect decoding to restart from... and where it does
restart from, just not where it resumes sending output to the client from.

> Speaking of which, did you see the proposed README I sent for
> > src/backend/replication/logical ?
>
> I skimmed it. But given we have a CF full of patches, some submitted
> over a year ago, it seems unfair to spend time on a patch submitted a
> few days ago


Of course. I wasn't expecting to wedge this into the release, nor do I see
any value in doing so. I just thought you'd probably rather know about it
than not and if it was obviously wrong it'd be good to know. I've found it
very hard to get my head around how logical decoding's parts fit together,
and now that there are some other folks in 2ndQ getting involved  it seemed
like a good idea to start working on making that easier for the next guy.

Thanks for linking to your old docs. While a bit outdated as you noted,
they'll be simple to update and would be really valuable to have in a
README where they're discoverable.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Timeline following for logical slots

2016-04-01 Thread Andres Freund
Hi,

On 2016-04-01 08:46:01 +0200, Andres Freund wrote:
> That's a fundamental misunderstanding on your part (perhaps created by
> imprecise docs).

> > Speaking of which, did you see the proposed README I sent for
> > src/backend/replication/logical ?
> 
> I skimmed it. But given we have a CF full of patches, some submitted
> over a year ago, it seems unfair to spend time on a patch submitted a
> few days ago.

For that purpos

WRT design readme, it might be interesting to look at 0009 in
http://archives.postgresql.org/message-id/20140127162006.GA25670%40awork2.anarazel.de

That's not up2date obviously, but it still might help.

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Timeline following for logical slots

2016-04-01 Thread Andres Freund
Hi,

On 2016-04-01 13:16:18 +0800, Craig Ringer wrote:
> I think it's pretty unsafe from SQL, to be sure.
> 
> Unless failover slots get in to 9.6 we'll need to do exactly that from
> internal C stuff in pglogical to support following physical failover,

I know. And this makes me scared shitless.


I'll refuse to debug anything related to decoding, when this "feature"
has been used. And I think there'll be plenty; if you notice the issues
that is.


The more I think about it, the more I think we should rip this out
again, until the actual feature is in. With proper review.


> However: It's safe for the slot state on the replica to be somewhat behind
> the local replay from the master, so the slot state on the replica is older
> than what it would've been at an equivalent time on the master... so long
> as it's not so far behind that the replica has replayed vacuum activity
> that removes rows still required by the slot's declared catalog_xmin. Even
> then it should actually be OK in practice because the failing-over client
> will always have replayed past that point on the master (otherwise
> catalog_xmin couldn't have advanced on the master), so it'll always ask to
> start replay at a point at or after the lsn where the catalog_xmin was
> advanced to its most recent value on the old master before failover. It's
> only safe for walsender based decoding though, since there's no way to
> specify the startpoint for sql-level decoding.

This whole logic falls entirely flat on its face by the fact that even
if you specify a startpoint, we read older WAL and process the
records. The startpoint filters every transaction that commits earlier
than the "startpoint". But with a long-running transaction you still
will have old changes being processed, which need the corresponding
catalog state.


> My only real worry there is whether invalidations are a concern - if
> internal replay starts at the restart_lsn and replays over part of history
> where the catalog entries have been vacuumed before it starts collecting
> rows for return to the decoding client at the requested decoding
> startpoint, can that cause problems with invalidation processing?

Yes.


> It's also OK for the slot state to be *ahead* of the replica's replay
> position, i.e. from the future. restart_lsn and catalog_xmin will be higher
> than they were on the master at the same point in time, but that doesn't
> matter at all, since the client will always be requesting to start from
> that point or later, having replayed up to there on the old master before
> failover.

I don't think that's true. See my point above about startpoint.



> > Andres, I tried to address your comments as best I could. The main one
> > that
> > > I think stayed open was about the loop that finds the last timeline on a
> > > segment. If you think that's better done by directly scanning the List*
> > of
> > > timeline history entries I'm happy to prep a follow-up.
> >
> > Have to look again.
> >
> > +* We start reading xlog from the restart lsn, even though in
> > +* CreateDecodingContext we set the snapshot builder up using the
> > +* slot's confirmed_flush. This means we might read xlog we don't
> > +* actually decode rows from, but the snapshot builder might need
> > it
> > +* to get to a consistent point. The point we start returning data
> > to
> > +* *users* at is the confirmed_flush lsn set up in the decoding
> > +* context.
> > +*/
> > still seems pretty misleading - and pretty much unrelated to the
> > callsite.
> >
> 
> 
> I think it's pretty confusing that the function uses the slot's
> restart_lsn and never refers to confirmed_flush which is where it actually
> starts decoding from as far as the user's concerned. It took me a while to
> figure out what was going on there.

That's a fundamental misunderstanding on your part (perhaps created by
imprecise docs). The startpoint isn't about where to start
decoding. It's about skipping the output plugin calls for a transaction if
it *commits* before the startpoint. We almost always will *decode* rows
from before the startpoint.  And that's pretty much unavoidable: the
consumer of a decoding slot can only really do anything with commit LSNs
wrt replay progress, but for a remembered progress LSN there can be a
lot of in-progress xacts (about which said client doesn't know
anything), and they start *earlier* than the commit LSN of the just
replayed xact. So we have to be able to re-process their state and send
it out.
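
In other words (a simplified standalone sketch, not the actual decoding code;
the names are invented):

#include <stdint.h>
#include <stdbool.h>

typedef uint64_t XLogRecPtr;

/*
 * Decoding always reads WAL from restart_lsn so in-progress transactions can
 * be reassembled; the startpoint only decides whether a transaction's changes
 * are replayed to the client once its commit record is seen.
 */
static bool
send_xact_to_client(XLogRecPtr commit_lsn, XLogRecPtr startpoint)
{
    return commit_lsn >= startpoint;
}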


> I think the comment would be less necessary, and could be moved up to the
> CreateDecodingContext call, if we passed the slot's confirmed_flush
> explicitly to CreateDecodingContext instead of passing InvalidXLogRecPtr
> and having the CreateDecodingContext call infer the start point. That way
> what's going on would be a lot clearer when reading the code.

How would that solve anything? We still need to process the old records,

Re: [HACKERS] Timeline following for logical slots

2016-03-31 Thread Craig Ringer
On 31 March 2016 at 16:09, Andres Freund  wrote:


> > This gives us an option for failover of logical replication in 9.6, even
> if
> > it's a bit cumbersome and complex for the client, in case failover slots
> > don't make the cut. And, of course, it's a pre-req for failover slots,
> > which I'll rebase on top of it shortly.
>
> FWIW, I think it's dangerous to use this that way. If people manipulate
> slots that way we'll have hellishly hard-to-debug issues.


I think it's pretty unsafe from SQL, to be sure.

Unless failover slots get in to 9.6 we'll need to do exactly that from
internal C stuff in pglogical to support following physical failover,
though. I'm not thrilled by that, especially since we have no way (in 9.6)
to insert generic WAL records in xlog to allow the slot updates to
accurately pace replay. I don't mean logical wal messages here, they
wouldn't help, it'd actually be pluggable generic WAL that we'd need to
sync slots fully consistently in the absence of failover slots.

However: It's safe for the slot state on the replica to be somewhat behind
the local replay from the master, so the slot state on the replica is older
than what it would've been at an equivalent time on the master... so long
as it's not so far behind that the replica has replayed vacuum activity
that removes rows still required by the slot's declared catalog_xmin. Even
then it should actually be OK in practice because the failing-over client
will always have replayed past that point on the master (otherwise
catalog_xmin couldn't have advanced on the master), so it'll always ask to
start replay at a point at or after the lsn where the catalog_xmin was
advanced to its most recent value on the old master before failover. It's
only safe for walsender based decoding though, since there's no way to
specify the startpoint for sql-level decoding.

My only real worry there is whether invalidations are a concern - if
internal replay starts at the restart_lsn and replays over part of history
where the catalog entries have been vacuumed before it starts collecting
rows for return to the decoding client at the requested decoding
startpoint, can that cause problems with invalidation processing? I haven't
got my head around what, exactly, logical decoding is doing with its
invalidations stuff yet. I didn't find any problems in testing, but wasn't
sure how to create the conditions under which failures might occur.

It's also OK for the slot state to be *ahead* of the replica's replay
position, i.e. from the future. restart_lsn and catalog_xmin will be higher
than they were on the master at the same point in time, but that doesn't
matter at all, since the client will always be requesting to start from
that point or later, having replayed up to there on the old master before
failover.

It's even possible for the restart_lsn and catalog_xmin to be in the
replica's future, i.e. the lsn > the current replay lsn and the
catalog_xmin greater than the next xid. In that case we know the logical
client replayed state from the old master that hasn't been applied to the
replica by the time it's promoted. If this state of affairs exists when we
promote a replica to a master we should really kill the slot, since the
client has obviously replayed from the old master past the point of
divergence with the promoted replica and there's potentially an
unresolvable timeline fork.  I'd like to add this if I can manage it,
probably by WARNing a complaint then dropping the slot.
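
A rough sketch of that promotion-time check (simplified types, invented names;
real code would use wraparound-aware xid comparison):

#include <stdint.h>
#include <stdbool.h>

typedef uint64_t XLogRecPtr;
typedef uint32_t TransactionId;

/*
 * True if the copied slot's state comes from the old master's "future"
 * relative to this replica, i.e. the client has replayed past the point of
 * divergence; such a slot should be dropped (with a WARNING) at promotion.
 */
static bool
slot_is_from_future(XLogRecPtr slot_restart_lsn,
                    TransactionId slot_catalog_xmin,
                    XLogRecPtr replay_lsn,
                    TransactionId next_xid)
{
    return slot_restart_lsn > replay_lsn || slot_catalog_xmin > next_xid;
}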

Needless to say I'd really rather just have failover slots, where we know
the slot will be consistent with the DB state and updated in sync with it.
This is a fallback plan and I don't like it too much.


> The test code needs
> a big disclaimer to never ever be used in production, and we should
> "disclaim any warranty" if somebody does that. To the point of not
> fixing issues around it in back branches.
>

I agree. The README has just that, and there are comments above the
individual functions like:

 * Note that this is test harness code. You shouldn't expose slot internals
 * to SQL like this for any real world usage. See the README.

> Andres, I tried to address your comments as best I could. The main one
> that
> > I think stayed open was about the loop that finds the last timeline on a
> > segment. If you think that's better done by directly scanning the List*
> of
> > timeline history entries I'm happy to prep a follow-up.
>
> Have to look again.
>
> +* We start reading xlog from the restart lsn, even though in
> +* CreateDecodingContext we set the snapshot builder up using the
> +* slot's confirmed_flush. This means we might read xlog we don't
> +* actually decode rows from, but the snapshot builder might need
> it
> +* to get to a consistent point. The point we start returning data
> to
> +* *users* at is the confirmed_flush lsn set up in the decoding
> +* context.
> +*/
> still seems pretty 

Re: [HACKERS] Timeline following for logical slots

2016-03-31 Thread Andres Freund
Hi,

On 2016-03-31 08:52:34 +0800, Craig Ringer wrote:
> On 31 March 2016 at 07:15, Alvaro Herrera  wrote:
> 
> 
> > > Available attached or at
> > >
> > https://github.com/2ndQuadrant/postgres/tree/dev/logical-decoding-timeline-following
> >
> > And pushed this too.
> >
> 
> Much appreciated. Marked as committed at
> https://commitfest.postgresql.org/9/568/ .
> 
> This gives us an option for failover of logical replication in 9.6, even if
> it's a bit cumbersome and complex for the client, in case failover slots
> don't make the cut. And, of course, it's a pre-req for failover slots,
> which I'll rebase on top of it shortly.

FWIW, I think it's dangerous to use this that way. If people manipulate
slots that way we'll have hellishly hard-to-debug issues. The test code needs
a big disclaimer to never ever be used in production, and we should
"disclaim any warranty" if somebody does that. To the point of not
fixing issues around it in back branches.

> Andres, I tried to address your comments as best I could. The main one that
> I think stayed open was about the loop that finds the last timeline on a
> segment. If you think that's better done by directly scanning the List* of
> timeline history entries I'm happy to prep a follow-up.

Have to look again.

+* We start reading xlog from the restart lsn, even though in
+* CreateDecodingContext we set the snapshot builder up using the
+* slot's confirmed_flush. This means we might read xlog we don't
+* actually decode rows from, but the snapshot builder might need it
+* to get to a consistent point. The point we start returning data to
+* *users* at is the confirmed_flush lsn set up in the decoding
+* context.
+*/
still seems pretty misleading - and pretty much unrelated to the
callsite.


Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Timeline following for logical slots

2016-03-30 Thread Craig Ringer
On 31 March 2016 at 07:15, Alvaro Herrera  wrote:


> > Available attached or at
> >
> https://github.com/2ndQuadrant/postgres/tree/dev/logical-decoding-timeline-following
>
> And pushed this too.
>

Much appreciated. Marked as committed at
https://commitfest.postgresql.org/9/568/ .

This gives us an option for failover of logical replication in 9.6, even if
it's a bit cumbersome and complex for the client, in case failover slots
don't make the cut. And, of course, it's a pre-req for failover slots,
which I'll rebase on top of it shortly.

Andres, I tried to address your comments as best I could. The main one that
I think stayed open was about the loop that finds the last timeline on a
segment. If you think that's better done by directly scanning the List* of
timeline history entries I'm happy to prep a follow-up.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Timeline following for logical slots

2016-03-30 Thread Alvaro Herrera
Craig Ringer wrote:

> It removes the questionable cleanups, fixes the randAccess comment (IMO),

I pushed cleanup separately, including the addition of a few comments
that were documenting the original code.  I actually made a mistake in
extracting those, so there's one comment that's actually bogus in that
commit (the candidate_restart_lsn one); it's fixed in the second commit.
Sorry about that.  I also introduced XLogReaderInvalReadState here,
called XLogReaderInvalCache originally, since Andres objected to the
"cache" terminology (though you will note that the word "cache" was used
in the original code too.)

>  changes the "open a new xlog segment" wording and explains why we don't
> need GetFlushRecPtr/GetXLogReplayRecPtr on replay of a historical timeline.
> I removed the change that did less than a whole page at the XLogRead call
> and instead added a comment explaining it. Fixed the brainfart with
> candidate_restart_lsn, removed the outdated startptr comment and update, and
> the unnecessary XLogReaderInvalCache call after it.
> 
> Also renamed src/test/modules/decoding_failover to
> src/test/modules/test_slot_timelines . Best name I could come up with that
> wasn't 1000chars long.
> 
> Passes main 'check', contrib/test_decoding check,
> src/test/modules/test_slot_timelines check and src/test/recovery check.
> 
> Available attached or at
> https://github.com/2ndQuadrant/postgres/tree/dev/logical-decoding-timeline-following

And pushed this too.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Timeline following for logical slots

2016-03-22 Thread Craig Ringer
On 22 March 2016 at 19:48, Andres Freund  wrote:

> On 2016-03-15 21:04:12 +0800, Craig Ringer wrote:
> > Thanks very much for the review.
>
> Are you planning to update the patch?
>
>
Updated patch attached.

It removes the questionable cleanups, fixes the randAccess comment (IMO),
 changes the "open a new xlog segment" wording and explains why we don't
need GetFlushRecPtr/GetXLogReplayRecPtr on replay of a historical timeline.
I removed the change that did less than a whole page at the XLogRead call
and instead added a comment explaining it. Fixed the brainfart with
candidate_restart_lsn, removed the outdated startptr comment and update, and
the unnecessary XLogReaderInvalCache call after it.

Also renamed src/test/modules/decoding_failover to
src/test/modules/test_slot_timelines . Best name I could come up with that
wasn't 1000chars long.

Passes main 'check', contrib/test_decoding check,
src/test/modules/test_slot_timelines check and src/test/recovery check.

Available attached or at
https://github.com/2ndQuadrant/postgres/tree/dev/logical-decoding-timeline-following
.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From e1fb318e4c25b93770c53425277f3efa83ccb618 Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Wed, 23 Mar 2016 09:03:04 +0800
Subject: [PATCH] Timeline following for logical decoding

Allow logical slots to follow timeline switches

Make logical replication slots timeline-aware, so replay can
continue from a historical timeline onto the server's current
timeline.

This is required to make failover slots possible and may also
be used by extensions that CreateReplicationSlot on a standby
and replay from that slot once the replica is promoted.

This does NOT add support for replaying from a logical slot on
a standby or for syncing slots to replicas.
---
 src/backend/access/transam/xlogreader.c|  51 +++-
 src/backend/access/transam/xlogutils.c | 260 --
 src/backend/replication/logical/logicalfuncs.c |  33 ++-
 src/include/access/xlogreader.h|  37 ++-
 src/include/access/xlogutils.h |   2 +-
 src/test/modules/Makefile  |   1 +
 src/test/modules/test_slot_timelines/.gitignore|   3 +
 src/test/modules/test_slot_timelines/Makefile  |  22 ++
 src/test/modules/test_slot_timelines/README|  19 ++
 .../expected/load_extension.out|  19 ++
 .../test_slot_timelines/sql/load_extension.sql |   7 +
 .../test_slot_timelines--1.0.sql   |  16 ++
 .../test_slot_timelines/test_slot_timelines.c  | 133 +
 .../test_slot_timelines/test_slot_timelines.conf   |   2 +
 .../test_slot_timelines.control|   5 +
 src/test/recovery/Makefile |   2 +
 .../recovery/t/006_logical_decoding_timelines.pl   | 304 +
 17 files changed, 865 insertions(+), 51 deletions(-)
 create mode 100644 src/test/modules/test_slot_timelines/.gitignore
 create mode 100644 src/test/modules/test_slot_timelines/Makefile
 create mode 100644 src/test/modules/test_slot_timelines/README
 create mode 100644 src/test/modules/test_slot_timelines/expected/load_extension.out
 create mode 100644 src/test/modules/test_slot_timelines/sql/load_extension.sql
 create mode 100644 src/test/modules/test_slot_timelines/test_slot_timelines--1.0.sql
 create mode 100644 src/test/modules/test_slot_timelines/test_slot_timelines.c
 create mode 100644 src/test/modules/test_slot_timelines/test_slot_timelines.conf
 create mode 100644 src/test/modules/test_slot_timelines/test_slot_timelines.control
 create mode 100644 src/test/recovery/t/006_logical_decoding_timelines.pl

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index fcb0872..b7d249c 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -10,6 +10,9 @@
  *
  * NOTES
  *		See xlogreader.h for more notes on this facility.
+ *
+ *		This file is compiled as both front-end and backend code, so it
+ *		may not use ereport, server-defined static variables, etc.
  *-
  */
 
@@ -116,6 +119,11 @@ XLogReaderAllocate(XLogPageReadCB pagereadfunc, void *private_data)
 		return NULL;
 	}
 
+#ifndef FRONTEND
+	/* Will be loaded on first read */
+	state->timelineHistory = NIL;
+#endif
+
 	return state;
 }
 
@@ -135,6 +143,10 @@ XLogReaderFree(XLogReaderState *state)
 	pfree(state->errormsg_buf);
 	if (state->readRecordBuf)
 		pfree(state->readRecordBuf);
+#ifndef FRONTEND
+	if (state->timelineHistory)
+		list_free_deep(state->timelineHistory);
+#endif
 	pfree(state->readBuf);
 	pfree(state);
 }
@@ -192,6 +204,10 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 {
 	XLogRecord *record;
 	

Re: [HACKERS] Timeline following for logical slots

2016-03-22 Thread Craig Ringer
On 22 March 2016 at 19:48, Andres Freund  wrote:

> On 2016-03-15 21:04:12 +0800, Craig Ringer wrote:
> > Thanks very much for the review.
>
> Are you planning to update the patch?
>

Yes. I just spoke to Álvaro earlier. I'll pick up his modified version of
my patch, remove the unrelated cleanup stuff he added, amend the other
issues and re-post today.

Then I'll try to get an updated pglogical_output up, now that I've knocked
off a variety of urgent work elsewhere...

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Timeline following for logical slots

2016-03-22 Thread Andres Freund
On 2016-03-15 21:04:12 +0800, Craig Ringer wrote:
> Thanks very much for the review.

Are you planning to update the patch?

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Timeline following for logical slots

2016-03-15 Thread Alvaro Herrera
Craig already replied to most points.  I think we can sum it up as "we've
done the best we can with what we have; maybe somebody can figure out how to
do better in the future, but for now this works."

Andres Freund wrote:

> On 2016-03-14 20:10:58 -0300, Alvaro Herrera wrote:

> > +#ifndef FRONTEND
> > +   if (state->timelineHistory)
> > +   list_free_deep(state->timelineHistory);
> > +#endif
> 
> Hm. So we don't support timelines following for frontend code, although
> it'd be rather helpful for pg_xlogdump. And possibly pg_rewind.

Exactly.  I had the same comment, but the argument that we would need to
rewrite a lot of timeline.c won me over.  That can wait for a future patch;
we certainly don't want to be rewriting these things during the last CF.

> > if (state->ReadRecPtr == InvalidXLogRecPtr)
> > -   randAccess = true;
> > +   randAccess = true;  /* allow readPageTLI to go 
> > backwards */
> 
> randAccess is doing more than that, so I'm doubtful that comment is an
> improvement.

As I said, it's only a copy of what's already a few lines above.  I
would be happier removing both, and instead adding an explanatory
comment about that variable elsewhere.

> >  #include 
> >  
> > -#include "miscadmin.h"
> > -
> 
> spurious change imo.

Yeah, in a way this is cleanup after previous patches that have messed
up things somewhat.  I wouldn't have posted it unless I was close to
committing this ... I think it'd be better to push these cleanup changes
separately.

> > /* Need to seek in the file? */
> > if (sendOff != startoff)
> > {
> > if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0)
> > -   {
> > -   charpath[MAXPGPATH];
> > -
> > -   XLogFilePath(path, tli, sendSegNo);
> > -
> > ereport(ERROR,
> > (errcode_for_file_access(),
> >   errmsg("could not seek in log segment %s to 
> > offset %u: %m",
> > -path, startoff)));
> > -   }
> > +XLogFileNameP(tli, sendSegNo), 
> > startoff)));
> > sendOff = startoff;
> > }
> 
> Not a serious issue, more a general remark: I'm doubtful that going for
> palloc in error situations is good practice. This will be allocated in
> the current memory context; without access to the emergency error
> reserves.

Yeah, this code was copied from elsewhere in 422a55a68784f but there is
already another version of this routine in walsender which is using
XLogFileNameP instead of the stack-allocated one.  I just made this one
more similar to the walsender's (which is in backend environment just
like this one) instead of continuing to use the frontend-limited
version (which is where this one was copied from).

I'm having a hard time getting concerned about using palloc there,
considering that every single call of XLogFileNameP occurs in an error
message.  If we want to reject this one, surely we should change all the
others too.

> I'm also getting the feeling that the patch is bordering on doing some
> relatively random cleanups mixed in with architectural changes. Makes
> things a bit harder to review.

Yes.

> > +static void
> > +XLogReadDetermineTimeline(XLogReaderState *state)
> > +{
> > +   /* Read the history on first time through */
> > +   if (state->timelineHistory == NIL)
> > +   state->timelineHistory = readTimeLineHistory(ThisTimeLineID);
> > +
> > +   /*
> > +* Are we reading the record immediately following the one we read last
> > +* time?  If not, then don't use the cached timeline info.
> > +*/
> > +   if (state->currRecPtr != state->EndRecPtr)
> > +   {
> > +   state->currTLI = 0;
> > +   state->currTLIValidUntil = InvalidXLogRecPtr;
> > +   }
> 
> Hm. So we grow essentially a second version of the last end position and
> the randAccess stuff in XLogReadRecord().

Craig's argument against that seems reasonable to me.

> XLogReadDetermineTimeline() doesn't sit quite right with me, I do wonder
> whether there's not a simpler way to write this.

Feel free to suggest something :-)

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Timeline following for logical slots

2016-03-15 Thread Craig Ringer
On 15 March 2016 at 17:12, Andres Freund  wrote:

> Hi
>

Thanks very much for the review.

This patch was split out from failover slots, which itself underwent
quite a few revisions, so I'm really happy to have fresh eyes on it.
Especially more experienced ones.


> On 2016-03-14 20:10:58 -0300, Alvaro Herrera wrote:
> > diff --git a/src/backend/access/transam/xlogreader.c
> b/src/backend/access/transam/xlogreader.c
> > index fcb0872..7b60b8f 100644
> > --- a/src/backend/access/transam/xlogreader.c
> > +++ b/src/backend/access/transam/xlogreader.c
> > @@ -10,9 +10,11 @@
> >   *
> >   * NOTES
> >   *   See xlogreader.h for more notes on this facility.
> > + *
> > + *   This file is compiled as both front-end and backend code,
> so it
> > + *   may not use ereport, server-defined static variables, etc.
> >
>  *-
> >   */
> > -
>
> Huh?
>

I'm not sure what the concern here is. xlogreader *is* compiled as frontend
code - the file gets linked into the tree for pg_xlogdump and pg_rewind, at
least.

I found that really confusing when working on it and thought it merited a
comment.


> > @@ -135,6 +142,10 @@ XLogReaderFree(XLogReaderState *state)
> >   pfree(state->errormsg_buf);
> >   if (state->readRecordBuf)
> >   pfree(state->readRecordBuf);
> > +#ifndef FRONTEND
> > + if (state->timelineHistory)
> > + list_free_deep(state->timelineHistory);
> > +#endif
>
> Hm. So we don't support timelines following for frontend code, although
> it'd be rather helpful for pg_xlogdump. And possibly pg_rewind.
>

Yes, it would. I don't want to address that in the same patch though. It'd
require making timeline.c frontend-clean, dealing with the absence of List
on the frontend, etc, and I don't want to complicate this patch with that.

I've intentionally written the timeline logic so it can pretty easily be
moved into xlogreader.c as a self-contained unit and used for those
utilities once timeline.c can be compiled for frontend too.

>   pfree(state->readBuf);
> >   pfree(state);
> >  }
> > @@ -208,10 +219,11 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr
> RecPtr, char **errormsg)
> >
> >   if (RecPtr == InvalidXLogRecPtr)
> >   {
> > + /* No explicit start point; read the record after the one
> we just read */
> >   RecPtr = state->EndRecPtr;
> >
> >   if (state->ReadRecPtr == InvalidXLogRecPtr)
> > - randAccess = true;
> > + randAccess = true;  /* allow readPageTLI to go backwards */
>
> randAccess is doing more than that, so I'm doubtful that comment is an
> improvement.
>

Yeah, I have no idea what I was on about there, per response to Álvaro's
post.



> > @@ -466,9 +482,7 @@ err:
> >* Invalidate the xlog page we've cached. We might read from a
> different
> >* source after failure.
> >*/
> > - state->readSegNo = 0;
> > - state->readOff = 0;
> > - state->readLen = 0;
> > + XLogReaderInvalCache(state);
>
> I don't think that "cache" is the right way to describe this.
>

Isn't that what it is? It reads a page, caches it, and reuses it for
subsequent requests on the same page. The pre-existing comment even calls
it a cache above.

I don't mind changing it, but don't have any better ideas.
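For reference, the helper amounts to little more than forgetting the
buffered page so the next read fetches it again, possibly from a different
source.  A rough sketch based on the hunks above (the real patch may differ
in detail):

static void
XLogReaderInvalCache(XLogReaderState *state)
{
	/* Forget the page we have buffered; the next read must refetch it. */
	state->readSegNo = 0;
	state->readOff = 0;
	state->readLen = 0;
}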


> >  #include 
> >
> > -#include "miscadmin.h"
> > -
>
> spurious change imo.
>

Added in Álvaro's rev; it puts the header in the correct sort order, but
I'm not sure it should be bundled with this patch.


> > - if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo))
> > + /* Do we need to open a new xlog segment? */
> > + if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo) ||
> > + sendTLI != tli)
> >   {
>
> s/open a new/open a different/?  New imo has connotations that we don't
> really want here.
>

In my original patch this was:

/* Do we need to switch to a new xlog segment? */

but yeah, "open a different" is better than either.



> > 	/* Need to seek in the file? */
> > 	if (sendOff != startoff)
> > 	{
> > 		if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0)
> > -		{
> > -			char		path[MAXPGPATH];
> > -
> > -			XLogFilePath(path, tli, sendSegNo);
> > -
> > 			ereport(ERROR,
> > 					(errcode_for_file_access(),
> > 					 errmsg("could not seek in log segment %s to offset %u: %m",
> > -							path, startoff)));
> > -		}
> > +							XLogFileNameP(tli, sendSegNo), startoff)));
> > 		sendOff = startoff;
> > 	}
>

Re: [HACKERS] Timeline following for logical slots

2016-03-15 Thread Alvaro Herrera
Andres Freund wrote:

> Could you perhaps wait till tomorrow? I'd like to do a pass over it.

Sure thing.  I'll wait for your review.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Timeline following for logical slots

2016-03-15 Thread Andres Freund
Hi,


On 2016-03-14 20:10:58 -0300, Alvaro Herrera wrote:
> diff --git a/src/backend/access/transam/xlogreader.c 
> b/src/backend/access/transam/xlogreader.c
> index fcb0872..7b60b8f 100644
> --- a/src/backend/access/transam/xlogreader.c
> +++ b/src/backend/access/transam/xlogreader.c
> @@ -10,9 +10,11 @@
>   *
>   * NOTES
>   *   See xlogreader.h for more notes on this facility.
> + *
> + *   This file is compiled as both front-end and backend code, so it
> + *   may not use ereport, server-defined static variables, etc.
>   *-
>   */
> -

Huh?

>  #include "postgres.h"
>  
>  #include "access/transam.h"
> @@ -116,6 +118,11 @@ XLogReaderAllocate(XLogPageReadCB pagereadfunc, void 
> *private_data)
>   return NULL;
>   }
>  
> +#ifndef FRONTEND
> + /* Will be loaded on first read */
> + state->timelineHistory = NIL;
> +#endif
> +
>   return state;
>  }
>  
> @@ -135,6 +142,10 @@ XLogReaderFree(XLogReaderState *state)
>   pfree(state->errormsg_buf);
>   if (state->readRecordBuf)
>   pfree(state->readRecordBuf);
> +#ifndef FRONTEND
> + if (state->timelineHistory)
> + list_free_deep(state->timelineHistory);
> +#endif

Hm. So we don't support timelines following for frontend code, although
it'd be rather helpful for pg_xlogdump. And possibly pg_rewind.


>   pfree(state->readBuf);
>   pfree(state);
>  }
> @@ -208,10 +219,11 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr 
> RecPtr, char **errormsg)
>  
>   if (RecPtr == InvalidXLogRecPtr)
>   {
> + /* No explicit start point; read the record after the one we 
> just read */
>   RecPtr = state->EndRecPtr;
>  
>   if (state->ReadRecPtr == InvalidXLogRecPtr)
> - randAccess = true;
> + randAccess = true;  /* allow readPageTLI to go backwards */

randAccess is doing more than that, so I'm doubtful that comment is an
improvement.


>   /*
>* RecPtr is pointing to end+1 of the previous WAL record.  If 
> we're
> @@ -223,6 +235,8 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, 
> char **errormsg)
>   else
>   {
>   /*
> +  * Caller supplied a position to start at.
> +  *
>* In this case, the passed-in record pointer should already be
>* pointing to a valid record starting position.
>*/
> @@ -309,8 +323,10 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr 
> RecPtr, char **errormsg)
>   /* XXX: more validation should be done here */
>   if (total_len < SizeOfXLogRecord)
>   {
> -	report_invalid_record(state, "invalid record length at %X/%X",
> -						  (uint32) (RecPtr >> 32), (uint32) RecPtr);
> +	report_invalid_record(state,
> +				"invalid record length at %X/%X: wanted %lu, got %u",
> +						  (uint32) (RecPtr >> 32), (uint32) RecPtr,
> +						  SizeOfXLogRecord, total_len);
>   goto err;
>   }
>   gotheader = false;
> @@ -466,9 +482,7 @@ err:
>* Invalidate the xlog page we've cached. We might read from a different
>* source after failure.
>*/
> - state->readSegNo = 0;
> - state->readOff = 0;
> - state->readLen = 0;
> + XLogReaderInvalCache(state);

I don't think that "cache" is the right way to describe this.


>  #include 
>  
> -#include "miscadmin.h"
> -

spurious change imo.



>  /*
> - * TODO: This is duplicate code with pg_xlogdump, similar to walsender.c, but
> - * we currently don't have the infrastructure (elog!) to share it.
> + * Read 'count' bytes from WAL into 'buf', starting at location 'startptr'
> + * in timeline 'tli'.
> + *
> + * Will open, and keep open, one WAL segment stored in the static file
> + * descriptor 'sendFile'. This means if XLogRead is used once, there will
> + * always be one descriptor left open until the process ends, but never
> + * more than one.
> + *
> + * XXX This is very similar to pg_xlogdump's XLogDumpXLogRead and to XLogRead
> + * in walsender.c but for small differences (such as lack of elog() in
> + * frontend).  Probably these should be merged at some point.
>   */
>  static void
>  XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
> @@ -648,8 +657,12 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, 
> Size count)
>   XLogRecPtr  recptr;
>   Sizenbytes;
>  
> + /*
> +  * Cached state across calls.
> +  */

One line?


>   static int  sendFile = -1;
>   static XLogSegNo 

Re: [HACKERS] Timeline following for logical slots

2016-03-14 Thread Craig Ringer
On 15 March 2016 at 07:10, Alvaro Herrera  wrote:

> Petr Jelinek wrote:
> > On 04/03/16 17:08, Craig Ringer wrote:
> > >I'd really appreciate some review of the logic there by people who know
> > >timelines well and preferably know the xlogreader. It's really just one
> > >function and 2/3 comments; the code is simple but the reasoning leading
> > >to it is not.
> >
> > I think this will have to be work for committer at this point. I can't
> find
> > any flaws in the logic myself so I unless somebody protests I am going to
> > mark this as ready for committer.
>
> Great, thanks.  I've studied this to the point where I'm confident that
> it makes sense, so I'm about to push it.
>

Thanks, though I'm happy enough to wait for Andres's input as he requested.


> Also, hopefully you don't have any future callers for the functions I
> marked static (I checked the failover slots series and couldn't see
> anything).  I will push this early tomorrow.
>

I don't see it being overly useful for an extension, so that sounds fine.


> One thing I'm not quite sure about is why this is said to apply to
> "logical slots" and not just to replication slots in general.  In what
> sense does it not apply to physical replication slots?
>

It's only useful for logical slots currently because physical slot
timeline following is done client-side by the walreceiver. When the
walsender hits the end of the timeline it sends CopyDone and returns to
command mode. The client is expected to request the timeline history file,
parse it, work out which timeline to request next and specify that in its
next START_REPLICATION call.
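For anyone following along, roughly how that client-side dance goes for
physical streaming.  This is a simplified sketch, not the walreceiver code:
the names are illustrative, and libpq error handling, the CopyData loop and
the history-file parsing are all elided.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include "libpq-fe.h"

static void
stream_across_timelines(PGconn *conn, uint32_t tli, uint64_t startpos)
{
	for (;;)
	{
		char		cmd[128];
		PGresult   *res;
		uint32_t	next_tli;
		unsigned int hi, lo;

		snprintf(cmd, sizeof(cmd), "START_REPLICATION %X/%X TIMELINE %u",
				 (uint32_t) (startpos >> 32), (uint32_t) startpos, tli);
		res = PQexec(conn, cmd);	/* expect PGRES_COPY_BOTH */
		PQclear(res);

		/* ... consume WAL with PQgetCopyData() until the copy ends ... */

		/*
		 * At the end of the timeline the walsender leaves COPY mode and
		 * returns a one-row result with the next timeline and the LSN at
		 * which it starts.  (The final command-complete result is elided.)
		 */
		res = PQgetResult(conn);
		next_tli = (uint32_t) atoi(PQgetvalue(res, 0, 0));
		sscanf(PQgetvalue(res, 0, 1), "%X/%X", &hi, &lo);
		PQclear(res);

		/* Fetch the history file so we know the exact switch point. */
		snprintf(cmd, sizeof(cmd), "TIMELINE_HISTORY %u", next_tli);
		res = PQexec(conn, cmd);
		/* ... parse the file content returned in the result set ... */
		PQclear(res);

		tli = next_tli;
		startpos = ((uint64_t) hi << 32) | lo;
	}
}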

None of that happens for logical slots. Andres was quite deliberate about
keeping timelines out of the interface for logical slots and logical
replication. I tried to retain that when adding the ability to follow
physical failover, keeping things simple for the client. Given that logical
replication slots are intended to be consumed by more than just a postgres
downstream, it IMO makes sense to keep as much internal complexity hidden
as possible, especially when dealing with something as surprisingly
convoluted as timelines.

This does mean that we can't tell if we replayed past the timeline switch
point on the old master before we switched to the new master, but I don't
think sending a timeline history file is the solution there. I'd rather
expose a walsender command and SQL function to let the client say, after
connecting, "I last replayed from timeline  up to point , if I start
replaying from you will I get a consistent stream?". That can be done
separately to this patch and IMO isn't crucial since clients can currently
work it out, just with more difficulty.
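To illustrate what I mean, the server-side check could look something like
this.  Entirely hypothetical names, since no such function exists today; it
only assumes the existing TimeLineHistoryEntry list from
readTimeLineHistory().

static bool
replay_is_consistent(List *history, TimeLineID client_tli,
					 XLogRecPtr client_lsn)
{
	ListCell   *lc;

	foreach(lc, history)
	{
		TimeLineHistoryEntry *tle = (TimeLineHistoryEntry *) lfirst(lc);

		if (tle->tli != client_tli)
			continue;

		/*
		 * Fine so long as the client never replayed past the point where
		 * this server's history left that timeline; an invalid end means
		 * this is still the current timeline.
		 */
		return XLogRecPtrIsInvalid(tle->end) || client_lsn <= tle->end;
	}

	/* The client's timeline isn't in our history at all. */
	return false;
}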

Maybe physical rep can use the same logic, but I'm really not convinced. It
already knows how to follow timelines client-side and handles that as part
of walreceiver and archive recovery. Because recovery handles following the
timeline switches and walreceiver just fetches the WAL as an alternative to
archive fetches I'm not sure it makes sense; we still have to have all that
logic for archive recovery from restore_command, so doing it differently
for streaming just makes things more complicated for no clear benefit.

I certainly didn't want to address it in the same patch, much like I
intentionally avoided trying to teach pg_xlogdump to be able to follow
timelines. Partly to keep it simpler and focused, partly because it'd
require timeline.c to be made frontend-clean which means no List, no elog,
etc...



>
> (I noticed that we have this new line,
> -   randAccess = true;
> +   randAccess = true;  /* allow readPageTLI to go backwards */
> in which now the comment is an identical copy of an identical line
> elsewhere; however, randAccess doesn't actually affect readPageTLI in
> any way, so AFAICS both comments are now wrong.)
>

Yeah, that's completely bogus. I have a clear image in my head of
randAccess being tested by ValidXLogPageHeader where it sanity
checks state->latestPageTLI, the TLI of the *prior* page, against the TLI
of the most recent page read, to make sure the TLI advances. But nope, I'm
apparently imagining things 'cos it just isn't there; it just tests whether
we read an LSN later than the prior page instead.


> > Well for testing purposes it's quite fine I think. The TAP framework
> > enhancements needed for this are now in and it works correctly against
> > current master.
>
> Certainly the new src/test/recovery/t/006whatever.pl file is going to be
> part of the commit.  What I'm not so sure about is
> src/test/modules/decoding_failover: is there any reason we don't just
> put the new functions in contrib/test_decoding?


Because contrib/test_decoding isn't an extension. It is a sample logical
decoding output plugin.  I didn't want to mess it up by adding an extension
control file, extension script and some extension functions. Sure, you
*can* use the same shared library as an extension and 

Re: [HACKERS] Timeline following for logical slots

2016-03-14 Thread Andres Freund
On 2016-03-14 20:10:58 -0300, Alvaro Herrera wrote:
> Great, thanks.  I've studied this to the point where I'm confident that
> it makes sense, so I'm about to push it.  I didn't change any logic,
> only updated comments here and there, both in the patch and in some
> preexisting code.  I also changed the "List *timelineHistory" to be
> #ifndef FRONTEND, which seems cleaner than having it and insist that it
> couldn't be used.

Could you perhaps wait till tomorrow? I'd like to do a pass over it.

Thanks

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Timeline following for logical slots

2016-03-14 Thread Alvaro Herrera
Petr Jelinek wrote:
> On 04/03/16 17:08, Craig Ringer wrote:
> >I'd really appreciate some review of the logic there by people who know
> >timelines well and preferably know the xlogreader. It's really just one
> >function and 2/3 comments; the code is simple but the reasoning leading
> >to it is not.
> 
> I think this will have to be work for committer at this point. I can't find
> any flaws in the logic myself so I unless somebody protests I am going to
> mark this as ready for committer.

Great, thanks.  I've studied this to the point where I'm confident that
it makes sense, so I'm about to push it.  I didn't change any logic,
only updated comments here and there, both in the patch and in some
preexisting code.  I also changed the "List *timelineHistory" to be
#ifndef FRONTEND, which seems cleaner than having it and insist that it
couldn't be used.

Also, hopefully you don't have any future callers for the functions I
marked static (I checked the failover slots series and couldn't see
anything).  I will push this early tomorrow.

One thing I'm not quite sure about is why this is said to apply to
"logical slots" and not just to replication slots in general.  In what
sense does it not apply to physical replication slots?

(I noticed that we have this new line,
-   randAccess = true;
+   randAccess = true;  /* allow readPageTLI to go backwards */ 
in which now the comment is an identical copy of an identical line
elsewhere; however, randAccess doesn't actually affect readPageTLI in
any way, so AFAICS both comments are now wrong.)

> Well for testing purposes it's quite fine I think. The TAP framework
> enhancements needed for this are now in and it works correctly against
> current master.

Certainly the new src/test/recovery/t/006whatever.pl file is going to be
part of the commit.  What I'm not so sure about is
src/test/modules/decoding_failover: is there any reason we don't just
put the new functions in contrib/test_decoding?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index fcb0872..7b60b8f 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -10,9 +10,11 @@
  *
  * NOTES
  *		See xlogreader.h for more notes on this facility.
+ *
+ *		This file is compiled as both front-end and backend code, so it
+ *		may not use ereport, server-defined static variables, etc.
  *-
  */
-
 #include "postgres.h"
 
 #include "access/transam.h"
@@ -116,6 +118,11 @@ XLogReaderAllocate(XLogPageReadCB pagereadfunc, void *private_data)
 		return NULL;
 	}
 
+#ifndef FRONTEND
+	/* Will be loaded on first read */
+	state->timelineHistory = NIL;
+#endif
+
 	return state;
 }
 
@@ -135,6 +142,10 @@ XLogReaderFree(XLogReaderState *state)
 	pfree(state->errormsg_buf);
 	if (state->readRecordBuf)
 		pfree(state->readRecordBuf);
+#ifndef FRONTEND
+	if (state->timelineHistory)
+		list_free_deep(state->timelineHistory);
+#endif
 	pfree(state->readBuf);
 	pfree(state);
 }
@@ -208,10 +219,11 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 
 	if (RecPtr == InvalidXLogRecPtr)
 	{
+		/* No explicit start point; read the record after the one we just read */
 		RecPtr = state->EndRecPtr;
 
 		if (state->ReadRecPtr == InvalidXLogRecPtr)
-			randAccess = true;
+			randAccess = true;	/* allow readPageTLI to go backwards */
 
 		/*
 		 * RecPtr is pointing to end+1 of the previous WAL record.  If we're
@@ -223,6 +235,8 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 	else
 	{
 		/*
+		 * Caller supplied a position to start at.
+		 *
 		 * In this case, the passed-in record pointer should already be
 		 * pointing to a valid record starting position.
 		 */
@@ -309,8 +323,10 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 		/* XXX: more validation should be done here */
 		if (total_len < SizeOfXLogRecord)
 		{
-			report_invalid_record(state, "invalid record length at %X/%X",
-  (uint32) (RecPtr >> 32), (uint32) RecPtr);
+			report_invalid_record(state,
+		"invalid record length at %X/%X: wanted %lu, got %u",
+  (uint32) (RecPtr >> 32), (uint32) RecPtr,
+  SizeOfXLogRecord, total_len);
 			goto err;
 		}
 		gotheader = false;
@@ -466,9 +482,7 @@ err:
 	 * Invalidate the xlog page we've cached. We might read from a different
 	 * source after failure.
 	 */
-	state->readSegNo = 0;
-	state->readOff = 0;
-	state->readLen = 0;
+	XLogReaderInvalCache(state);
 
 	if (state->errormsg_buf[0] != '\0')
 		*errormsg = state->errormsg_buf;
@@ -580,10 +594,19 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	return readLen;
 
 err:
+	XLogReaderInvalCache(state);
+	return -1;
+}
+
+/*
+ * 

Re: [HACKERS] Timeline following for logical slots

2016-03-09 Thread Petr Jelinek

On 04/03/16 17:08, Craig Ringer wrote:

I'd really appreciate some review of the logic there by people who know
timelines well and preferably know the xlogreader. It's really just one
function and 2/3 comments; the code is simple but the reasoning leading
to it is not.



I think this will have to be work for committer at this point. I can't 
find any flaws in the logic myself so I unless somebody protests I am 
going to mark this as ready for committer.



I've also attached an updated version of the tests posted a few days
ago.  The tests depend on the remaining patches from the TAP
enhancements tree so it's easiest to just get the whole tree from
https://github.com/2ndQuadrant/postgres/tree/dev/logical-decoding-timeline-following
(subject to regular rebases and force pushes, do not use as a base).

The tests now include a test module that exposes some slots guts to SQL
to allow the client to sync slot state from master to replica(s) without
needing failover slots and the use of extra WAL as transport. It's very
much for-testing-only.

The new test module is used by a second round of tests to demonstrate
the practicality of failover of a logical replication client to a
physical replica using a base backup taken by pg_basebackup and without
the presence of failover slots. I won't pretend it's pretty.



Well for testing purposes it's quite fine I think. The TAP framework 
enhancements needed for this are now in and it works correctly against 
current master.


--
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Timeline following for logical slots

2016-03-04 Thread Craig Ringer
On 1 March 2016 at 21:00, Craig Ringer  wrote:

> Hi all
>
> Per discussion on the failover slots thread (
> https://commitfest.postgresql.org/9/488/) I'm splitting timeline
> following for logical slots into its own separate patch.
>
>
I've updated the logical decoding timeline following patch to fix a bug
found as a result of test development related to how Pg renames the last
WAL seg on the old timeline to suffix it with .partial on promotion. The
xlogreader must switch to reading from the newest-timeline version of a
given segment eagerly, for the first page of the segment, since that's the
only one guaranteed to actually exist.

I'd really appreciate some review of the logic there by people who know
timelines well and preferably know the xlogreader. It's really just one
function and 2/3 comments; the code is simple but the reasoning leading to
it is not.


I've also attached an updated version of the tests posted a few days ago.
The tests depend on the remaining patches from the TAP enhancements tree so
it's easiest to just get the whole tree from
https://github.com/2ndQuadrant/postgres/tree/dev/logical-decoding-timeline-following
(subject to regular rebases and force pushes, do not use as a base).

The tests now include a test module that exposes some slots guts to SQL to
allow the client to sync slot state from master to replica(s) without
needing failover slots and the use of extra WAL as transport. It's very
much for-testing-only.

The new test module is used by a second round of tests to demonstrate the
practicality of failover of a logical replication client to a physical
replica using a base backup taken by pg_basebackup and without the presence
of failover slots. I won't pretend it's pretty.

This proves that the approach works barring unforseen showstoppers. It also
proves it's pretty ugly - failover slots provide a much, MUCH simpler and
safer way for clients to achieve this with way less custom code needed by
each client to sync slot state.

I've got a bit of cleanup to do in the test suite and a few more tests to
write for cases where the slot on the replica is allowed to fall behind the
slot on the master but this is mostly waiting on the remaining two TAP test
patches before it can be evaluated for possible push.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From 37bd2e654345af65749ccff6ca73d3afebf67072 Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Thu, 11 Feb 2016 10:44:14 +0800
Subject: [PATCH 1/2] Allow logical slots to follow timeline switches

Make logical replication slots timeline-aware, so replay can
continue from a historical timeline onto the server's current
timeline.

This is required to make failover slots possible and may also
be used by extensions that CreateReplicationSlot on a standby
and replay from that slot once the replica is promoted.

This does NOT add support for replaying from a logical slot on
a standby or for syncing slots to replicas.
---
 src/backend/access/transam/xlogreader.c|  43 -
 src/backend/access/transam/xlogutils.c | 240 +++--
 src/backend/replication/logical/logicalfuncs.c |  38 +++-
 src/include/access/xlogreader.h|  35 +++-
 src/include/access/xlogutils.h |   2 +
 5 files changed, 323 insertions(+), 35 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index fcb0872..5899f44 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -10,6 +10,9 @@
  *
  * NOTES
  *		See xlogreader.h for more notes on this facility.
+ *
+ * 		The xlogreader is compiled as both front-end and backend code so
+ * 		it may not use elog, server-defined static variables, etc.
  *-
  */
 
@@ -116,6 +119,9 @@ XLogReaderAllocate(XLogPageReadCB pagereadfunc, void *private_data)
 		return NULL;
 	}
 
+	/* Will be loaded on first read */
+	state->timelineHistory = NULL;
+
 	return state;
 }
 
@@ -135,6 +141,13 @@ XLogReaderFree(XLogReaderState *state)
 	pfree(state->errormsg_buf);
 	if (state->readRecordBuf)
 		pfree(state->readRecordBuf);
+#ifdef FRONTEND
+	/* FE code doesn't use this and we can't list_free_deep on FE */
+	Assert(state->timelineHistory == NULL);
+#else
+	if (state->timelineHistory)
+		list_free_deep(state->timelineHistory);
+#endif
 	pfree(state->readBuf);
 	pfree(state);
 }
@@ -208,9 +221,11 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 
 	if (RecPtr == InvalidXLogRecPtr)
 	{
+		/* No explicit start point, read the record after the one we just read */
 		RecPtr = state->EndRecPtr;
 
 		if (state->ReadRecPtr == InvalidXLogRecPtr)
+			/* allow readPageTLI to go backward */
 			randAccess = true;
 
 		/*
@@ -223,6 +238,8 @@ XLogReadRecord(XLogReaderState 

Re: [HACKERS] Timeline following for logical slots

2016-03-01 Thread Craig Ringer
Here are the tests.

They depend on https://commitfest.postgresql.org/9/569/#

I've also rebased the git tree for failover slots on top of the tree for
TAP test improvements.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From 199c2623939333bf01d8e5fc4eca00587a9c15dd Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Tue, 1 Mar 2016 20:57:17 +0800
Subject: [PATCH] Tests for logical decoding timeline following

---
 src/test/recovery/Makefile |  2 +
 .../recovery/t/006_logical_decoding_timelines.pl   | 96 ++
 2 files changed, 98 insertions(+)
 create mode 100644 src/test/recovery/t/006_logical_decoding_timelines.pl

diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile
index 330ab2b..ad1d764 100644
--- a/src/test/recovery/Makefile
+++ b/src/test/recovery/Makefile
@@ -9,6 +9,8 @@
 #
 #-
 
+EXTRA_INSTALL=contrib/test_decoding
+
 subdir = src/test/recovery
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
diff --git a/src/test/recovery/t/006_logical_decoding_timelines.pl b/src/test/recovery/t/006_logical_decoding_timelines.pl
new file mode 100644
index 000..e3acaa9
--- /dev/null
+++ b/src/test/recovery/t/006_logical_decoding_timelines.pl
@@ -0,0 +1,96 @@
+# Demonstrate that logical decoding can follow timeline switches.
+#
+# Logical replication slots can follow timeline switches but it's
+# normally not possible to have a logical slot on a replica where
+# promotion and a timeline switch can occur. The only ways
+# we can create that circumstance are:
+#
+# * By doing a filesystem-level copy of the DB, since pg_basebackup
+#   excludes pg_replslot but we can copy it directly; or
+#
+# * by creating a slot directly at the C level on the replica and
+#   advancing it as we go using the low level APIs. It can't be done
+#   from SQL since logical decoding isn't allowed on replicas.
+#
+# This module uses the first approach to show that timeline following
+# on a logical slot works.
+#
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+use RecursiveCopy;
+
+my ( $stdout, $stderr, $ret );
+
+# Initialize master node
+my $node_master = get_new_node('master');
+$node_master->init( allows_streaming => 1, has_archiving => 1 );
+$node_master->append_conf( 'postgresql.conf', "wal_level = 'logical'\n" );
+$node_master->append_conf( 'postgresql.conf', "max_replication_slots = 2\n" );
+$node_master->append_conf( 'postgresql.conf', "max_wal_senders = 2\n" );
+$node_master->append_conf( 'postgresql.conf', "log_min_messages = 'debug2'\n" );
+$node_master->dump_info;
+$node_master->start;
+
+$node_master->psql_check( 'postgres',
+"SELECT pg_create_logical_replication_slot('before_basebackup', 'test_decoding');"
+);
+$node_master->psql_check( 'postgres', "CREATE TABLE decoding(blah text);" );
+$node_master->psql_check( 'postgres',
+"INSERT INTO decoding(blah) VALUES ('afterbb');" );
+$node_master->psql_check( 'postgres', 'CHECKPOINT;' );
+
+my $backup_name = 'b1';
+$node_master->backup_fs_hot($backup_name);
+
+my $node_replica = get_new_node('replica');
+$node_replica->init_from_backup(
+$node_master, $backup_name,
+has_streaming => 1,
+has_restoring => 1
+);
+$node_replica->start;
+
+$node_master->psql_check( 'postgres',
+"SELECT pg_create_logical_replication_slot('after_basebackup', 'test_decoding');"
+);
+$node_master->psql_check( 'postgres',
+"INSERT INTO decoding(blah) VALUES ('afterbb');" );
+$node_master->psql_check( 'postgres', 'CHECKPOINT;' );
+
+# Boom, crash
+$node_master->stop('immediate');
+
+$node_replica->promote;
+$node_replica->poll_query_until( 'postgres',
+"SELECT NOT pg_is_in_recovery();" );
+
+$node_replica->psql_check( 'postgres',
+"INSERT INTO decoding(blah) VALUES ('after failover');" );
+
+# Shouldn't be able to read from slot created after base backup
+$ret = $node_replica->psql_expert(
+'postgres',
+"SELECT * FROM pg_logical_slot_peek_changes('after_basebackup', NULL, NULL);",
+stdout => \$stdout,
+stderr => \$stderr
+);
+is( $ret, 3, 'replaying from after_basebackup slot fails' );
+like(
+$stderr,
+qr/replication slot "after_basebackup" does not exist/,
+'after_basebackup slot missing'
+);
+
+# Should be able to read from slot created before base backup
+diag "Trying to replay from slot before_basebackup, timeout in 30s";
+$node_replica->psql_check(
+'postgres',
+"SELECT * FROM pg_logical_slot_peek_changes('before_basebackup', NULL, NULL);",
+timeout => 30
+);
+ok('replay from before_basebackup successful');
+
+1;
-- 
2.1.0


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Timeline following for logical slots

2016-03-01 Thread Craig Ringer
Hi all

Per discussion on the failover slots thread (
https://commitfest.postgresql.org/9/488/) I'm splitting timeline following
for logical slots into its own separate patch.

The attached patch fixes an issue I found while testing the prior revision:
it would read WAL from WAL segments on the old timeline up until the
timeline switch boundary, but this doesn't work if the last WAL segment on
the timeline has been renamed to append the .partial suffix.

Instead it's necessary to eagerly switch to reading the WAL segment from
the newest timeline on that segment. We'll still be reading WAL records
from the correct timeline since the partial WAL segment from the old
timeline gets copied to a new name on promotion, but we're reading it from
the newest copy of that segment, which is either complete and archived or
is still being written to by the current timeline.

For example, if the old master was on timeline 1 and writing
to 00010003 when it dies and we promote a streaming
replica, the replica will copy 00010003
to 00020003 and append its recovery checkpoint to the copy.
It renames 00010003 to 00010003.partial,
which means the xlogreader won't find it. If we're reading the record at
0/300 then even though 0/300 is on timeline 1, we have to read it
from the segment on timeline 2.

Fun, eh?
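A sketch of the rule, with invented names and none of the real patch's
handling of the server's current insert position; it's only meant to
illustrate the idea above, not to be the patch code:

static TimeLineID
tli_of_segment_to_open(List *timelineHistory, XLogRecPtr wantPage)
{
	XLogSegNo	segno;
	XLogRecPtr	segStart,
				segEnd;
	TimeLineID	best = 0;
	ListCell   *lc;

	XLByteToSeg(wantPage, segno);
	XLogSegNoOffsetToRecPtr(segno, 0, segStart);
	segEnd = segStart + XLogSegSize;

	foreach(lc, timelineHistory)
	{
		TimeLineHistoryEntry *tle = (TimeLineHistoryEntry *) lfirst(lc);

		/*
		 * Prefer the newest timeline whose history overlaps this segment:
		 * the oldest timeline's copy may have been renamed to *.partial on
		 * promotion, while the newest copy is either complete-and-archived
		 * or still being written to.
		 */
		if (tle->begin < segEnd &&
			(XLogRecPtrIsInvalid(tle->end) || tle->end > segStart))
		{
			if (tle->tli > best)
				best = tle->tli;
		}
	}

	return best;
}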

(I'm going to write a README.timelines to document some of this stuff soon,
since it has some pretty hairy corners and some of the code paths are a bit
special.)

I've written some initial TAP tests for timeline following that exploit the
fact that replication slots are preserved on a replica if the replica is
created with a filesystem level copy that includes pg_replslot, rather than
using pg_basebackup. They are not included here because they rely on TAP
support improvements (filesystem backup support, psql enhancements, etc)
that I'll submit separately, but they're how I found the .partial issue.

A subsequent patch can add testing of slot creation and advance on replicas
using a C test extension to prove that this approach can be used to achieve
practical logical failover for extensions.

I think this is ready to go as-is.

I don't want to hold it up waiting for test framework enhancements unless
those can be committed fairly easily because I think we need this in 9.6
and the tests demonstrate that it works when run separately.

See the git tree below for the timeline following patch, the TAP
enhancements and the tests for timeline following.

https://github.com/2ndQuadrant/postgres/tree/dev/logical-decoding-timeline-following

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From 579c1587a5574b790d257ea790fe42a5531bbb31 Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Thu, 11 Feb 2016 10:44:14 +0800
Subject: [PATCH 1/3] Allow logical slots to follow timeline switches

Make logical replication slots timeline-aware, so replay can
continue from a historical timeline onto the server's current
timeline.

This is required to make failover slots possible and may also
be used by extensions that CreateReplicationSlot on a standby
and replay from that slot once the replica is promoted.

This does NOT add support for replaying from a logical slot on
a standby or for syncing slots to replicas.
---
 src/backend/access/transam/xlogreader.c|  43 -
 src/backend/access/transam/xlogutils.c | 240 +++--
 src/backend/replication/logical/logicalfuncs.c |  38 +++-
 src/include/access/xlogreader.h|  35 +++-
 src/include/access/xlogutils.h |   2 +
 5 files changed, 323 insertions(+), 35 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index fcb0872..5899f44 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -10,6 +10,9 @@
  *
  * NOTES
  *		See xlogreader.h for more notes on this facility.
+ *
+ * 		The xlogreader is compiled as both front-end and backend code so
+ * 		it may not use elog, server-defined static variables, etc.
  *-
  */
 
@@ -116,6 +119,9 @@ XLogReaderAllocate(XLogPageReadCB pagereadfunc, void *private_data)
 		return NULL;
 	}
 
+	/* Will be loaded on first read */
+	state->timelineHistory = NULL;
+
 	return state;
 }
 
@@ -135,6 +141,13 @@ XLogReaderFree(XLogReaderState *state)
 	pfree(state->errormsg_buf);
 	if (state->readRecordBuf)
 		pfree(state->readRecordBuf);
+#ifdef FRONTEND
+	/* FE code doesn't use this and we can't list_free_deep on FE */
+	Assert(state->timelineHistory == NULL);
+#else
+	if (state->timelineHistory)
+		list_free_deep(state->timelineHistory);
+#endif
 	pfree(state->readBuf);
 	pfree(state);
 }
@@ -208,9 +221,11 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr