Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-10-26 Thread Michael Paquier
On Thu, Oct 26, 2017 at 3:05 AM, Kyotaro HORIGUCHI
 wrote:
> The largest obstacle to doing that is that the walreceiver is not
> concerned with record internals at all. In other words, it doesn't
> know what a record is. Teaching it that introduces much complexity,
> and the complexity slows down the walreceiver.
>
> In addition to that, this "problem" also occurs on replication
> without a slot. The latest patch helps that case as well.

That's why replication slots were introduced to begin with. The
WAL receiver gives no guarantee that a segment will be retained
based on the beginning of a record. It's unfortunate that the WAL
receiver does not properly track the restart LSN and instead just
uses the flush LSN. I am beginning to think that a new message type
used to report the restart LSN when a replication slot is in use
would be a better and more stable solution. I haven't looked at the
implementation details yet, though.
-- 
Michael
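
For illustration, a minimal sketch (not code from any patch in this
thread; the helper name is hypothetical) of the page-level check that
any such receiver-side tracking would rest on: a page whose header
carries XLP_FIRST_IS_CONTRECORD starts with the continuation of an
earlier record, so a reported restart position must be held back.

#include "postgres.h"
#include "access/xlog_internal.h"	/* XLogPageHeaderData, XLP_FIRST_IS_CONTRECORD */

/*
 * Hypothetical helper: decide whether "candidate" is safe to report as a
 * restart position.  If the page it points into begins with the
 * continuation of a record started earlier, keep the previous position so
 * that the segment holding the record's start is not recycled.
 */
static XLogRecPtr
safe_restart_position(XLogRecPtr candidate, XLogRecPtr previous,
					  XLogPageHeaderData *page_header)
{
	if (page_header->xlp_info & XLP_FIRST_IS_CONTRECORD)
		return previous;		/* page starts mid-record: hold back */

	return candidate;			/* page starts a fresh record: safe to advance */
}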




Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-10-26 Thread Kyotaro HORIGUCHI
Hello. Thank you for looking at this.

At Mon, 16 Oct 2017 17:58:03 +0900, Michael Paquier  
wrote in 
> On Thu, Sep 7, 2017 at 12:33 PM, Kyotaro HORIGUCHI
>  wrote:
> > At Wed, 6 Sep 2017 12:23:53 -0700, Andres Freund  wrote 
> > in <20170906192353.ufp2dq7wm5fd6...@alap3.anarazel.de>
> >> I'm not following. All we need to use is the beginning of the relevant
> >> records, that's easy enough to keep track of. We don't need to read the
> >> WAL or anything.
> >
> > The beginning is already tracked, and there is nothing more to do.
> 
> I have finally allocated time to look at your newly-proposed patch,
> sorry for the time it took. Patch 0002 forgot to include sys/stat.h to
> allow the debugging tool to compile :)
> 
> > The first *problem* was that WaitForWALToBecomeAvailable requests the
> > beginning of a record, which is not on the page the function has
> > been told to fetch. While tliRecPtr is still required to determine the
> > TLI to request, the function should request RecPtr to be streamed.
> 
> [...]
> 
> > What remains is to let XLogPageRead retry other sources
> > immediately. To do this I made ValidXLogPageHeader@xlogreader.c
> > public (and renamed it to XLogReaderValidatePageHeader).
> >
> > The attached patch fixes the problem and passes the recovery
> > tests. However, a test for this problem is not added yet. It needs
> > to go to the last page in a segment, put a record that continues
> > into the next segment, then kill the standby after it has received
> > the previous segment but before it has received the whole record.
> 
> +typedef struct XLogPageHeaderData *XLogPageHeader;
> [...]
> +/* Validate a page */
> +extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
> +   XLogRecPtr recptr, XLogPageHeader hdr);
> Instead of exposing XLogPageHeaderData, I would recommend just using
> char* and remove this declaration. The comment on top of
> XLogReaderValidatePageHeader needs to make clear what caller should
> provide.

Seems reasonable. I added several lines to the comment for the
function.
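
For reference, a sketch of what the char*-based declaration suggested
above could look like (illustrative only; the comment wording and the
parameter name are assumptions, not the committed form):

/*
 * Validate the page header stored at the beginning of "page", which the
 * caller has read from WAL position "recptr".  At least the short page
 * header (SizeOfXLogShortPHD bytes) must be present in the buffer.
 */
extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
										 XLogRecPtr recptr, char *page);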

> +if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr,
> +  (XLogPageHeader) readBuf))
> +goto next_record_is_invalid;
> +
> [...]
> -ptr = tliRecPtr;
> +ptr = RecPtr;
>  tli = tliOfPointInHistory(tliRecPtr, expectedTLEs);
> 
>  if (curFileTLI > 0 && tli < curFileTLI)
> The core of the patch is here (the patch has no comment so it is hard
> to understand what's the point of what is being done), and if I

Hmm, sorry. Added a brief comment there.

> understand that correctly, you allow the receiver to fetch the
> portions of a record spanning multiple segments from different
> sources, and I am not sure that we'd want to break that promise.

We are allowing consecutive records at a segment boundary to come
from different sources within the same series of xlog records. A
continuation record never spans two TLIs, but I might be
missing something here. (I found that an error message shows an
incorrect record pointer. The error message still seems
useful.)

> Shouldn't we instead have the receiver side track the beginning of the
> record and send that position for the physical slot's restart_lsn?

The largest obstacle to doing that is that the walreceiver is not
concerned with record internals at all. In other words, it doesn't
know what a record is. Teaching it that introduces much complexity,
and the complexity slows down the walreceiver.
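
For context, what the standby currently reports back is just a set of
byte positions (cf. ProcessStandbyReplyMessage on the sender side);
nothing in it describes record boundaries. A simplified sketch of those
fields, not the authoritative wire format:

#include "access/xlogdefs.h"	/* XLogRecPtr */

/* Sketch of the positions carried by a standby reply message. */
typedef struct StandbyReplySketch
{
	XLogRecPtr	write;			/* last WAL byte + 1 written by the walreceiver */
	XLogRecPtr	flush;			/* last WAL byte + 1 flushed to disk */
	XLogRecPtr	apply;			/* last WAL byte + 1 replayed by the startup process */
} StandbyReplySketch;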

In addition to that, this "problem" also occurs on replication
without a slot. The latest patch helps that case as well.

> This way the receiver would retain WAL segments from the real
> beginning of a record. restart_lsn for replication slots is currently
> set when processing the standby message in ProcessStandbyReplyMessage()
> using the flush LSN, so a more correct value should be provided using
> that. Andres, what's your take on the matter?


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From b23c1d69ad86fcbb992cb21c604f587d82441001 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 7 Sep 2017 12:14:55 +0900
Subject: [PATCH 1/2] Allow switching WAL source in the midst of a record.

The current recovery machinery assumes that the whole of a record is
available from a single source. This prevents a standby from restarting
under a certain condition. This patch allows switching the source while
reading a series of continuation records.
---
 src/backend/access/transam/xlog.c   | 14 --
 src/backend/access/transam/xlogreader.c | 28 ++--
 src/include/access/xlogreader.h |  4 
 3 files changed, 34 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/xlog.c 

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-10-16 Thread Michael Paquier
On Thu, Sep 7, 2017 at 12:33 PM, Kyotaro HORIGUCHI
 wrote:
> At Wed, 6 Sep 2017 12:23:53 -0700, Andres Freund  wrote 
> in <20170906192353.ufp2dq7wm5fd6...@alap3.anarazel.de>
>> I'm not following. All we need to use is the beginning of the relevant
>> records, that's easy enough to keep track of. We don't need to read the
>> WAL or anything.
>
> The beginning is already tracked, and there is nothing more to do.

I have finally allocated time to look at your newly-proposed patch,
sorry for the time it took. Patch 0002 forgot to include sys/stat.h to
allow the debugging tool to compile :)

> The first *problem* was that WaitForWALToBecomeAvailable requests the
> beginning of a record, which is not on the page the function has
> been told to fetch. While tliRecPtr is still required to determine the
> TLI to request, the function should request RecPtr to be streamed.

[...]

> What remains is to let XLogPageRead retry other sources
> immediately. To do this I made ValidXLogPageHeader@xlogreader.c
> public (and renamed it to XLogReaderValidatePageHeader).
>
> The attached patch fixes the problem and passes the recovery
> tests. However, a test for this problem is not added yet. It needs
> to go to the last page in a segment, put a record that continues
> into the next segment, then kill the standby after it has received
> the previous segment but before it has received the whole record.

+typedef struct XLogPageHeaderData *XLogPageHeader;
[...]
+/* Validate a page */
+extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
+   XLogRecPtr recptr, XLogPageHeader hdr);
Instead of exposing XLogPageHeaderData, I would recommend just using
char* and remove this declaration. The comment on top of
XLogReaderValidatePageHeader needs to make clear what caller should
provide.

+if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr,
+  (XLogPageHeader) readBuf))
+goto next_record_is_invalid;
+
[...]
-ptr = tliRecPtr;
+ptr = RecPtr;
 tli = tliOfPointInHistory(tliRecPtr, expectedTLEs);

 if (curFileTLI > 0 && tli < curFileTLI)
The core of the patch is here (the patch has no comment so it is hard
to understand what's the point of what is being done), and if I
understand that correctly, you allow the receiver to fetch the
portions of a record spanning multiple segments from different
sources, and I am not sure that we'd want to break that promise.
Shouldn't we instead have the receiver side track the beginning of the
record and send that position for the physical slot's restart_lsn?
This way the receiver would retain WAL segments from the real
beginning of a record. restart_lsn for replication slots is currently
set when processing the standby message in ProcessStandbyReplyMessage()
using the flush LSN, so a more correct value should be provided using
that. Andres, what's your take on the matter?
-- 
Michael




Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-09-06 Thread Kyotaro HORIGUCHI
Hello,

At Wed, 6 Sep 2017 12:23:53 -0700, Andres Freund  wrote in 
<20170906192353.ufp2dq7wm5fd6...@alap3.anarazel.de>
> On 2017-09-06 17:36:02 +0900, Kyotaro HORIGUCHI wrote:
> > The problem is that the current ReadRecord needs the first one of
> > a series of continuation records from the same source with the
> > other part, the master in the case.
> 
> What's the problem with that?  We can easily keep track of the beginning
> of a record, and only confirm the address before that.

After a failure while reading a record locally, ReadRecord tries
streaming to read from the beginning of the record, which is not on
the master, then retries locally, and so on. This loops forever.

> > A (or the) solution closed within the standby side is to allow
> > reading a series of continuation records from multiple sources.
> 
> I'm not following. All we need to use is the beginning of the relevant
> records, that's easy enough to keep track of. We don't need to read the
> WAL or anything.

The beginning is already tracked, and there is nothing more to do.

I reconsidered that way and found that it doesn't need such
destructive refactoring.

The first *problem* was that WaitForWALToBecomeAvailable requests the
beginning of a record, which is not on the page the function has
been told to fetch. While tliRecPtr is still required to determine the
TLI to request, the function should request RecPtr to be streamed.

What remains is to let XLogPageRead retry other sources
immediately. To do this I made ValidXLogPageHeader@xlogreader.c
public (and renamed it to XLogReaderValidatePageHeader).

The attached patch fixes the problem and passes the recovery
tests. However, a test for this problem is not added yet. It needs
to go to the last page in a segment, put a record that continues
into the next segment, then kill the standby after it has received
the previous segment but before it has received the whole record.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 8932a390bd3d1acfe5722bc62f42fc7e381ca617 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 7 Sep 2017 12:14:55 +0900
Subject: [PATCH 1/2] Allow switching WAL source in the midst of a record.

The current recovery machinery assumes that the whole of a record is
available from a single source. This prevents a standby from restarting
under a certain condition. This patch allows switching the source while
reading a series of continuation records.
---
 src/backend/access/transam/xlog.c   |  7 ++-
 src/backend/access/transam/xlogreader.c | 12 +---
 src/include/access/xlogreader.h |  5 +
 3 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index df4843f..eef3a97 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -11566,6 +11566,11 @@ retry:
 	Assert(reqLen <= readLen);
 
 	*readTLI = curFileTLI;
+
+	if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr,
+	  (XLogPageHeader) readBuf))
+		goto next_record_is_invalid;
+
 	return readLen;
 
 next_record_is_invalid:
@@ -11700,7 +11705,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		}
 		else
 		{
-			ptr = tliRecPtr;
+			ptr = RecPtr;
 			tli = tliOfPointInHistory(tliRecPtr, expectedTLEs);
 
 			if (curFileTLI > 0 && tli < curFileTLI)
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 0781a7b..aa05e3f 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -27,8 +27,6 @@
 
 static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
 
-static bool ValidXLogPageHeader(XLogReaderState *state, XLogRecPtr recptr,
-	XLogPageHeader hdr);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 	  XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
 static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
@@ -545,7 +543,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 
 		hdr = (XLogPageHeader) state->readBuf;
 
-		if (!ValidXLogPageHeader(state, targetSegmentPtr, hdr))
+		if (!XLogReaderValidatePageHeader(state, targetSegmentPtr, hdr))
 			goto err;
 	}
 
@@ -582,7 +580,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	/*
 	 * Now that we know we have the full header, validate it.
 	 */
-	if (!ValidXLogPageHeader(state, pageptr, hdr))
+	if (!XLogReaderValidatePageHeader(state, pageptr, hdr))
 		goto err;
 
 	/* update read state information */
@@ -709,9 +707,9 @@ ValidXLogRecord(XLogReaderState *state, XLogRecord *record, XLogRecPtr recptr)
 /*
  * Validate a page header
  */
-static bool
-ValidXLogPageHeader(XLogReaderState *state, XLogRecPtr recptr,
-	XLogPageHeader hdr)
+bool
+XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
+			 XLogPageHeader hdr)
 {
 	XLogRecPtr	recaddr;
 	

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-09-06 Thread Andres Freund
On 2017-09-06 17:36:02 +0900, Kyotaro HORIGUCHI wrote:
> Hi,
> 
> At Mon, 4 Sep 2017 17:17:19 +0900, Michael Paquier 
>  wrote in 
> 

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-09-06 Thread Kyotaro HORIGUCHI
Hi,

At Mon, 4 Sep 2017 17:17:19 +0900, Michael Paquier  
wrote in 

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-09-04 Thread Michael Paquier
On Mon, Sep 4, 2017 at 4:04 PM, Andres Freund  wrote:
> I've not read through the thread, but this seems like the wrong approach
> to me. The receiving side should use a correct value, instead of putting
> this complexity on the sender's side.

Yes, I agree with that. To be honest, the current patch gives me a bad
feeling with the way it does things.
-- 
Michael




Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-09-04 Thread Andres Freund
Hi,

On 2017-09-04 15:51:51 +0900, Kyotaro HORIGUCHI wrote:
>   SpinLockAcquire(&walsnd->mutex);
> + oldFlushPtr = walsnd->flush;
>   walsnd->write = writePtr;
>   walsnd->flush = flushPtr;
>   walsnd->apply = applyPtr;
> @@ -1821,7 +1836,93 @@ ProcessStandbyReplyMessage(void)
>   if (SlotIsLogical(MyReplicationSlot))
>   LogicalConfirmReceivedLocation(flushPtr);
>   else
> - PhysicalConfirmReceivedLocation(flushPtr);
> + {
> +     /*
> +      * Recovery on standby requires that a continuation record is
> +      * available from single WAL source. For the reason, physical
> +      * replication slot should stay in the first segment of the
> +      * multiple segments that a continued record is spanning over.
> +      * Since we look pages and don't look into individual record
> +      * here, restartLSN may stay a bit too behind but it doesn't
> +      * matter.
> +      *
> +      * Since the objective is avoiding to remove required segments,
> +      * checking at the beginning of every segment is enough. But once
> +      * restartLSN goes behind, check every page for quick restoration.
> +      *
> +      * restartLSN has a valid value only when it is behind flushPtr.
> +      */
> +     bool        check_contrecords = false;
> +
> +     if (restartLSN != InvalidXLogRecPtr)
> +     {
> +         /*
> +          * We're retarding restartLSN, check continuation records
> +          * at every page boundary for quick restoration.
> +          */
> +         if (restartLSN / XLOG_BLCKSZ != flushPtr / XLOG_BLCKSZ)
> +             check_contrecords = true;
> +     }
> +     else if (oldFlushPtr != InvalidXLogRecPtr)
> +     {
> +         /*
> +          * We're not retarding restartLSN, check continuation records
> +          * only at segment boundaries. No need to check if
> +          */
> +         if (oldFlushPtr / XLOG_SEG_SIZE != flushPtr / XLOG_SEG_SIZE)
> +             check_contrecords = true;
> +     }
> +
> +     if (check_contrecords)
> +     {
> +         XLogRecPtr rp;
> +
> +         if (restartLSN == InvalidXLogRecPtr)
> +             restartLSN = oldFlushPtr;
> +
> +         rp = restartLSN - (restartLSN % XLOG_BLCKSZ);
> +
> +         /*
> +          * We may have let the record at flushPtr be sent, so it's
> +          * worth looking
> +          */
> +         while (rp <= flushPtr)
> +         {
> +             XLogPageHeaderData header;
> +
> +             /*
> +              * If the page header is not available for now, don't move
> +              * restartLSN forward. We can read it by the next chance.
> +              */
> +             if (sentPtr - rp >= sizeof(XLogPageHeaderData))
> +             {
> +                 bool found;
> +                 /*
> +                  * Fetch the page header of the next page. Move
> +                  * restartLSN forward only if it is not a continuation
> +                  * page.
> +                  */
> +                 found = XLogRead((char *) &header, rp,
> +                                  sizeof(XLogPageHeaderData), true);
> +                 if (found &&
> +                     (header.xlp_info & XLP_FIRST_IS_CONTRECORD) == 0)
> +                     restartLSN = rp;
> +             }
> +             rp += XLOG_BLCKSZ;
> +         }
> +
> +         /*
> +          * If restartLSN is on the same page with flushPtr, it

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-09-04 Thread Kyotaro HORIGUCHI
Hello,

Thank you for reviewing this.

At Mon, 28 Aug 2017 20:14:54 +0900, Michael Paquier  
wrote in 
> On Mon, Aug 28, 2017 at 8:02 PM, Kyotaro HORIGUCHI
>  wrote:
> > The first patch (0001-) fixes this problem, preventing the
> > problematic state of WAL segments by retarding restart LSN of a
> > physical replication slot in a certain condition.
> 
> FWIW, I have this patch marked on my list of things to look at, so you
> can count me as a reviewer. There are also some approaches that I
> would like to test because I rely on replication slots for some
> infrastructure. Still...

This test patch modifies the code for ease of reproduction. The window
for this bug to occur is from receiving the first record of a segment
until, in most cases, receiving the second record, or sometimes after
receiving several records. Would intentionally emitting a record that
spans two or more segments work?
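
One conceivable way for a test module to provoke such a record, sketched
under the assumption that LogLogicalMessage() (available since 9.6) may
be called from test code; the helper name is hypothetical. Emitted close
to a segment boundary, an oversized record like this has to continue
into the following segment:

#include "postgres.h"
#include "replication/message.h"	/* LogLogicalMessage() */

/*
 * Hypothetical test helper: write a WAL record much larger than one WAL
 * page.  When the insert position is near the end of a segment, the
 * record continues into the next segment, whose first page then carries
 * XLP_FIRST_IS_CONTRECORD.
 */
static XLogRecPtr
emit_oversized_record(void)
{
	Size		len = 4 * XLOG_BLCKSZ;	/* several pages of payload */
	char	   *payload = palloc0(len);
	XLogRecPtr	endpos;

	/* A non-transactional logical message is an easy oversized record. */
	endpos = LogLogicalMessage("contrecord-test", payload, len, false);

	pfree(payload);
	return endpos;
}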


> +if (oldFlushPtr != InvalidXLogRecPtr &&
> +(restartLSN == InvalidXLogRecPtr ?
> + oldFlushPtr / XLOG_SEG_SIZE != flushPtr / XLOG_SEG_SIZE :
> + restartLSN / XLOG_BLCKSZ != flushPtr / XLOG_BLCKSZ))
> I find such code patterns not readable.

Yeah, I agree. I rewrote that part, and the result in the attached
patch is far cleaner than the previous blob.
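
The rewritten check is visible in the walsender.c hunk quoted in the
reply of 2017-09-04 elsewhere in this archive; schematically (function
name hypothetical, logic as in that patch) it becomes an explicit
if/else instead of a nested conditional expression:

#include "postgres.h"
#include "access/xlogdefs.h"

/*
 * Decide whether the page headers between the slot's held-back position
 * and the new flush position need to be inspected for continuation
 * records.
 */
static bool
need_contrecord_check(XLogRecPtr restartLSN, XLogRecPtr oldFlushPtr,
					  XLogRecPtr flushPtr)
{
	if (restartLSN != InvalidXLogRecPtr)
	{
		/* Already holding restart LSN back: check at every page boundary. */
		return restartLSN / XLOG_BLCKSZ != flushPtr / XLOG_BLCKSZ;
	}
	else if (oldFlushPtr != InvalidXLogRecPtr)
	{
		/* Not holding back yet: checking at segment boundaries is enough. */
		return oldFlushPtr / XLOG_SEG_SIZE != flushPtr / XLOG_SEG_SIZE;
	}

	return false;
}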

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From c15473998467a17ac6070ed68405a14ca0f98ae6 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 1 Feb 2017 16:07:22 +0900
Subject: [PATCH 1/2] Retard restart LSN of a slot when a segment starts with a
 contrecord.

A physical-replication standby can stop just at the boundary of WAL
segments. The restart_lsn of a slot on the master can be assumed to be
at the same location. In that case, the last segment on the master will
be removed after some checkpoints. If the last record of the last
replicated segment continues into the next segment, the continuation
record exists only on the master. The standby then cannot start because
the split record is not available from any single source.

This patch detains restart_lsn in the last segment when the first page
of the next segment is a continuation record.
---
 src/backend/replication/walsender.c | 123 ++--
 1 file changed, 116 insertions(+), 7 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index db346e6..30e222c 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -217,6 +217,13 @@ static struct
 	WalTimeSample last_read[NUM_SYNC_REP_WAIT_MODE];
 }			LagTracker;
 
+/*
+ * This variable corresponds to restart_lsn in pg_replication_slots for a
+ * physical slot. This has a valid value only when it differs from the current
+ * flush pointer.
+ */
+static XLogRecPtr	   restartLSN = InvalidXLogRecPtr;
+
 /* Signal handlers */
 static void WalSndLastCycleHandler(SIGNAL_ARGS);
 
@@ -251,7 +258,7 @@ static void LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time);
 static TimeOffset LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now);
 static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
-static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
+static bool XLogRead(char *buf, XLogRecPtr startptr, Size count, bool noutfoundok);
 
 
 /* Initialize walsender process before entering the main command loop */
@@ -546,6 +553,9 @@ StartReplication(StartReplicationCmd *cmd)
 			ereport(ERROR,
 	(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 	 (errmsg("cannot use a logical replication slot for physical replication";
+
+		/* Restore restartLSN from replication slot */
+		restartLSN = MyReplicationSlot->data.restart_lsn;
 	}
 
 	/*
@@ -561,6 +571,10 @@ StartReplication(StartReplicationCmd *cmd)
 	else
 		FlushPtr = GetFlushRecPtr();
 
+	/* Set InvalidXLogRecPtr if catching up */
+	if (restartLSN == FlushPtr)
+		restartLSN = InvalidXLogRecPtr;
+
 	if (cmd->timeline != 0)
 	{
 		XLogRecPtr	switchpoint;
@@ -770,7 +784,7 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 		count = flushptr - targetPagePtr;	/* part of the page available */
 
 	/* now actually read the data, we know it's there */
-	XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ);
+	XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ, false);
 
 	return count;
 }
@@ -1738,7 +1752,7 @@ static void
 ProcessStandbyReplyMessage(void)
 {
 	XLogRecPtr	writePtr,
-flushPtr,
+flushPtr, oldFlushPtr,
 applyPtr;
 	bool		replyRequested;
 	TimeOffset	writeLag,
@@ -1798,6 +1812,7 @@ ProcessStandbyReplyMessage(void)
 		WalSnd	   *walsnd = MyWalSnd;
 
 		SpinLockAcquire(&walsnd->mutex);
+		oldFlushPtr = walsnd->flush;
 		walsnd->write = writePtr;
 		walsnd->flush = flushPtr;
 		walsnd->apply = applyPtr;
@@ -1821,7 +1836,93 @@ 

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-08-28 Thread Michael Paquier
On Mon, Aug 28, 2017 at 8:02 PM, Kyotaro HORIGUCHI
 wrote:
> The first patch (0001-) fixes this problem, preventing the
> problematic state of WAL segments by retarding restart LSN of a
> physical replication slot in a certain condition.

FWIW, I have this patch marked on my list of things to look at, so you
can count me as a reviewer. There are also some approaches that I
would like to test because I rely on replication slots for some
infrastructure. Still...

+if (oldFlushPtr != InvalidXLogRecPtr &&
+(restartLSN == InvalidXLogRecPtr ?
+ oldFlushPtr / XLOG_SEG_SIZE != flushPtr / XLOG_SEG_SIZE :
+ restartLSN / XLOG_BLCKSZ != flushPtr / XLOG_BLCKSZ))
I find such code patterns not readable.
-- 
Michael




Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-08-28 Thread Kyotaro HORIGUCHI
Hello,

This problem still occurs on the master.
I rebased this to the current master.

At Mon, 3 Apr 2017 08:38:47 +0900, Michael Paquier  
wrote in 
> On Mon, Apr 3, 2017 at 7:19 AM, Venkata B Nagothi  wrote:
> > As we are already past the commitfest, I am not sure what I should change
> > the patch status to?
> 
> The commit fest finishes on the 7th of April. Even with the deadline
> passed, there is nothing preventing work on bug fixes. So this item
> ought to be moved to the next CF with the same category.

The steps to reproduce the problem follow.

- Apply the second patch (0002-) attached and recompile. It
  effectively reproduces the problematic state of the database.

- M(aster): initdb the master with wal_keep_segments = 0
(default), log_min_messages = debug2
- M: Create a physical repslot.
- S(tandby): Setup a standby database.
- S: Edit recovery.conf to use the replication slot above then
 start it.
- S: touch /tmp/hoge
- M: Run pgbench ...
- S: After a while, the standby stops.
  > LOG:   STOP THE SERVER

- M: Stop pgbench.
- M: Do 'checkpoint;' twice.
- S: rm /tmp/hoge
- S: Fails to catch up with the following error.

  > FATAL:  could not receive data from WAL stream: ERROR:  requested WAL 
segment 0001002B has already been removed


The first patch (0001-) fixes this problem, preventing the
problematic state of WAL segments by retarding restart LSN of a
physical replication slot in a certain condition.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 3813599b74299f1da8d0567ed90542c5f35ed48b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 1 Feb 2017 16:07:22 +0900
Subject: [PATCH 1/2] Retard restart LSN of a slot when a segment starts with a
 contrecord.

A physical-replication standby can stop just at the boundary of WAL
segments. The restart_lsn of a slot on the master can be assumed to be
at the same location. In that case, the last segment on the master will
be removed after some checkpoints. If the last record of the last
replicated segment continues into the next segment, the continuation
record exists only on the master. The standby then cannot start because
the split record is not available from any single source.

This patch detains restart_lsn in the last segment when the first page
of the next segment is a continuation record.
---
 src/backend/replication/walsender.c | 105 +---
 1 file changed, 98 insertions(+), 7 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 03e1cf4..30c80af 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -217,6 +217,13 @@ static struct
 	WalTimeSample last_read[NUM_SYNC_REP_WAIT_MODE];
 }			LagTracker;
 
+/*
+ * This variable corresponds to restart_lsn in pg_replication_slots for a
+ * physical slot. This has a valid value only when it differs from the current
+ * flush pointer.
+ */
+static XLogRecPtr	   restartLSN = InvalidXLogRecPtr;
+
 /* Signal handlers */
 static void WalSndLastCycleHandler(SIGNAL_ARGS);
 
@@ -251,7 +258,7 @@ static void LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time);
 static TimeOffset LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now);
 static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
-static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
+static bool XLogRead(char *buf, XLogRecPtr startptr, Size count, bool noutfoundok);
 
 
 /* Initialize walsender process before entering the main command loop */
@@ -546,6 +553,9 @@ StartReplication(StartReplicationCmd *cmd)
 			ereport(ERROR,
 	(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 	 (errmsg("cannot use a logical replication slot for physical replication";
+
+		/* Restore restartLSN from replication slot */
+		restartLSN = MyReplicationSlot->data.restart_lsn;
 	}
 
 	/*
@@ -561,6 +571,10 @@ StartReplication(StartReplicationCmd *cmd)
 	else
 		FlushPtr = GetFlushRecPtr();
 
+	/* Set InvalidXLogRecPtr if catching up */
+	if (restartLSN == FlushPtr)
+		restartLSN = InvalidXLogRecPtr;
+
 	if (cmd->timeline != 0)
 	{
 		XLogRecPtr	switchpoint;
@@ -770,7 +784,7 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 		count = flushptr - targetPagePtr;	/* part of the page available */
 
 	/* now actually read the data, we know it's there */
-	XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ);
+	XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ, false);
 
 	return count;
 }
@@ -1738,7 +1752,7 @@ static void
 ProcessStandbyReplyMessage(void)
 {
 	XLogRecPtr	writePtr,
-flushPtr,
+flushPtr, oldFlushPtr,
 applyPtr;
 	bool		replyRequested;
 	TimeOffset	writeLag,
@@ -1798,6 +1812,7 @@ ProcessStandbyReplyMessage(void)
 		WalSnd	   

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-04-02 Thread Michael Paquier
On Mon, Apr 3, 2017 at 7:19 AM, Venkata B Nagothi  wrote:
> As we are already past the commitfest, I am not sure what I should change
> the patch status to?

The commit fest finishes on the 7th of April. Even with the deadline
passed, there is nothing preventing work on bug fixes. So this item
ought to be moved to the next CF with the same category.
-- 
Michael




Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-04-02 Thread Venkata B Nagothi
On Fri, Mar 31, 2017 at 4:05 PM, Kyotaro HORIGUCHI <
horiguchi.kyot...@lab.ntt.co.jp> wrote:

> Thank you for having a look at this.
>
> # I removed -bugs in CC:.
>
> At Fri, 31 Mar 2017 13:40:00 +1100, Venkata B Nagothi 
> wrote in  gmail.com>
> > On Fri, Mar 17, 2017 at 6:48 PM, Kyotaro HORIGUCHI <
> > horiguchi.kyot...@lab.ntt.co.jp> wrote:
> >
> > > Hello,
> > >
> > > At Mon, 13 Mar 2017 11:06:00 +1100, Venkata B Nagothi <
> nag1...@gmail.com>
> > > wrote in  > > gmail.com>
> > > > On Tue, Jan 17, 2017 at 9:36 PM, Kyotaro HORIGUCHI <
> > > > horiguchi.kyot...@lab.ntt.co.jp> wrote:
> > > > > I managed to reproduce this. A little tweak as the first patch
> > > > > lets the standby to suicide as soon as walreceiver sees a
> > > > > contrecord at the beginning of a segment.
> > > > >
> > > > > - M(aster): createdb as a master with wal_keep_segments = 0
> > > > > (default), min_log_messages = debug2
> > > > > - M: Create a physical repslot.
> > > > > - S(tandby): Setup a standby database.
> > > > > - S: Edit recovery.conf to use the replication slot above then
> > > > >  start it.
> > > > > - S: touch /tmp/hoge
> > > > > - M: Run pgbench ...
> > > > > - S: After a while, the standby stops.
> > > > >   > LOG:   STOP THE SERVER
> > > > >
> > > > > - M: Stop pgbench.
> > > > > - M: Do 'checkpoint;' twice.
> > > > > - S: rm /tmp/hoge
> > > > > - S: Fails to catch up with the following error.
> > > > >
> > > > >   > FATAL:  could not receive data from WAL stream: ERROR:
> requested
> > > WAL
> > > > > segment 0001002B has already been removed
> > > > >
> > > > >
> > > > I have been testing / reviewing the latest patch
> > > > "0001-Fix-a-bug-of-physical-replication-slot.patch" and i think, i
> might
> > > > need some more clarification on this.
> > > >
> > > > Before applying the patch, I tried re-producing the above error -
> > > >
> > > > - I had master->standby in streaming replication
> > > > - Took the backup of master
> > > >- with a low max_wal_size and wal_keep_segments = 0
> > > > - Configured standby with recovery.conf
> > > > - Created replication slot on master
> > > > - Configured the replication slot on standby and started the standby
> > >
> > > I suppose the "configure" means primary_slot_name in recovery.conf.
> > >
> > > > - I got the below error
> > > >
> > > >>> 2017-03-10 11:58:15.704 AEDT [478] LOG:  invalid record length
> at
> > > > 0/F2000140: wanted 24, got 0
> > > >>> 2017-03-10 11:58:15.706 AEDT [481] LOG:  started streaming WAL
> from
> > > > primary at 0/F200 on timeline 1
> > > >>> 2017-03-10 11:58:15.706 AEDT [481] FATAL:  could not receive
> data
> > > > from WAL stream: ERROR:  requested WAL segment
> 000100F2
> > > has
> > > > already been removed
> > >
> > > Maybe you created the master slot with non-reserve (default) mode
> > > and put a pause of some minutes after making the backup and before
> > > starting the standby. In that case the master slot doesn't keep
> > > WAL segments unless the standby connects, so a couple of
> > > checkpoints can blow away the first segment required by the
> > > standby. This is quite reasonable behavior. The following steps
> > > make this more certain.
> > >
> > > > - Took the backup of master
> > > >- with a low max_wal_size = 2 and wal_keep_segments = 0
> > > > - Configured standby with recovery.conf
> > > > - Created replication slot on master
> > > + - SELECT pg_switch_wal(); on master twice.
> > > + - checkpoint; on master twice.
> > > > - Configured the replication slot on standby and started the standby
> > >
> > > Creating the slot with the following command will save it.
> > >
> > > =# select pg_create_physical_replication_slot('s1', true);
> > >
> >
> > I did a test again by applying the patch, and I am not sure if the patch
> > is doing the right thing.
>
> This is a bit difficult to make it happen.
>

Yes, it seems to be quite difficult to reproduce.


>
> > Here is test case -
> >
> > - I ran pgbench
> > - I took the backup of the master first
> >
> > - Below are the WALs on master after the stop backup -
> >
> > postgres=# select pg_stop_backup();
> >
> > NOTICE:  WAL archiving is not enabled; you must ensure that all required
> > WAL segments are copied through other means to complete the backup
> >  pg_stop_backup
> > 
> >  0/8C000130
> > (1 row)
> >
> > postgres=# \q
> > [dba@buildhost data]$ ls -ltrh pgdata-10dev-prsb-1/pg_wal/
> > total 65M
> > drwx--. 2 dba dba 4.0K Mar 31 09:36 archive_status
> > -rw---. 1 dba dba  16M Mar 31 11:09 0001008E
> > -rw---. 1 dba dba  16M Mar 31 11:17 0001008F
> > -rw---. 1 dba dba  16M Mar 31 11:18 0001008C
> > -rw---. 1 dba dba  16M Mar 31 11:18 0001008D
> >
> > - After 

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-03-30 Thread Kyotaro HORIGUCHI
Thank you for having a look at this.

# I removed -bugs in CC:.

At Fri, 31 Mar 2017 13:40:00 +1100, Venkata B Nagothi  wrote 
in 
> On Fri, Mar 17, 2017 at 6:48 PM, Kyotaro HORIGUCHI <
> horiguchi.kyot...@lab.ntt.co.jp> wrote:
> 
> > Hello,
> >
> > At Mon, 13 Mar 2017 11:06:00 +1100, Venkata B Nagothi 
> > wrote in  > gmail.com>
> > > On Tue, Jan 17, 2017 at 9:36 PM, Kyotaro HORIGUCHI <
> > > horiguchi.kyot...@lab.ntt.co.jp> wrote:
> > > > I managed to reproduce this. A little tweak as the first patch
> > > > lets the standby to suicide as soon as walreceiver sees a
> > > > contrecord at the beginning of a segment.
> > > >
> > > > - M(aster): createdb as a master with wal_keep_segments = 0
> > > > (default), min_log_messages = debug2
> > > > - M: Create a physical repslot.
> > > > - S(tandby): Setup a standby database.
> > > > - S: Edit recovery.conf to use the replication slot above then
> > > >  start it.
> > > > - S: touch /tmp/hoge
> > > > - M: Run pgbench ...
> > > > - S: After a while, the standby stops.
> > > >   > LOG:   STOP THE SERVER
> > > >
> > > > - M: Stop pgbench.
> > > > - M: Do 'checkpoint;' twice.
> > > > - S: rm /tmp/hoge
> > > > - S: Fails to catch up with the following error.
> > > >
> > > >   > FATAL:  could not receive data from WAL stream: ERROR:  requested
> > WAL
> > > > segment 0001002B has already been removed
> > > >
> > > >
> > > I have been testing / reviewing the latest patch
> > > "0001-Fix-a-bug-of-physical-replication-slot.patch" and i think, i might
> > > need some more clarification on this.
> > >
> > > Before applying the patch, I tried re-producing the above error -
> > >
> > > - I had master->standby in streaming replication
> > > - Took the backup of master
> > >- with a low max_wal_size and wal_keep_segments = 0
> > > - Configured standby with recovery.conf
> > > - Created replication slot on master
> > > - Configured the replication slot on standby and started the standby
> >
> > I suppose the "configure" means primary_slot_name in recovery.conf.
> >
> > > - I got the below error
> > >
> > >>> 2017-03-10 11:58:15.704 AEDT [478] LOG:  invalid record length at
> > > 0/F2000140: wanted 24, got 0
> > >>> 2017-03-10 11:58:15.706 AEDT [481] LOG:  started streaming WAL from
> > > primary at 0/F200 on timeline 1
> > >>> 2017-03-10 11:58:15.706 AEDT [481] FATAL:  could not receive data
> > > from WAL stream: ERROR:  requested WAL segment 000100F2
> > has
> > > already been removed
> >
> > Maybe you created the master slot with non-reserve (default) mode
> > and put a pause of some minutes after making the backup and before
> > starting the standby. In that case the master slot doesn't keep
> > WAL segments unless the standby connects, so a couple of
> > checkpoints can blow away the first segment required by the
> > standby. This is quite reasonable behavior. The following steps
> > make this more certain.
> >
> > > - Took the backup of master
> > >- with a low max_wal_size = 2 and wal_keep_segments = 0
> > > - Configured standby with recovery.conf
> > > - Created replication slot on master
> > + - SELECT pg_switch_wal(); on master twice.
> > + - checkpoint; on master twice.
> > > - Configured the replication slot on standby and started the standby
> >
> > Creating the slot with the following command will save it.
> >
> > =# select pg_create_physical_replication_slot('s1', true);
> >
> 
> I did a test again by applying the patch, and I am not sure if the patch is
> doing the right thing.

This is a bit difficult to make it happen.

> Here is test case -
> 
> - I ran pgbench
> - I took the backup of the master first
> 
> - Below are the WALs on master after the stop backup -
> 
> postgres=# select pg_stop_backup();
> 
> NOTICE:  WAL archiving is not enabled; you must ensure that all required
> WAL segments are copied through other means to complete the backup
>  pg_stop_backup
> 
>  0/8C000130
> (1 row)
> 
> postgres=# \q
> [dba@buildhost data]$ ls -ltrh pgdata-10dev-prsb-1/pg_wal/
> total 65M
> drwx--. 2 dba dba 4.0K Mar 31 09:36 archive_status
> -rw---. 1 dba dba  16M Mar 31 11:09 0001008E
> -rw---. 1 dba dba  16M Mar 31 11:17 0001008F
> -rw---. 1 dba dba  16M Mar 31 11:18 0001008C
> -rw---. 1 dba dba  16M Mar 31 11:18 0001008D
> 
> - After the backup, I created the physical replication slot
> 
> 
> postgres=# select pg_create_physical_replication_slot('repslot',true);
> 
>  pg_create_physical_replication_slot
> -
>  (repslot,0/8D28)
> (1 row)

At this point, segment 8C is not protected from removal. It's too
late if the first record in 8D is continued from 8C.
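
For the arithmetic involved (a sketch with an assumed default 16MB
segment size; the function name is illustrative): a physical slot
protects the segment its restart_lsn falls in and everything after it,
so a record that merely continues into that segment from the previous
one is not covered.

#include <stdint.h>

/*
 * Which WAL segment an LSN falls into, assuming the default 16MB segment
 * size.  A physical slot retains this segment and later ones; a record
 * that begins in segment (segno - 1) and only continues into segno is
 * not protected by the slot.
 */
static uint64_t
wal_segment_of(uint64_t lsn)
{
	const uint64_t seg_size = 16 * 1024 * 1024;	/* default XLOG_SEG_SIZE */

	return lsn / seg_size;
}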

> postgres=# select 

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-03-30 Thread Venkata B Nagothi
On Fri, Mar 17, 2017 at 6:48 PM, Kyotaro HORIGUCHI <
horiguchi.kyot...@lab.ntt.co.jp> wrote:

> Hello,
>
> At Mon, 13 Mar 2017 11:06:00 +1100, Venkata B Nagothi 
> wrote in  gmail.com>
> > On Tue, Jan 17, 2017 at 9:36 PM, Kyotaro HORIGUCHI <
> > horiguchi.kyot...@lab.ntt.co.jp> wrote:
> > > I managed to reproduce this. A little tweak as the first patch
> > > lets the standby to suicide as soon as walreceiver sees a
> > > contrecord at the beginning of a segment.
> > >
> > > - M(aster): createdb as a master with wal_keep_segments = 0
> > > (default), min_log_messages = debug2
> > > - M: Create a physical repslot.
> > > - S(tandby): Setup a standby database.
> > > - S: Edit recovery.conf to use the replication slot above then
> > >  start it.
> > > - S: touch /tmp/hoge
> > > - M: Run pgbench ...
> > > - S: After a while, the standby stops.
> > >   > LOG:   STOP THE SERVER
> > >
> > > - M: Stop pgbench.
> > > - M: Do 'checkpoint;' twice.
> > > - S: rm /tmp/hoge
> > > - S: Fails to catch up with the following error.
> > >
> > >   > FATAL:  could not receive data from WAL stream: ERROR:  requested
> WAL
> > > segment 0001002B has already been removed
> > >
> > >
> > I have been testing / reviewing the latest patch
> > "0001-Fix-a-bug-of-physical-replication-slot.patch" and i think, i might
> > need some more clarification on this.
> >
> > Before applying the patch, I tried re-producing the above error -
> >
> > - I had master->standby in streaming replication
> > - Took the backup of master
> >- with a low max_wal_size and wal_keep_segments = 0
> > - Configured standby with recovery.conf
> > - Created replication slot on master
> > - Configured the replication slot on standby and started the standby
>
> I suppose the "configure" means primary_slot_name in recovery.conf.
>
> > - I got the below error
> >
> >>> 2017-03-10 11:58:15.704 AEDT [478] LOG:  invalid record length at
> > 0/F2000140: wanted 24, got 0
> >>> 2017-03-10 11:58:15.706 AEDT [481] LOG:  started streaming WAL from
> > primary at 0/F200 on timeline 1
> >>> 2017-03-10 11:58:15.706 AEDT [481] FATAL:  could not receive data
> > from WAL stream: ERROR:  requested WAL segment 000100F2
> has
> > already been removed
>
> Maybe you created the master slot with non-reserve (default) mode
> and put a pause of some minutes after making the backup and before
> starting the standby. In that case the master slot doesn't keep
> WAL segments unless the standby connects, so a couple of
> checkpoints can blow away the first segment required by the
> standby. This is quite reasonable behavior. The following steps
> make this more certain.
>
> > - Took the backup of master
> >- with a low max_wal_size = 2 and wal_keep_segments = 0
> > - Configured standby with recovery.conf
> > - Created replication slot on master
> + - SELECT pg_switch_wal(); on master twice.
> + - checkpoint; on master twice.
> > - Configured the replication slot on standby and started the standby
>
> Creating the slot with the following command will save it.
>
> =# select pg_create_physical_replication_slot('s1', true);
>

I did a test again by applying the patch, and I am not sure if the patch is
doing the right thing.

Here is test case -

- I ran pgbench
- I took the backup of the master first

- Below are the WALs on master after the stop backup -

postgres=# select pg_stop_backup();

NOTICE:  WAL archiving is not enabled; you must ensure that all required
WAL segments are copied through other means to complete the backup
 pg_stop_backup

 0/8C000130
(1 row)

postgres=# \q
[dba@buildhost data]$ ls -ltrh pgdata-10dev-prsb-1/pg_wal/
total 65M
drwx--. 2 dba dba 4.0K Mar 31 09:36 archive_status
-rw---. 1 dba dba  16M Mar 31 11:09 0001008E
-rw---. 1 dba dba  16M Mar 31 11:17 0001008F
-rw---. 1 dba dba  16M Mar 31 11:18 0001008C
-rw---. 1 dba dba  16M Mar 31 11:18 0001008D

- After the backup, I created the physical replication slot


postgres=# select pg_create_physical_replication_slot('repslot',true);

 pg_create_physical_replication_slot
-
 (repslot,0/8D28)
(1 row)

postgres=# select pg_walfile_name('0/8D28');

 pg_walfile_name
---
 0001008D
(1 row)

Here, when you start the standby, it would ask for the file
0001008C, which is the first file needed by the standby,
and since I applied your patch, I am assuming that the file
0001008C should also be retained without being removed -
correct?

- I started the standby and the below error occurs

>> 2017-03-31 11:26:01.288 AEDT [17475] LOG:  invalid record length at
0/8C000108: wanted 24, got 0
>> 2017-03-31 11:26:01.291 AEDT [17486] LOG:  started streaming 

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-03-30 Thread Venkata B Nagothi
On Thu, Mar 30, 2017 at 4:46 PM, Kyotaro HORIGUCHI <
horiguchi.kyot...@lab.ntt.co.jp> wrote:

> Hello,
>
> At Thu, 30 Mar 2017 15:59:14 +1100, Venkata B Nagothi 
> wrote in 

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-03-29 Thread Kyotaro HORIGUCHI
Hello,

At Thu, 30 Mar 2017 15:59:14 +1100, Venkata B Nagothi  wrote 
in 

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-03-29 Thread Venkata B Nagothi
On Thu, Mar 30, 2017 at 3:51 PM, Kyotaro HORIGUCHI <
horiguchi.kyot...@lab.ntt.co.jp> wrote:

> At Thu, 30 Mar 2017 11:12:56 +1100, Venkata B Nagothi 
> wrote in  gmail.com>
> > On Thu, Mar 30, 2017 at 10:55 AM, Michael Paquier <
> michael.paqu...@gmail.com
> > > wrote:
> >
> > > On Thu, Mar 30, 2017 at 8:49 AM, Venkata B Nagothi 
> > > wrote:
> > > > On Tue, Mar 28, 2017 at 5:51 PM, Kyotaro HORIGUCHI
> > > >  wrote:
> > > > I tried applying this patch to latest master, it is not getting
> applied
> > > >
> > > > [dba@buildhost postgresql]$ git apply
> > > > /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/
> > > 0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch
> > > > /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/
> > > 0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:28:
> > > > trailing whitespace.
> > > > /*
> > > 0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:29:
> > > > trailing whitespace.
> > > >  * This variable corresponds to restart_lsn in pg_replication_slots
> for a
> ...
> > > git apply and git am can be very picky sometimes, so you may want to
> > > fallback to patch -p1 if things don't work. In this case it does.
> > >
> >
> > patch -p1 seems to be working. Thanks !
>
> That's quite strange. The patch I sent doesn't contain trailing
> spaces at all. The cited lines don't seem to contain them. It
> applied cleanly with "git am" for me.
>
> The file restored from the attachment of the received mail also doesn't.
>
> The original file contains the following,
>
> 0002440  66  6f  72  20  61  0a  2b  20  2a  20  70  68  79  73  69  63
>   f   o   r   a  \n   +   *   p   h   y   s   i   c
>
> The corresponding part of the file restored from mail on Windows
> is the following,
> 0002460  63  61  74  69  6f  6e  5f  73  6c  6f  74  73  20  66  6f  72
>   c   a   t   i   o   n   _   s   l   o   t   s   f   o   r
> 0002500  20  61  0d  0a  2b  20  2a  20  70  68  79  73  69  63  61  6c
>   a  \r  \n   +   *   p   h   y   s   i   c   a   l
>
> Neither contains a space at the end of a line. How did you
> retrieve the patch from the mail?
>

Yes, downloaded from the email on Windows and copied across to Linux and
did "git apply".

Regards,

Venkata B N
Database Consultant


Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-03-29 Thread Kyotaro HORIGUCHI
At Thu, 30 Mar 2017 11:12:56 +1100, Venkata B Nagothi  wrote 
in 
> On Thu, Mar 30, 2017 at 10:55 AM, Michael Paquier  > wrote:
> 
> > On Thu, Mar 30, 2017 at 8:49 AM, Venkata B Nagothi 
> > wrote:
> > > On Tue, Mar 28, 2017 at 5:51 PM, Kyotaro HORIGUCHI
> > >  wrote:
> > > I tried applying this patch to latest master, it is not getting applied
> > >
> > > [dba@buildhost postgresql]$ git apply
> > > /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/
> > 0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch
> > > /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/
> > 0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:28:
> > > trailing whitespace.
> > > /*
> > 0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:29:
> > > trailing whitespace.
> > >  * This variable corresponds to restart_lsn in pg_replication_slots for a
...
> > git apply and git am can be very picky sometimes, so you may want to
> > fallback to patch -p1 if things don't work. In this case it does.
> >
> 
> patch -p1 seems to be working. Thanks !

That's quite strange. The patch I sent doesn't contain trailing
spaces at all. The cited lines don't seem to contain them. It
applied cleanly with "git am" for me.

The file restored from the attachment of the received mail also doesn't.

The original file contains the following,

0002440  66  6f  72  20  61  0a  2b  20  2a  20  70  68  79  73  69  63
  f   o   r   a  \n   +   *   p   h   y   s   i   c

The corresponding part of the file restored from mail on Windows
is the following,
0002460  63  61  74  69  6f  6e  5f  73  6c  6f  74  73  20  66  6f  72
  c   a   t   i   o   n   _   s   l   o   t   s   f   o   r
0002500  20  61  0d  0a  2b  20  2a  20  70  68  79  73  69  63  61  6c
  a  \r  \n   +   *   p   h   y   s   i   c   a   l

Neither contains a space at the end of a line. How did you
retrieve the patch from the mail?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-03-29 Thread Tatsuo Ishii
> On Thu, Mar 30, 2017 at 9:12 AM, Tatsuo Ishii  wrote:
>> Committers will not apply patches which have trailing whitespace
>> issues. So the patch submitter needs to fix them anyway.
> 
> I cannot comment on that point (committers are free to pick up things
> the way they want), but just using git commands to apply a patch
> should not be an obstacle for a review if a patch can be easily
> applied, as long as it roughly respects GNU's diff format.

My point is about the coding standard. Having trailing whitespace is
against our coding standard, and committers should not accept such
code, I believe.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-03-29 Thread Michael Paquier
On Thu, Mar 30, 2017 at 9:12 AM, Tatsuo Ishii  wrote:
> Committers will not apply patches which have trailing whitespace
> issues. So the patch submitter needs to fix them anyway.

I cannot comment on that point (committers are free to pick up things
the way they want), but just using git commands to apply a patch
should not be an obstacle for a review if a patch can be easily
applied, as long as it roughly respects GNU's diff format.
-- 
Michael




Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-03-29 Thread Venkata B Nagothi
On Thu, Mar 30, 2017 at 10:55 AM, Michael Paquier  wrote:

> On Thu, Mar 30, 2017 at 8:49 AM, Venkata B Nagothi 
> wrote:
> > On Tue, Mar 28, 2017 at 5:51 PM, Kyotaro HORIGUCHI
> >  wrote:
> > I tried applying this patch to latest master, it is not getting applied
> >
> > [dba@buildhost postgresql]$ git apply
> > /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/
> 0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch
> > /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/
> 0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:28:
> > trailing whitespace.
> > /*
> > /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/
> 0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:29:
> > trailing whitespace.
> >  * This variable corresponds to restart_lsn in pg_replication_slots for a
> > /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/
> 0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:30:
> > trailing whitespace.
> >  * physical slot. This has a valid value only when it differs from the
> > current
> > /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/
> 0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:31:
> > trailing whitespace.
> >  * flush pointer.
> > /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/
> 0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:32:
> > trailing whitespace.
> >  */
> > error: patch failed: src/backend/replication/walsender.c:210
> > error: src/backend/replication/walsender.c: patch does not apply
>
> git apply and git am can be very picky sometimes, so you may want to
> fallback to patch -p1 if things don't work. In this case it does.
>

patch -p1 seems to be working. Thanks !

Regards,

Venkata B N
Database Consultant


Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-03-29 Thread Tatsuo Ishii
> On Thu, Mar 30, 2017 at 8:49 AM, Venkata B Nagothi  wrote:
>> On Tue, Mar 28, 2017 at 5:51 PM, Kyotaro HORIGUCHI
>>  wrote:
>> I tried applying this patch to latest master, it is not getting applied
>>
>> [dba@buildhost postgresql]$ git apply
>> /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch
>> /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:28:
>> trailing whitespace.
>> /*
>> /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:29:
>> trailing whitespace.
>>  * This variable corresponds to restart_lsn in pg_replication_slots for a
>> /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:30:
>> trailing whitespace.
>>  * physical slot. This has a valid value only when it differs from the
>> current
>> /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:31:
>> trailing whitespace.
>>  * flush pointer.
>> /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:32:
>> trailing whitespace.
>>  */
>> error: patch failed: src/backend/replication/walsender.c:210
>> error: src/backend/replication/walsender.c: patch does not apply
> 
> git apply and git am can be very picky sometimes, so you may want to
> fallback to patch -p1 if things don't work. In this case it does.

Committers will not apply patches which have trailing whitespace
issues. So the patch submitter needs to fix them anyway.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-03-29 Thread Michael Paquier
On Thu, Mar 30, 2017 at 8:49 AM, Venkata B Nagothi  wrote:
> On Tue, Mar 28, 2017 at 5:51 PM, Kyotaro HORIGUCHI
>  wrote:
> I tried applying this patch to latest master, it is not getting applied
>
> [dba@buildhost postgresql]$ git apply
> /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch
> /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:28:
> trailing whitespace.
> /*
> /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:29:
> trailing whitespace.
>  * This variable corresponds to restart_lsn in pg_replication_slots for a
> /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:30:
> trailing whitespace.
>  * physical slot. This has a valid value only when it differs from the
> current
> /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:31:
> trailing whitespace.
>  * flush pointer.
> /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:32:
> trailing whitespace.
>  */
> error: patch failed: src/backend/replication/walsender.c:210
> error: src/backend/replication/walsender.c: patch does not apply

git apply and git am can be very picky sometimes, so you may want to
fallback to patch -p1 if things don't work. In this case it does.
-- 
Michael




Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-03-29 Thread Venkata B Nagothi
Regards,

Venkata B N
Database Consultant


On Tue, Mar 28, 2017 at 5:51 PM, Kyotaro HORIGUCHI <
horiguchi.kyot...@lab.ntt.co.jp> wrote:

> This conflicts with 6912acc (replication lag tracker) so just
> rebased on a6f22e8.
>

I tried applying this patch to the latest master; it is not getting applied

[dba@buildhost postgresql]$ git apply
/data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch
/data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:28:
trailing whitespace.
/*
/data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:29:
trailing whitespace.
 * This variable corresponds to restart_lsn in pg_replication_slots for a
/data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:30:
trailing whitespace.
 * physical slot. This has a valid value only when it differs from the
current
/data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:31:
trailing whitespace.
 * flush pointer.
/data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:32:
trailing whitespace.
 */
error: patch failed: src/backend/replication/walsender.c:210
error: src/backend/replication/walsender.c: patch does not apply


Regards,

Venkata Balaji N
Database Consultant


Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-03-28 Thread Kyotaro HORIGUCHI
This conflicts with 6912acc (replication lag tracker), so I just
rebased it on a6f22e8.

At Fri, 17 Mar 2017 16:48:27 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20170317.164827.46663014.horiguchi.kyot...@lab.ntt.co.jp>
> Hello,
> 
> At Mon, 13 Mar 2017 11:06:00 +1100, Venkata B Nagothi  
> wrote in 
> > On Tue, Jan 17, 2017 at 9:36 PM, Kyotaro HORIGUCHI <
> > horiguchi.kyot...@lab.ntt.co.jp> wrote:
> > > I managed to reproduce this. A little tweak as the first patch
> > > lets the standby to suicide as soon as walreceiver sees a
> > > contrecord at the beginning of a segment.
> > >
> > > - M(aster): createdb as a master with wal_keep_segments = 0
> > > (default), min_log_messages = debug2
> > > - M: Create a physical repslot.
> > > - S(tandby): Setup a standby database.
> > > - S: Edit recovery.conf to use the replication slot above then
> > >  start it.
> > > - S: touch /tmp/hoge
> > > - M: Run pgbench ...
> > > - S: After a while, the standby stops.
> > >   > LOG:   STOP THE SERVER
> > >
> > > - M: Stop pgbench.
> > > - M: Do 'checkpoint;' twice.
> > > - S: rm /tmp/hoge
> > > - S: Fails to catch up with the following error.
> > >
> > >   > FATAL:  could not receive data from WAL stream: ERROR:  requested WAL
> > > segment 0001002B has already been removed
> > >
> > >
> > I have been testing / reviewing the latest patch
> > "0001-Fix-a-bug-of-physical-replication-slot.patch" and i think, i might
> > need some more clarification on this.
> > 
> > Before applying the patch, I tried re-producing the above error -
> > 
> > - I had master->standby in streaming replication
> > - Took the backup of master
> >- with a low max_wal_size and wal_keep_segments = 0
> > - Configured standby with recovery.conf
> > - Created replication slot on master
> > - Configured the replication slot on standby and started the standby
> 
> I suppose the "configure" means primary_slot_name in recovery.conf.
> 
> > - I got the below error
> > 
> >>> 2017-03-10 11:58:15.704 AEDT [478] LOG:  invalid record length at
> > 0/F2000140: wanted 24, got 0
> >>> 2017-03-10 11:58:15.706 AEDT [481] LOG:  started streaming WAL from
> > primary at 0/F200 on timeline 1
> >>> 2017-03-10 11:58:15.706 AEDT [481] FATAL:  could not receive data
> > from WAL stream: ERROR:  requested WAL segment 000100F2 has
> > already been removed
> 
> Maybe you created the master slot with non-reserve (default) mode
> and put a some-minites pause after making the backup and before
> starting the standby. For the case the master slot doesn't keep
> WAL segments unless the standby connects so a couple of
> checkpoints can blow away the first segment required by the
> standby. This is quite reasonable behavior. The following steps
> makes this more sure.
> 
> > - Took the backup of master
> >- with a low max_wal_size = 2 and wal_keep_segments = 0
> > - Configured standby with recovery.conf
> > - Created replication slot on master
> + - SELECT pg_switch_wal(); on master twice.
> + - checkpoint; on master twice.
> > - Configured the replication slot on standby and started the standby
> 
> Creating the slot with the following command will save it.
> 
> =# select pg_create_physical_replication_slot('s1', true);
> 
> 
> > and i could notice that the file "000100F2" was removed
> > from the master. This can be easily re-produced and this occurs
> > irrespective of configuring replication slots.
> > 
> > As long as the file "000100F2" is available on the master,
> > standby continues to stream WALs without any issues.
> ...
> > If the scenario i created to reproduce the error is correct, then, applying
> > the patch is not making a difference.
> 
> Yes, the patch is not for saving this case. The patch saves the
> case where the previous segment to the first required segment by
> standby was removed and it contains the first part of a record
> continues to the first required segment. On the other hand this
> case is that the segment at the start point of standby is just
> removed.
> 
> > I think, i need help in building a specific test case which will re-produce
> > the specific BUG related to physical replication slots as reported ?
> > 
> > Will continue to review the patch, once i have any comments on this.
> 
> Thaks a lot!

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 9046125563b1b02e9bc0839bde5d77e77c940bbd Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 1 Feb 2017 16:07:22 +0900
Subject: [PATCH] Fix a bug of physical replication slot.

A physical-replication standby can stop just at the boundary of WAL
segments. restart_lsn of the slot on the master can be assumed to be
the same location. The last segment on the master will be removed
after some checkpoints for 

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-03-17 Thread Kyotaro HORIGUCHI
Hello,

At Mon, 13 Mar 2017 11:06:00 +1100, Venkata B Nagothi  wrote 
in 
> On Tue, Jan 17, 2017 at 9:36 PM, Kyotaro HORIGUCHI <
> horiguchi.kyot...@lab.ntt.co.jp> wrote:
> > I managed to reproduce this. A little tweak as the first patch
> > lets the standby to suicide as soon as walreceiver sees a
> > contrecord at the beginning of a segment.
> >
> > - M(aster): createdb as a master with wal_keep_segments = 0
> > (default), min_log_messages = debug2
> > - M: Create a physical repslot.
> > - S(tandby): Setup a standby database.
> > - S: Edit recovery.conf to use the replication slot above then
> >  start it.
> > - S: touch /tmp/hoge
> > - M: Run pgbench ...
> > - S: After a while, the standby stops.
> >   > LOG:   STOP THE SERVER
> >
> > - M: Stop pgbench.
> > - M: Do 'checkpoint;' twice.
> > - S: rm /tmp/hoge
> > - S: Fails to catch up with the following error.
> >
> >   > FATAL:  could not receive data from WAL stream: ERROR:  requested WAL
> > segment 0001002B has already been removed
> >
> >
> I have been testing / reviewing the latest patch
> "0001-Fix-a-bug-of-physical-replication-slot.patch" and i think, i might
> need some more clarification on this.
> 
> Before applying the patch, I tried re-producing the above error -
> 
> - I had master->standby in streaming replication
> - Took the backup of master
>- with a low max_wal_size and wal_keep_segments = 0
> - Configured standby with recovery.conf
> - Created replication slot on master
> - Configured the replication slot on standby and started the standby

I suppose "configured" here means setting primary_slot_name in recovery.conf.

> - I got the below error
> 
>>> 2017-03-10 11:58:15.704 AEDT [478] LOG:  invalid record length at
> 0/F2000140: wanted 24, got 0
>>> 2017-03-10 11:58:15.706 AEDT [481] LOG:  started streaming WAL from
> primary at 0/F200 on timeline 1
>>> 2017-03-10 11:58:15.706 AEDT [481] FATAL:  could not receive data
> from WAL stream: ERROR:  requested WAL segment 000100F2 has
> already been removed

Maybe you created the master slot in non-reserve (default) mode
and paused for some minutes after taking the backup and before
starting the standby. In that case the master slot doesn't keep
WAL segments until the standby connects, so a couple of
checkpoints can blow away the first segment required by the
standby. This is quite reasonable behavior. The following steps
make this more reproducible.

> - Took the backup of master
>- with a low max_wal_size = 2 and wal_keep_segments = 0
> - Configured standby with recovery.conf
> - Created replication slot on master
+ - SELECT pg_switch_wal(); on master twice.
+ - checkpoint; on master twice.
> - Configured the replication slot on standby and started the standby

Creating the slot with the following command, which reserves WAL
immediately, will avoid that:

=# select pg_create_physical_replication_slot('s1', true);


> and i could notice that the file "000100F2" was removed
> from the master. This can be easily re-produced and this occurs
> irrespective of configuring replication slots.
> 
> As long as the file "000100F2" is available on the master,
> standby continues to stream WALs without any issues.
...
> If the scenario i created to reproduce the error is correct, then, applying
> the patch is not making a difference.

Yes, the patch is not meant to save this case. The patch addresses the
case where the segment just before the first segment required by the
standby has been removed while it contains the first part of a record
that continues into the first required segment. This case, on the
other hand, is simply that the segment at the standby's start point
has been removed.
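
To make that distinction concrete, here is a minimal, self-contained
sketch (not PostgreSQL code) of the arithmetic behind the case the patch
does address; the LSNs are made up to match the segment names from the
original report, and 16MB is the default WAL segment size:

#include <stdint.h>
#include <stdio.h>

#define WAL_SEG_SIZE ((uint64_t) 16 * 1024 * 1024)

int main(void)
{
	/* a record begins near the end of segment ...06008A ... */
	uint64_t rec_start = 0x68AFFFF60ULL;
	/* ... and the standby has flushed its continuation into ...06008B */
	uint64_t standby_flush = 0x68B000120ULL;

	/*
	 * restart_lsn follows the flush pointer, so the slot only protects
	 * segment ...08B; segment ...08A, which still holds the record's
	 * first fragment, becomes removable after a couple of checkpoints.
	 */
	printf("record starts in segment %llX, slot keeps from segment %llX\n",
		   (unsigned long long) (rec_start / WAL_SEG_SIZE),
		   (unsigned long long) (standby_flush / WAL_SEG_SIZE));
	return 0;
}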

> I think, i need help in building a specific test case which will re-produce
> the specific BUG related to physical replication slots as reported ?
> 
> Will continue to review the patch, once i have any comments on this.

Thanks a lot!

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-03-12 Thread Venkata B Nagothi
On Tue, Jan 17, 2017 at 9:36 PM, Kyotaro HORIGUCHI <
horiguchi.kyot...@lab.ntt.co.jp> wrote:

> Hello. I added pgsql-hackers.
>
> This occurs also on git master and back to 9.4.
>
> At Fri, 13 Jan 2017 08:47:06 -0600, Jonathon Nelson 
> wrote in  gmail.com>
> > On Mon, Nov 28, 2016 at 1:39 PM, Jonathon Nelson 
> wrote:
> > > First, the postgresql configuration differs only minimally from the
> stock
> > > config:
> > >
> > > Assume wal_keep_segments = 0.
> > > Assume the use of physical replication slots.
> > > Assume one master, one standby.
> > >
> > > Lastly, we have observed the behavior "in the wild" at least twice and
> in
> > > the lab a dozen or so times.
> > >
> > > EXAMPLE #1 (logs are from the replica):
> > >
> > > user=,db=,app=,client= DEBUG:  creating and filling new WAL file
> > > user=,db=,app=,client= DEBUG:  done creating and filling new WAL file
> > > user=,db=,app=,client= DEBUG:  sending write 6/8B00 flush
> 6/8A00
> > > apply 5/748425A0
> > > user=,db=,app=,client= DEBUG:  sending write 6/8B00 flush
> 6/8B00
> > > apply 5/74843020
> > > 
> > > user=,db=,app=,client= DEBUG:  postmaster received signal 2
> > > user=,db=,app=,client= LOG:  received fast shutdown request
> > > user=,db=,app=,client= LOG:  aborting any active transactions
> > >
> > > And, upon restart:
> > >
> > > user=,db=,app=,client= LOG:  restartpoint starting: xlog
> > > user=,db=,app=,client= DEBUG:  updated min recovery point to
> 6/67002390 on
> > > timeline 1
> > > user=,db=,app=,client= DEBUG:  performing replication slot checkpoint
> > > user=,db=,app=,client= DEBUG:  updated min recovery point to
> 6/671768C0 on
> > > timeline 1
> > > user=,db=,app=,client= CONTEXT:  writing block 589 of relation
> > > base/13294/16501
> > > user=,db=,app=,client= LOG:  invalid magic number  in log segment
> > > 00010006008B, offset 0
> > > user=,db=,app=,client= DEBUG:  switched WAL source from archive to
> stream
> > > after failure
> > > user=,db=,app=,client= LOG:  started streaming WAL from primary at
> > > 6/8A00 on timeline 1
> > > user=,db=,app=,client= FATAL:  could not receive data from WAL stream:
> > > ERROR:  requested WAL segment 00010006008A has already been
> > > removed
>
> I managed to reproduce this. A little tweak as the first patch
> lets the standby to suicide as soon as walreceiver sees a
> contrecord at the beginning of a segment.
>
> - M(aster): createdb as a master with wal_keep_segments = 0
> (default), min_log_messages = debug2
> - M: Create a physical repslot.
> - S(tandby): Setup a standby database.
> - S: Edit recovery.conf to use the replication slot above then
>  start it.
> - S: touch /tmp/hoge
> - M: Run pgbench ...
> - S: After a while, the standby stops.
>   > LOG:   STOP THE SERVER
>
> - M: Stop pgbench.
> - M: Do 'checkpoint;' twice.
> - S: rm /tmp/hoge
> - S: Fails to catch up with the following error.
>
>   > FATAL:  could not receive data from WAL stream: ERROR:  requested WAL
> segment 0001002B has already been removed
>
>
I have been testing / reviewing the latest patch
"0001-Fix-a-bug-of-physical-replication-slot.patch" and I think I might
need some more clarification on this.

Before applying the patch, I tried reproducing the above error -

- I had master->standby in streaming replication
- Took the backup of master
   - with a low max_wal_size and wal_keep_segments = 0
- Configured standby with recovery.conf
- Created replication slot on master
- Configured the replication slot on standby and started the standby
- I got the below error

   >> 2017-03-10 11:58:15.704 AEDT [478] LOG:  invalid record length at
0/F2000140: wanted 24, got 0
   >> 2017-03-10 11:58:15.706 AEDT [481] LOG:  started streaming WAL from
primary at 0/F200 on timeline 1
   >> 2017-03-10 11:58:15.706 AEDT [481] FATAL:  could not receive data
from WAL stream: ERROR:  requested WAL segment 000100F2 has
already been removed

and I noticed that the file "000100F2" was removed
from the master. This can be easily reproduced, and it occurs
irrespective of whether replication slots are configured.

As long as the file "000100F2" is available on the master, the
standby continues to stream WAL without any issues.

Some more details -

Contents of the file "000100F2" on the standby before
pg_stop_backup():

rmgr: Standby len (rec/tot): 24/50, tx:  0, lsn:
0/F228, prev 0/F198, desc: RUNNING_XACTS nextXid 638
latestCompletedXid 637 oldestRunningXid 638
rmgr: Standby len (rec/tot): 24/50, tx:  0, lsn:
0/F260, prev 0/F228, desc: RUNNING_XACTS nextXid 638
latestCompletedXid 637 oldestRunningXid 638
rmgr: XLOG    len (rec/tot): 80/   106, tx:  0, lsn:
0/F298, prev 0/F260, desc: CHECKPOINT_ONLINE 

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-02-02 Thread Kyotaro HORIGUCHI
At Thu, 2 Feb 2017 15:34:33 +0900, Michael Paquier  
wrote in 
> On Thu, Feb 2, 2017 at 1:26 AM, Fujii Masao  wrote:
> > I'm afraid that many WAL segments would start with a continuation record
> > when there are the workload of short transactions (e.g., by pgbench), and
> > which would make restart_lsn go behind very much. No?
> 
> I don't quite understand this argument. Even if there are many small
> transactions, that would cause restart_lsn to just be late by one
> segment, all the time.
> 
> > The discussion on this thread just makes me think that restart_lsn should
> > indicate the replay location instead of flush location. This seems safer.
> 
> That would penalize WAL retention on the primary with standbys using
> recovery_min_apply_delay and a slot for example...
> 
> We can attempt to address this problem two ways. The patch proposed
> (ugly btw and there are two typos!) is doing it in the WAL sender by
> not making restart_lsn jump to the next segment if a continuation
> record is found.

Sorry for the ug.. :p  Anyway, the previous version was not the
latest. The attached one is the revised version. (Sorry, I
hadn't found the typos myself..)

>  Or we could have the standby request for the next
> segment instead if the record it wants to replay from is at a boundary
> and that it locally has the beginning of the record, and it has it
> because it already confirmed to the primary that it flushed to the
> next segment. Not sure which fix is better though.

We could do that as I said, with some refactoring of ReadRecord
involving the reader plugin mechanism..

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From d835bf248e6869f7b843d339c9213a082e332297 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 1 Feb 2017 16:07:22 +0900
Subject: [PATCH] Fix a bug of physical replication slot.

A physical-replication standby can stop just at the boundary of WAL
segments. restart_lsn of the slot on the master can be assumed to be
the same location. The last segment on the master will be removed
after some checkpoints for the case. If the first record of the next
segment is a continuation record, it is only on the master and its
beginning is only on the standby so the standby cannot restart because
the record to read is scattered to two sources.

This patch detains restart_lsn in the last segment when the first page
of the next segment is a continuation record.
---
 src/backend/replication/walsender.c | 104 +---
 1 file changed, 97 insertions(+), 7 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 76f09fb..0ec7ba9 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -188,6 +188,13 @@ static volatile sig_atomic_t replication_active = false;
 static LogicalDecodingContext *logical_decoding_ctx = NULL;
 static XLogRecPtr logical_startptr = InvalidXLogRecPtr;
 
+/*
+ * This variable corresponds to restart_lsn in pg_replication_slots for a
+ * physical slot. This has a valid value only when it differs from the current
+ * flush pointer.
+ */
+static XLogRecPtr	   restartLSN = InvalidXLogRecPtr;
+
 /* Signal handlers */
 static void WalSndSigHupHandler(SIGNAL_ARGS);
 static void WalSndXLogSendHandler(SIGNAL_ARGS);
@@ -220,7 +227,7 @@ static void WalSndPrepareWrite(LogicalDecodingContext *ctx, XLogRecPtr lsn, Tran
 static void WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static XLogRecPtr WalSndWaitForWal(XLogRecPtr loc);
 
-static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
+static bool XLogRead(char *buf, XLogRecPtr startptr, Size count, bool noutfoundok);
 
 
 /* Initialize walsender process before entering the main command loop */
@@ -504,6 +511,9 @@ StartReplication(StartReplicationCmd *cmd)
 			ereport(ERROR,
 	(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
	 (errmsg("cannot use a logical replication slot for physical replication"))));
+
+		/* Restore restartLSN from replication slot */
+		restartLSN = MyReplicationSlot->data.restart_lsn;
 	}
 
 	/*
@@ -519,6 +529,10 @@ StartReplication(StartReplicationCmd *cmd)
 	else
 		FlushPtr = GetFlushRecPtr();
 
+	/* Set InvalidXLogRecPtr if catching up */
+	if (restartLSN == FlushPtr)
+		restartLSN = InvalidXLogRecPtr;
+
 	if (cmd->timeline != 0)
 	{
 		XLogRecPtr	switchpoint;
@@ -727,7 +741,7 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 		count = flushptr - targetPagePtr;
 
 	/* now actually read the data, we know it's there */
-	XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ);
+	XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ, false);
 
 	return count;
 }
@@ -1486,7 +1500,7 @@ static void
 ProcessStandbyReplyMessage(void)
 {
 	XLogRecPtr	writePtr,
-

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-02-01 Thread Michael Paquier
On Thu, Feb 2, 2017 at 1:26 AM, Fujii Masao  wrote:
> I'm afraid that many WAL segments would start with a continuation record
> when there are the workload of short transactions (e.g., by pgbench), and
> which would make restart_lsn go behind very much. No?

I don't quite understand this argument. Even if there are many small
transactions, that would cause restart_lsn to just be late by one
segment, all the time.

> The discussion on this thread just makes me think that restart_lsn should
> indicate the replay location instead of flush location. This seems safer.

That would penalize WAL retention on the primary with standbys using
recovery_min_apply_delay and a slot for example...

We can attempt to address this problem two ways. The patch proposed
(ugly btw and there are two typos!) is doing it in the WAL sender by
not making restart_lsn jump to the next segment if a continuation
record is found. Or we could have the standby request the next
segment instead, if the record it wants to replay from is at a boundary
and it locally has the beginning of the record (it has it
because it already confirmed to the primary that it flushed to the
next segment). Not sure which fix is better though.
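
For what it's worth, the standby-side idea could look roughly like the
following sketch (illustrative only, with local stand-ins for the
backend's XLogRecPtr and segment-size definitions):

#include <stdint.h>

typedef uint64_t XLogRecPtr;
#define XLOG_SEG_SIZE ((uint64_t) 16 * 1024 * 1024)

/*
 * If the beginning of the wanted record lies in a segment we have already
 * flushed locally and only its continuation in a later segment is missing,
 * ask the primary to stream from that next segment boundary instead of
 * from the record's start.
 */
static XLogRecPtr
stream_request_start(XLogRecPtr rec_start, XLogRecPtr local_flush)
{
	uint64_t	rec_seg = rec_start / XLOG_SEG_SIZE;
	uint64_t	flush_seg = local_flush / XLOG_SEG_SIZE;

	if (flush_seg > rec_seg)
		return (rec_seg + 1) * XLOG_SEG_SIZE;

	return rec_start;
}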
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-02-01 Thread Kyotaro HORIGUCHI
Thank you for the comment.

At Thu, 2 Feb 2017 01:26:03 +0900, Fujii Masao  wrote in 

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-02-01 Thread Fujii Masao
On Thu, Jan 19, 2017 at 6:37 PM, Kyotaro HORIGUCHI
 wrote:
> Hello,
>
> At Wed, 18 Jan 2017 12:34:51 +0900, Michael Paquier 
>  wrote in 
> 
>> On Tue, Jan 17, 2017 at 7:36 PM, Kyotaro HORIGUCHI
>>  wrote:
>> > I managed to reproduce this. A little tweak as the first patch
>> > lets the standby to suicide as soon as walreceiver sees a
>> > contrecord at the beginning of a segment.
>>
>> Good idea.
>
> Thanks. Fortunately(?), the problematic situation seems to happen
> at almost all segment boundary.
>
>> > I believe that a continuation record cannot be span over three or
>> > more *segments* (is it right?), so keeping one spare segment
>> > would be enough. The attached second patch does this.
>>
>> I have to admit that I did not think about this problem much yet (I
>> bookmarked this report weeks ago to be honest as something to look
>> at), but that does not look right to me. Couldn't a record be spawned
>> across even more segments? Take a random string longer than 64MB or
>> event longer for example.
>
> Though I haven't look closer to how a modification is splitted
> into WAL records. A tuple cannot be so long. As a simple test, I
> observed rechder->xl_tot_len at the end of XLogRecordAssemble
> inserting an about 400KB not-so-compressable string into a text
> column, but I saw a series of many records with shorter than
> several thousand bytes.
>
>> > Other possible measures might be,
>> >
>> > - Allowing switching wal source while reading a continuation
>> >   record. Currently ReadRecord assumes that a continuation record
>> >   can be read from single source. But this needs refactoring
>> >   involving xlog.c, xlogreader.c and relatives.
>>
>> This is scary thinking about back-branches.
>
> Yes. It would be no longer a bug fix. (Or becomes a quite ugly hack..)
>
>> > - Delaying recycling a segment until the last partial record on it
>> >   completes. This seems doable in page-wise (coarse resolution)
>> >   but would cost additional reading of past xlog files (page
>> >   header of past pages is required).
>>
>> Hm, yes. That looks like the least invasive way to go. At least that
>> looks more correct than the others.
>
> The attached patch does that. Usually it reads page headers only
> on segment boundaries, but once continuation record found (or
> failed to read the next page header, that is, the first record on
> the first page in the next segment has not been replicated), it
> becomes to happen on every page boundary until non-continuation
> page comes.

I'm afraid that many WAL segments would start with a continuation record
when there is a workload of short transactions (e.g., by pgbench),
which would make restart_lsn fall behind very much. No?

The discussion on this thread just makes me think that restart_lsn should
indicate the replay location instead of flush location. This seems safer.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-01-31 Thread Kyotaro HORIGUCHI
Hello, I'll add the rebased version to the next CF.

At Fri, 20 Jan 2017 11:07:29 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20170120.110729.107284864.horiguchi.kyot...@lab.ntt.co.jp>
> > > > - Delaying recycling a segment until the last partial record on it
> > > >   completes. This seems doable in page-wise (coarse resolution)
> > > >   but would cost additional reading of past xlog files (page
> > > >   header of past pages is required).
> > > 
> > > Hm, yes. That looks like the least invasive way to go. At least that
> > > looks more correct than the others.
> > 
> > The attached patch does that. Usually it reads page headers only
> > on segment boundaries, but once continuation record found (or
> > failed to read the next page header, that is, the first record on
> > the first page in the next segment has not been replicated), it
> > becomes to happen on every page boundary until non-continuation
> > page comes.
> > 
> > I leave a debug info (at LOG level) in the attached file shown on
> > every state change of keep pointer. At least for pgbench, the
> > cost seems ignorable.
> 
> I revised it. It became neater and less invasive.
> 
>  - Removed added keep from struct WalSnd. It is never referrenced
>from other processes. It is static variable now.
> 
>  - Restore keepPtr from replication slot on starting.

keepPtr is renamed to a more meaningful name restartLSN.

>  - Moved the main part to more appropriate position.

- Removed the debug print code.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 47f9f867305934cbc5fdbd9629e61be65353fc9c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 1 Feb 2017 16:07:22 +0900
Subject: [PATCH] Fix a bug of physical replication slot.

A physical-replication standby can stop just at the boundary of WAL
segments. restart_lsn of the slot on the master can be assumed to be
the same location. The last segment on the master will be removed
after some checkpoints for the case. If the first record of the next
segment is a continuation record, it is only on the master and its
beginning is only on the standby so the standby cannot restart because
the record to read is scattered to two sources.

This patch detains restart_lsn in the last segment when the first page
of the next segment is a continuation record.
---
 src/backend/replication/walsender.c | 107 +---
 1 file changed, 100 insertions(+), 7 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 5909b7d..cfbe70e 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -188,6 +188,12 @@ static volatile sig_atomic_t replication_active = false;
 static LogicalDecodingContext *logical_decoding_ctx = NULL;
 static XLogRecPtr logical_startptr = InvalidXLogRecPtr;
 
+/*
+ * Segment keep pointer for physical slots. Has a valid value only when it
+ * differs from the current flush pointer.
+ */
+static XLogRecPtr	   keepPtr = InvalidXLogRecPtr;
+
 /* Signal handlers */
 static void WalSndSigHupHandler(SIGNAL_ARGS);
 static void WalSndXLogSendHandler(SIGNAL_ARGS);
@@ -220,7 +226,7 @@ static void WalSndPrepareWrite(LogicalDecodingContext *ctx, XLogRecPtr lsn, Tran
 static void WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static XLogRecPtr WalSndWaitForWal(XLogRecPtr loc);
 
-static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
+static bool XLogRead(char *buf, XLogRecPtr startptr, Size count, bool noutfoundok);
 
 
 /* Initialize walsender process before entering the main command loop */
@@ -541,6 +547,9 @@ StartReplication(StartReplicationCmd *cmd)
 			ereport(ERROR,
 	(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
	 (errmsg("cannot use a logical replication slot for physical replication"))));
+
+		/* Restore keepPtr from replication slot */
+		keepPtr = MyReplicationSlot->data.restart_lsn;
 	}
 
 	/*
@@ -556,6 +565,10 @@ StartReplication(StartReplicationCmd *cmd)
 	else
 		FlushPtr = GetFlushRecPtr();
 
+	/* Set InvalidXLogRecPtr if catching up */
+	if (keepPtr == FlushPtr)
+		keepPtr = InvalidXLogRecPtr;
+
 	if (cmd->timeline != 0)
 	{
 		XLogRecPtr	switchpoint;
@@ -777,7 +790,7 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 		count = flushptr - targetPagePtr;
 
 	/* now actually read the data, we know it's there */
-	XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ);
+	XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ, false);
 
 	return count;
 }
@@ -1563,7 +1576,7 @@ static void
 ProcessStandbyReplyMessage(void)
 {
 	XLogRecPtr	writePtr,
-flushPtr,
+flushPtr, oldFlushPtr,
 applyPtr;
 	bool		replyRequested;
 
@@ -1592,6 +1605,7 @@ ProcessStandbyReplyMessage(void)
 		WalSnd	   *walsnd = MyWalSnd;
 
		SpinLockAcquire(&walsnd->mutex);
+		oldFlushPtr = walsnd->flush;
 		

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-01-19 Thread Kyotaro HORIGUCHI
Hello,

At Thu, 19 Jan 2017 18:37:31 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20170119.183731.223893446.horiguchi.kyot...@lab.ntt.co.jp>
> > > - Delaying recycling a segment until the last partial record on it
> > >   completes. This seems doable in page-wise (coarse resolution)
> > >   but would cost additional reading of past xlog files (page
> > >   header of past pages is required).
> > 
> > Hm, yes. That looks like the least invasive way to go. At least that
> > looks more correct than the others.
> 
> The attached patch does that. Usually it reads page headers only
> on segment boundaries, but once continuation record found (or
> failed to read the next page header, that is, the first record on
> the first page in the next segment has not been replicated), it
> becomes to happen on every page boundary until non-continuation
> page comes.
> 
> I leave a debug info (at LOG level) in the attached file shown on
> every state change of keep pointer. At least for pgbench, the
> cost seems ignorable.

I revised it. It became neater and less invasive.

 - Removed the added keep member from struct WalSnd. It is never referenced
   from other processes; it is a static variable now.

 - Restored keepPtr from the replication slot at startup.

 - Moved the main part to a more appropriate position.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index f3082c3..0270474 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -185,6 +185,12 @@ static volatile sig_atomic_t replication_active = false;
 static LogicalDecodingContext *logical_decoding_ctx = NULL;
 static XLogRecPtr logical_startptr = InvalidXLogRecPtr;
 
+/*
+ * Segment keep pointer for physical slots. Has a valid value only when it
+ * differs from the current flush pointer.
+ */
+static XLogRecPtr	   keepPtr = InvalidXLogRecPtr;
+
 /* Signal handlers */
 static void WalSndSigHupHandler(SIGNAL_ARGS);
 static void WalSndXLogSendHandler(SIGNAL_ARGS);
@@ -217,7 +223,7 @@ static void WalSndPrepareWrite(LogicalDecodingContext *ctx, XLogRecPtr lsn, Tran
 static void WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static XLogRecPtr WalSndWaitForWal(XLogRecPtr loc);
 
-static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
+static bool XLogRead(char *buf, XLogRecPtr startptr, Size count, bool noutfoundok);
 
 
 /* Initialize walsender process before entering the main command loop */
@@ -538,6 +544,9 @@ StartReplication(StartReplicationCmd *cmd)
 			ereport(ERROR,
 	(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
	 (errmsg("cannot use a logical replication slot for physical replication"))));
+
+		/* Restore keepPtr from replication slot */
+		keepPtr = MyReplicationSlot->data.restart_lsn;
 	}
 
 	/*
@@ -553,6 +562,10 @@ StartReplication(StartReplicationCmd *cmd)
 	else
 		FlushPtr = GetFlushRecPtr();
 
+	/* Set InvalidXLogRecPtr if catching up */
+	if (keepPtr == FlushPtr)
+		keepPtr = InvalidXLogRecPtr;
+	
 	if (cmd->timeline != 0)
 	{
 		XLogRecPtr	switchpoint;
@@ -774,7 +787,7 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 		count = flushptr - targetPagePtr;
 
 	/* now actually read the data, we know it's there */
-	XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ);
+	XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ, false);
 
 	return count;
 }
@@ -1551,7 +1564,7 @@ static void
 ProcessStandbyReplyMessage(void)
 {
 	XLogRecPtr	writePtr,
-flushPtr,
+flushPtr, oldFlushPtr,
 applyPtr;
 	bool		replyRequested;
 
@@ -1580,6 +1593,7 @@ ProcessStandbyReplyMessage(void)
 		WalSnd	   *walsnd = MyWalSnd;
 
		SpinLockAcquire(&walsnd->mutex);
+		oldFlushPtr = walsnd->flush;
 		walsnd->write = writePtr;
 		walsnd->flush = flushPtr;
 		walsnd->apply = applyPtr;
@@ -1597,7 +1611,78 @@ ProcessStandbyReplyMessage(void)
 		if (SlotIsLogical(MyReplicationSlot))
 			LogicalConfirmReceivedLocation(flushPtr);
 		else
-			PhysicalConfirmReceivedLocation(flushPtr);
+		{
+			/*
+			 * On recovery, a continuation record must be available from
+			 * single WAL source. So physical replication slot should stay in
+			 * the first segment for a continuation record spanning multiple
+			 * segments. Since this doesn't look into individual record,
+			 * keepPtr may stay a bit too behind.
+			 *
+			 * Since the objective is avoiding removal of required segments,
+			 * checking every segment is enough. But once keepPtr goes behind,
+			 * check every page for quick restoration.
+			 *
+			 * keepPtr has a valid value only when it is behind flushPtr.
+			 */
+			if (oldFlushPtr != InvalidXLogRecPtr &&
+(keepPtr == InvalidXLogRecPtr ?
+ oldFlushPtr / XLOG_SEG_SIZE != flushPtr / XLOG_SEG_SIZE :
+ keepPtr / XLOG_BLCKSZ != flushPtr / XLOG_BLCKSZ))
+			{
+XLogRecPtr rp;
+XLogRecPtr oldKeepPtr 

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-01-19 Thread Kyotaro HORIGUCHI
Hello,

At Wed, 18 Jan 2017 12:34:51 +0900, Michael Paquier  
wrote in 
> On Tue, Jan 17, 2017 at 7:36 PM, Kyotaro HORIGUCHI
>  wrote:
> > I managed to reproduce this. A little tweak as the first patch
> > lets the standby to suicide as soon as walreceiver sees a
> > contrecord at the beginning of a segment.
> 
> Good idea.

Thanks. Fortunately(?), the problematic situation seems to happen
at almost every segment boundary.

> > I believe that a continuation record cannot be span over three or
> > more *segments* (is it right?), so keeping one spare segment
> > would be enough. The attached second patch does this.
> 
> I have to admit that I did not think about this problem much yet (I
> bookmarked this report weeks ago to be honest as something to look
> at), but that does not look right to me. Couldn't a record be spawned
> across even more segments? Take a random string longer than 64MB or
> event longer for example.

I haven't looked closely at how a modification is split into WAL
records, but a tuple cannot be that long. As a simple test, I
observed rechdr->xl_tot_len at the end of XLogRecordAssemble while
inserting an about-400KB not-so-compressible string into a text
column, and I saw a series of many records, each shorter than
several thousand bytes.

> > Other possible measures might be,
> >
> > - Allowing switching wal source while reading a continuation
> >   record. Currently ReadRecord assumes that a continuation record
> >   can be read from single source. But this needs refactoring
> >   involving xlog.c, xlogreader.c and relatives.
> 
> This is scary thinking about back-branches.

Yes. It would no longer be a bug fix. (Or it would become quite an ugly hack..)

> > - Delaying recycling a segment until the last partial record on it
> >   completes. This seems doable in page-wise (coarse resolution)
> >   but would cost additional reading of past xlog files (page
> >   header of past pages is required).
> 
> Hm, yes. That looks like the least invasive way to go. At least that
> looks more correct than the others.

The attached patch does that. Usually it reads page headers only
on segment boundaries, but once a continuation record is found (or
the next page header could not be read, that is, the first record on
the first page of the next segment has not been replicated yet), the
check happens on every page boundary until a non-continuation page
comes.
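
Condensed, the boundary test amounts to something like this (an
illustration with local stand-ins for the backend's types and constants,
not the patch code itself; an unset keep pointer is represented by 0):

#include <stdint.h>
#include <stdbool.h>

typedef uint64_t XLogRecPtr;
#define XLOG_SEG_SIZE	((uint64_t) 16 * 1024 * 1024)	/* default segment size */
#define XLOG_BLCKSZ		8192							/* default WAL block size */

/*
 * While the keep pointer is unset (the slot is caught up), only a segment
 * crossing of the flush pointer triggers a look at the next page header;
 * once the keep pointer lags behind, every page crossing is rechecked so
 * it can catch up as soon as the blocking record is fully replicated.
 */
static bool
need_page_header_check(XLogRecPtr keep, XLogRecPtr old_flush, XLogRecPtr new_flush)
{
	if (keep == 0)
		return old_flush / XLOG_SEG_SIZE != new_flush / XLOG_SEG_SIZE;

	return keep / XLOG_BLCKSZ != new_flush / XLOG_BLCKSZ;
}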

I left some debug info (at LOG level) in the attached file, printed on
every state change of the keep pointer. At least for pgbench, the
cost seems negligible.

> > - Delaying write/flush feedback until the current record is
> >   completed. walreceiver is not conscious of a WAL record and
> >   this might break synchronous replication.
> 
> Not sure about this one yet.

I'm not sure either :p

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index f3082c3..6b3abc5 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -217,7 +217,7 @@ static void WalSndPrepareWrite(LogicalDecodingContext *ctx, XLogRecPtr lsn, Tran
 static void WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static XLogRecPtr WalSndWaitForWal(XLogRecPtr loc);
 
-static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
+static bool XLogRead(char *buf, XLogRecPtr startptr, Size count, bool noutfoundok);
 
 
 /* Initialize walsender process before entering the main command loop */
@@ -774,7 +774,7 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 		count = flushptr - targetPagePtr;
 
 	/* now actually read the data, we know it's there */
-	XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ);
+	XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ, false);
 
 	return count;
 }
@@ -1551,8 +1551,9 @@ static void
 ProcessStandbyReplyMessage(void)
 {
 	XLogRecPtr	writePtr,
-flushPtr,
-applyPtr;
+flushPtr, oldFlushPtr,
+applyPtr,
+keepPtr;
 	bool		replyRequested;
 
 	/* the caller already consumed the msgtype byte */
@@ -1580,24 +1581,99 @@ ProcessStandbyReplyMessage(void)
 		WalSnd	   *walsnd = MyWalSnd;
 
		SpinLockAcquire(&walsnd->mutex);
+		keepPtr = walsnd->keep;
+		oldFlushPtr = walsnd->flush;
 		walsnd->write = writePtr;
 		walsnd->flush = flushPtr;
 		walsnd->apply = applyPtr;
		SpinLockRelease(&walsnd->mutex);
 	}
 
+	/*
+	 * If we are managed by a replication slot, maintain keepPtr on the page
+	 * where the first fragment of the continuation record at flushPtr. Since
+	 * this doesn't look into individual record, keepPtr may stay a bit too
+	 * behind.
+	 */
+	if (MyReplicationSlot &&
+		flushPtr != InvalidXLogRecPtr && oldFlushPtr != InvalidXLogRecPtr)
+	{
+		/*
+	 * If keepPtr is catching up, we do nothing until the next segment
+		 * comes. Otherwise check on 

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-01-17 Thread Michael Paquier
On Tue, Jan 17, 2017 at 7:36 PM, Kyotaro HORIGUCHI
 wrote:
> I managed to reproduce this. A little tweak as the first patch
> lets the standby to suicide as soon as walreceiver sees a
> contrecord at the beginning of a segment.

Good idea.

> I believe that a continuation record cannot be span over three or
> more *segments* (is it right?), so keeping one spare segment
> would be enough. The attached second patch does this.

I have to admit that I did not think about this problem much yet (I
bookmarked this report weeks ago to be honest as something to look
at), but that does not look right to me. Couldn't a record span
even more segments? Take a random string longer than 64MB or even
longer, for example.

> Other possible measures might be,
>
> - Allowing switching wal source while reading a continuation
>   record. Currently ReadRecord assumes that a continuation record
>   can be read from single source. But this needs refactoring
>   involving xlog.c, xlogreader.c and relatives.

This is scary thinking about back-branches.

> - Delaying recycling a segment until the last partial record on it
>   completes. This seems doable in page-wise (coarse resolution)
>   but would cost additional reading of past xlog files (page
>   header of past pages is required).

Hm, yes. That looks like the least invasive way to go. At least that
looks more correct than the others.

> - Delaying write/flush feedback until the current record is
>   completed. walreceiver is not conscious of a WAL record and
>   this might break synchronous replication.

Not sure about this one yet.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-01-17 Thread Kyotaro HORIGUCHI
Ouch! That is wrong.

It should decrement segno, not keep.

Jan 17, 2017 (Tue) 19:37 Kyotaro HORIGUCHI :

> Hello. I added pgsql-hackers.
>
> This occurs also on git master and back to 9.4.
>
> At Fri, 13 Jan 2017 08:47:06 -0600, Jonathon Nelson 
> wrote in <
> cacjqam1ydczcd5docp+y5hkwto1zegw+mj8uk7avqctbgjo...@mail.gmail.com>
> > On Mon, Nov 28, 2016 at 1:39 PM, Jonathon Nelson 
> wrote:
> > > First, the postgresql configuration differs only minimally from the
> stock
> > > config:
> > >
> > > Assume wal_keep_segments = 0.
> > > Assume the use of physical replication slots.
> > > Assume one master, one standby.
> > >
> > > Lastly, we have observed the behavior "in the wild" at least twice and
> in
> > > the lab a dozen or so times.
> > >
> > > EXAMPLE #1 (logs are from the replica):
> > >
> > > user=,db=,app=,client= DEBUG:  creating and filling new WAL file
> > > user=,db=,app=,client= DEBUG:  done creating and filling new WAL file
> > > user=,db=,app=,client= DEBUG:  sending write 6/8B00 flush
> 6/8A00
> > > apply 5/748425A0
> > > user=,db=,app=,client= DEBUG:  sending write 6/8B00 flush
> 6/8B00
> > > apply 5/74843020
> > > 
> > > user=,db=,app=,client= DEBUG:  postmaster received signal 2
> > > user=,db=,app=,client= LOG:  received fast shutdown request
> > > user=,db=,app=,client= LOG:  aborting any active transactions
> > >
> > > And, upon restart:
> > >
> > > user=,db=,app=,client= LOG:  restartpoint starting: xlog
> > > user=,db=,app=,client= DEBUG:  updated min recovery point to
> 6/67002390 on
> > > timeline 1
> > > user=,db=,app=,client= DEBUG:  performing replication slot checkpoint
> > > user=,db=,app=,client= DEBUG:  updated min recovery point to
> 6/671768C0 on
> > > timeline 1
> > > user=,db=,app=,client= CONTEXT:  writing block 589 of relation
> > > base/13294/16501
> > > user=,db=,app=,client= LOG:  invalid magic number  in log segment
> > > 00010006008B, offset 0
> > > user=,db=,app=,client= DEBUG:  switched WAL source from archive to
> stream
> > > after failure
> > > user=,db=,app=,client= LOG:  started streaming WAL from primary at
> > > 6/8A00 on timeline 1
> > > user=,db=,app=,client= FATAL:  could not receive data from WAL stream:
> > > ERROR:  requested WAL segment 00010006008A has already been
> > > removed
>
> I managed to reproduce this. A little tweak as the first patch
> lets the standby to suicide as soon as walreceiver sees a
> contrecord at the beginning of a segment.
>
> - M(aster): createdb as a master with wal_keep_segments = 0
> (default), min_log_messages = debug2
> - M: Create a physical repslot.
> - S(tandby): Setup a standby database.
> - S: Edit recovery.conf to use the replication slot above then
>  start it.
> - S: touch /tmp/hoge
> - M: Run pgbench ...
> - S: After a while, the standby stops.
>   > LOG:   STOP THE SERVER
>
> - M: Stop pgbench.
> - M: Do 'checkpoint;' twice.
> - S: rm /tmp/hoge
> - S: Fails to catch up with the following error.
>
>   > FATAL:  could not receive data from WAL stream: ERROR:  requested WAL
> segment 0001002B has already been removed
>
>
> This problem occurs when only the earlier parts in a continued
> record is replicated then the segment is removed on the
> master. In other words, the first half is only on standby, and
> the second half is only on the master.
>
> I believe that a continuation record cannot be span over three or
> more *segments* (is it right?), so kepping one spare segment
> would be enough. The attached second patch does this.
>
>
> Other possible measures might be,
>
> - Allowing switching wal source while reading a continuation
>   record. Currently ReadRecord assumes that a continuation record
>   can be read from single source. But this needs refactoring
>   involving xlog.c, xlogreader.c and relatives.
>
> - Delaying recycing a segment until the last partial record on it
>   completes. This seems doable in page-wise (coarse resolution)
>   but would cost additional reading of past xlog files (page
>   header of past pages is required).
>
> - Delaying write/flush feedback until the current record is
>   completed. walreceiver is not conscious of a WAL record and
>   this might break synchronous replication.
>
> Any thoughts?
>
>
> =
> > > A physical analysis shows that the WAL file 00010006008B is
> > > 100% zeroes (ASCII NUL).
>
> I suppose it is on the standby so the segment file is the one
> where the next transferred record will be written onto.
>
> > > The results of querying pg_replication_slots shows a restart_lsn that
> > > matches ….6/8B.
>
> It is the beginning of the next record to be replicatd as
> documentation. In other words, just after the last transferred
> record (containing padding).
>
> > > Pg_controldata shows values like:
> > > Minimum recovery ending location: 

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

2017-01-17 Thread Kyotaro HORIGUCHI
Hello. I added pgsql-hackers.

This occurs also on git master and back to 9.4.

At Fri, 13 Jan 2017 08:47:06 -0600, Jonathon Nelson  wrote in 

> On Mon, Nov 28, 2016 at 1:39 PM, Jonathon Nelson  wrote:
> > First, the postgresql configuration differs only minimally from the stock
> > config:
> >
> > Assume wal_keep_segments = 0.
> > Assume the use of physical replication slots.
> > Assume one master, one standby.
> >
> > Lastly, we have observed the behavior "in the wild" at least twice and in
> > the lab a dozen or so times.
> >
> > EXAMPLE #1 (logs are from the replica):
> >
> > user=,db=,app=,client= DEBUG:  creating and filling new WAL file
> > user=,db=,app=,client= DEBUG:  done creating and filling new WAL file
> > user=,db=,app=,client= DEBUG:  sending write 6/8B00 flush 6/8A00
> > apply 5/748425A0
> > user=,db=,app=,client= DEBUG:  sending write 6/8B00 flush 6/8B00
> > apply 5/74843020
> > 
> > user=,db=,app=,client= DEBUG:  postmaster received signal 2
> > user=,db=,app=,client= LOG:  received fast shutdown request
> > user=,db=,app=,client= LOG:  aborting any active transactions
> >
> > And, upon restart:
> >
> > user=,db=,app=,client= LOG:  restartpoint starting: xlog
> > user=,db=,app=,client= DEBUG:  updated min recovery point to 6/67002390 on
> > timeline 1
> > user=,db=,app=,client= DEBUG:  performing replication slot checkpoint
> > user=,db=,app=,client= DEBUG:  updated min recovery point to 6/671768C0 on
> > timeline 1
> > user=,db=,app=,client= CONTEXT:  writing block 589 of relation
> > base/13294/16501
> > user=,db=,app=,client= LOG:  invalid magic number  in log segment
> > 00010006008B, offset 0
> > user=,db=,app=,client= DEBUG:  switched WAL source from archive to stream
> > after failure
> > user=,db=,app=,client= LOG:  started streaming WAL from primary at
> > 6/8A00 on timeline 1
> > user=,db=,app=,client= FATAL:  could not receive data from WAL stream:
> > ERROR:  requested WAL segment 00010006008A has already been
> > removed

I managed to reproduce this. A little tweak, attached as the first patch,
makes the standby kill itself as soon as walreceiver sees a
contrecord at the beginning of a segment.

- M(aster): Create a database as the master with wal_keep_segments = 0
(default), log_min_messages = debug2
- M: Create a physical repslot.
- S(tandby): Setup a standby database.
- S: Edit recovery.conf to use the replication slot above then
 start it.
- S: touch /tmp/hoge
- M: Run pgbench ...
- S: After a while, the standby stops.
  > LOG:   STOP THE SERVER

- M: Stop pgbench.
- M: Do 'checkpoint;' twice.
- S: rm /tmp/hoge
- S: Fails to catch up with the following error.

  > FATAL:  could not receive data from WAL stream: ERROR:  requested WAL 
segment 0001002B has already been removed


This problem occurs when only the earlier part of a continued
record has been replicated and then the segment is removed on the
master. In other words, the first half is only on the standby, and
the second half is only on the master.

I believe that a continuation record cannot span three or
more *segments* (is that right?), so keeping one spare segment
would be enough. The attached second patch does this.
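
Roughly, the idea amounts to the following sketch (illustrative only,
with local stand-ins for the backend's types; this is not the attached
patch):

#include <stdint.h>

typedef uint64_t XLogRecPtr;
typedef uint64_t XLogSegNo;
#define XLOG_SEG_SIZE ((uint64_t) 16 * 1024 * 1024)

/*
 * When trimming old WAL for a physical slot, also keep the segment just
 * before the one containing restart_lsn, so that the first fragment of a
 * record crossing the segment boundary survives.  Note that it is the
 * segment number that gets decremented, not the LSN.
 */
static void
keep_one_spare_segment(XLogRecPtr restart_lsn, XLogSegNo *oldest_segno_to_keep)
{
	XLogSegNo	slot_segno = restart_lsn / XLOG_SEG_SIZE;

	if (slot_segno > 0 && *oldest_segno_to_keep > slot_segno - 1)
		*oldest_segno_to_keep = slot_segno - 1;
}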


Other possible measures might be,

- Allowing switching the WAL source while reading a continuation
  record. Currently ReadRecord assumes that a continuation record
  can be read from a single source. But this needs refactoring
  involving xlog.c, xlogreader.c and relatives.

- Delaying recycling a segment until the last partial record on it
  completes. This seems doable page-wise (coarse resolution)
  but would cost additional reading of past xlog files (the page
  header of past pages is required; a standalone sketch of this
  check appears below).

- Delaying write/flush feedback until the current record is
  completed. walreceiver is not conscious of a WAL record and
  this might break synchronous replication.

Any thoughts?
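
As a rough illustration of the page-header check that the recycling-delay
option needs, here is a standalone sketch; the struct mirrors
XLogPageHeaderData and the flag value matches XLP_FIRST_IS_CONTRECORD
from access/xlog_internal.h, and it assumes the file was written by a
server with the same byte order and alignment:

#include <stdint.h>
#include <stdio.h>

#define FIRST_IS_CONTRECORD 0x0001		/* value of XLP_FIRST_IS_CONTRECORD */

typedef struct
{
	uint16_t	xlp_magic;
	uint16_t	xlp_info;		/* flag bits */
	uint32_t	xlp_tli;
	uint64_t	xlp_pageaddr;
	uint32_t	xlp_rem_len;	/* bytes of a record continued from before */
} WalPageHeader;				/* mirrors XLogPageHeaderData */

int
main(int argc, char **argv)
{
	WalPageHeader hdr;
	FILE	   *f;

	if (argc != 2 || (f = fopen(argv[1], "rb")) == NULL)
	{
		fprintf(stderr, "usage: %s <wal-segment-file>\n", argv[0]);
		return 1;
	}
	if (fread(&hdr, sizeof(hdr), 1, f) != 1)
	{
		fprintf(stderr, "could not read the first page header\n");
		fclose(f);
		return 1;
	}
	if (hdr.xlp_info & FIRST_IS_CONTRECORD)
		printf("first record is a continuation (%u bytes remaining);"
			   " the previous segment is still needed\n",
			   (unsigned) hdr.xlp_rem_len);
	else
		printf("segment starts with a fresh record\n");
	fclose(f);
	return 0;
}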


=
> > A physical analysis shows that the WAL file 00010006008B is
> > 100% zeroes (ASCII NUL).

I suppose it is on the standby, so the segment file is the one
onto which the next transferred record will be written.

> > The results of querying pg_replication_slots shows a restart_lsn that
> > matches ….6/8B.

It is the beginning of the next record to be replicated, as
documented. In other words, it is just after the last transferred
record (including padding).

> > Pg_controldata shows values like:
> > Minimum recovery ending location: 6/8Axx

It is the beginning of the last applied record.

> > How can the master show a position that is greater than the minimum
> > recovery ending location?

So it is natural that the former is larger than the latter.

> > EXAMPLE #2:
> >
> > Minimum recovery ending location: 19DD/73E0
> > Log segment 000119DD0073 was not available.
> > The