Re: [bug fix] Cascaded standby cannot start after a clean shutdown

2018-06-17 Thread Michael Paquier
On Sun, Jun 17, 2018 at 07:33:01PM -0700, Andres Freund wrote:
> On 2018-06-17 22:31:02 -0400, Tom Lane wrote:
>> Yeah, for me parallelized check-world only works in >= 9.6.  My (vague)
>> recollection is that multiple fixes were needed to get to that point,
>> so I doubt it's worth trying to fix in older branches.
> 
> There were multiple fixes, that's right - I'd personally however still
> be in favor of trying to backpatch those.

It took me some time, but I have been able to find the thread which
discusses all the fixes needed to get parallel builds to work on older
branches without having to port PostgresNode.pm, which would be a much
higher level of effort:
https://www.postgresql.org/message-id/CAB7nPqQTGveO3_zvnXQAs32cMosX8c76_ewe4L1cNL=1xzm...@mail.gmail.com
Could that be considered?

> Running check-world serially for the back-branches seriously slows
> things down for me.

Me too.  That was already frustrating one year ago, and it still is.
--
Michael


signature.asc
Description: PGP signature


Re: [bug fix] Cascaded standby cannot start after a clean shutdown

2018-06-17 Thread Andres Freund
Hi,

On 2018-06-17 22:31:02 -0400, Tom Lane wrote:
> Michael Paquier  writes:
> > Trying to run regression tests in parallel in ~9.5 leads to spurious
> > failures, which is annoying...  I had a patch fixing that but I cannot
> > put my finger on the thread where this has been discussed.
> 
> Yeah, for me parallelized check-world only works in >= 9.6.  My (vague)
> recollection is that multiple fixes were needed to get to that point,
> so I doubt it's worth trying to fix in older branches.

There were multiple fixes, that's right - I'd personally however still
be in favor of trying to backpatch those. Running check-world serially
for the back-branches seriously slows things down for me.

Greetings,

Andres Freund



Re: [bug fix] Cascaded standby cannot start after a clean shutdown

2018-06-17 Thread Tom Lane
Michael Paquier  writes:
> Trying to run regression tests in parallel in ~9.5 leads to spurious
> failures, which is annoying...  I had a patch fixing that but I cannot
> put my finger on the thread where this has been discussed.

Yeah, for me parallelized check-world only works in >= 9.6.  My (vague)
recollection is that multiple fixes were needed to get to that point,
so I doubt it's worth trying to fix in older branches.

regards, tom lane



Re: [bug fix] Cascaded standby cannot start after a clean shutdown

2018-06-17 Thread Michael Paquier
On Wed, Jun 13, 2018 at 09:00:47AM +0900, Michael Paquier wrote:
> Note for everybody on this list: I will be out for a couple of days at
> the end of this week, and my intention is to finish wrapping this patch
> at the beginning of next week, with a back-patch down to 9.5 where
> palloc_extended has been introduced as this has been seen a couple of
> times on -bugs or -hackers.  Feel free of course to comment and look at
> the thread for more background on the matter.

Okay, done this way.

Trying to run regression tests in parallel in ~9.5 leads to spurious
failures, which is annoying...  I had a patch fixing that but I cannot
put my finger on the thread where this has been discussed.
--
Michael


signature.asc
Description: PGP signature


Re: [bug fix] Cascaded standby cannot start after a clean shutdown

2018-06-12 Thread Michael Paquier
On Tue, Jun 12, 2018 at 04:27:50PM +0900, Michael Paquier wrote:
> On Tue, Jun 12, 2018 at 06:30:49AM +, Tsunakawa, Takayuki wrote:
>> Thank you so much.  This version looks better.
>> 
>> + * this would cause the instance to stop suddendly with a hard failure,
>> 
>> suddendly -> suddenly
> 
> Yep.  Thanks for the extra lookup.

Note for everybody on this list: I will be out for a couple of days at
the end of this week, and my intention is to finish wrapping this patch
at the beginning of next week, with a back-patch down to 9.5 where
palloc_extended has been introduced as this has been seen a couple of
times on -bugs or -hackers.  Feel free of course to comment and look at
the thread for more background on the matter.
--
Michael


signature.asc
Description: PGP signature


Re: [bug fix] Cascaded standby cannot start after a clean shutdown

2018-06-12 Thread Michael Paquier
On Tue, Jun 12, 2018 at 06:30:49AM +, Tsunakawa, Takayuki wrote:
> Thank you so much.  This version looks better.
> 
> +  * this would cause the instance to stop suddendly with a hard failure,
> 
> suddendly -> suddenly

Yep.  Thanks for the extra lookup.
--
Michael


signature.asc
Description: PGP signature


RE: [bug fix] Cascaded standby cannot start after a clean shutdown

2018-06-12 Thread Tsunakawa, Takayuki
> From: Michael Paquier [mailto:mich...@paquier.xyz]
> As this is one of those small bug fixes for which we can do something, 
> attached
> is an updated patch with a commit description, and tutti-quanti.  At the
> end, I have moved the size check within allocate_recordbuf().  Even if the
> size calculated is not going to be higher than total_len, it feels safer
> in the long term for two reasons:

Thank you so much.  This version looks better.

+* this would cause the instance to stop suddendly with a hard failure,

suddendly -> suddenly


Regards
Takayuki Tsunakawa







Re: [bug fix] Cascaded standby cannot start after a clean shutdown

2018-06-11 Thread Michael Paquier
On Sun, Mar 18, 2018 at 08:49:01AM +0900, Michael Paquier wrote:
> On Fri, Mar 16, 2018 at 06:02:25AM +, Tsunakawa, Takayuki wrote:
>> Ouch, you're right.  If memory allocation fails, the startup process
>> would emit a LOG message and continue to fetch new WAL records.  Then,
>> I'm completely happy with your patch.
> 
> Thanks for double-checking, Tsunakawa-san.

As this is one of those small bug fixes for which we can do something,
attached is an updated patch with a commit description, and
tutti-quanti.  At the end, I have moved the size check within
allocate_recordbuf().  Even if the size calculated is not going to be
higher than total_len, it feels safer in the long term for two reasons:
- Allocation size calculation could be changed, or let's say made
smarter.
- New callers of allocate_recordbuf would not be able to see the problem
for the backend, and this could become a problem if the WAL reader
facility is extended.

Thoughts?
--
Michael
From 7688dceee5590b8430a90cfae2b98917f710e824 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Tue, 12 Jun 2018 14:52:35 +0900
Subject: [PATCH] Prevent hard failures of standbys caused by recycled WAL
 segments

When a standby stops reading WAL from a WAL stream, it writes data to
the current WAL segment without first zeroing the page written to, which
can cause the WAL reader to read junk data from a past recycled segment
and then try to get a record from it.  While the sanity checks in place
provide most of the protection needed, in some rare circumstances, with
chances increasing when a record header crosses a page boundary, the
startup process could fail violently on an allocation failure, which is
unhelpful as this would require in the worst case a manual restart of
the instance, potentially impacting the availability of the cluster.

The chances of seeing failures are higher if the connection between the
standby and its root node is flaky, causing WAL pages to be written only
partially.  A couple of approaches have been discussed, like zeroing new
WAL pages within the WAL receiver itself, but this has the disadvantage
of impacting the performance of any existing instance as it breaks the
sequential writes done by the WAL receiver.  This commit deals with the
problem with a simpler approach, which has no performance impact without
reducing the detection of the problem: if a record is found with a
length higher than 1GB for backends, then do not try any allocation and
report a soft failure.  It could be possible that the allocation call
passes and that an unnecessary amount of memory is allocated, however
follow-up checks on records would just fail, making this allocation
short-lived anyway.

This patch owes a great deal to Tsunakawa Takayuki for reporting the
failure first, and then discussing a couple of potential approaches to
the problem.

Reported-by: Tsunakawa Takayuki
Reviewed-by: Tsunakawa Takayuki
Author: Michael Paquier
Discussion: https://postgr.es/m/0A3221C70F24FB45833433255569204D1F8B57AD@G01JPEXMBYT05
---
 src/backend/access/transam/xlogreader.c | 25 +
 1 file changed, 25 insertions(+)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index a7953f323b..5929ce30af 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -25,6 +25,10 @@
 #include "common/pg_lzcompress.h"
 #include "replication/origin.h"
 
+#ifndef FRONTEND
+#include "utils/memutils.h"
+#endif
+
 static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
 
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
@@ -160,6 +164,27 @@ allocate_recordbuf(XLogReaderState *state, uint32 reclength)
 	newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
 	newSize = Max(newSize, 5 * Max(BLCKSZ, XLOG_BLCKSZ));
 
+#ifndef FRONTEND
+
+	/*
+	 * Note that in very unlucky circumstances, the random data read from a
+	 * recycled segment can cause this routine to be called with a size
+	 * causing a hard failure at allocation.  While for frontend utilities
+	 * this likely means that the end of WAL has been reached, for a standby
+	 * this would cause the instance to stop suddendly with a hard failure,
+	 * preventing it from retrying to fetch WAL from one of its sources,
+	 * which could allow it to move on with replay without a manual restart.
+	 * If the data comes from a past recycled segment and is still valid,
+	 * then the allocation would succeed but record checks are going to fail
+	 * so this would be short-lived.  If the allocation fails because of a
+	 * memory shortage, then this is not a hard failure either per the
+	 * guarantee given by MCXT_ALLOC_NO_OOM.
+	 */
+	if (!AllocSizeIsValid(newSize))
+		return false;
+
+#endif
+
 	if (state->readRecordBuf)
 		pfree(state->readRecordBuf);
 	state->readRecordBuf =
-- 
2.17.1



signature.asc
Description: PGP signature


Re: [bug fix] Cascaded standby cannot start after a clean shutdown

2018-03-17 Thread Michael Paquier
On Fri, Mar 16, 2018 at 06:02:25AM +, Tsunakawa, Takayuki wrote:
> Ouch, you're right.  If memory allocation fails, the startup process
> would emit a LOG message and continue to fetch new WAL records.  Then,
> I'm completely happy with your patch.

Thanks for double-checking, Tsunakawa-san.
--
Michael


signature.asc
Description: PGP signature


RE: [bug fix] Cascaded standby cannot start after a clean shutdown

2018-03-16 Thread Tsunakawa, Takayuki
From: Michael Paquier [mailto:mich...@paquier.xyz]
> We use palloc_extended with MCXT_ALLOC_NO_OOM in 9.5~, and malloc() further
> down, so once you remove the FATAL error caused by a record whose length
> is higher than 1GB, you discard all the hard failures, no?

Ouch, you're right.  If memory allocation fails, the startup process would emit 
a LOG message and continue to fetch new WAL records.  Then, I'm completely 
happy with your patch.

I added you as the author, and marked this as ready for committer.

Regards
Takayuki Tsunakawa





Re: [bug fix] Cascaded standby cannot start after a clean shutdown

2018-03-15 Thread Michael Paquier
On Fri, Mar 16, 2018 at 05:27:58AM +, Tsunakawa, Takayuki wrote:
> Honestly, I'm fine with either patch.  I like your simpler and cleaner
> one that has no performance impact.  But please note that the
> allocation attempt could amount to nearly 1 GB.  That can fail due to
> memory shortage, which leads to a FATAL error (ereport(ERROR) results
> in FATAL in the startup process), and causes the standby to shut down.

We use palloc_extended with MCXT_ALLOC_NO_OOM in 9.5~, and malloc()
further down, so once you remove the FATAL error caused by a record
whose length is higher than 1GB, you discard all the hard failures, no?
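
For reference, allocate_recordbuf() does roughly the following on the
backend side (sketch only, not the exact code):

	state->readRecordBuf =
		(char *) palloc_extended(newSize, MCXT_ALLOC_NO_OOM);
	if (state->readRecordBuf == NULL)
	{
		state->readRecordBufSize = 0;
		return false;	/* soft failure, the caller reports it and can
						 * retry once new WAL is available */
	}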
--
Michael


signature.asc
Description: PGP signature


RE: [bug fix] Cascaded standby cannot start after a clean shutdown

2018-03-15 Thread Tsunakawa, Takayuki
From: Michael Paquier [mailto:mich...@paquier.xyz]
> Even with that, the resulting patch is sort of ugly...  So it seems to me
> that the conclusion to this thread is that there is no clear winner, and
> that the problem is so unlikely to happen that it is not worth the performance
> impact of zeroing all the WAL pages fetched.  Still, the attached has the
> advantage of not causing the startup process to fail suddenly because of
> the allocation failure but to fail afterwards when validating the record's
> contents, which has the disadvantage of temporarily allocating up to 1GB of
> memory for some of the garbage, but that allocation is short-lived.  So
> that would switch the failure from a FATAL allocation failure taking down
> Postgres to something which will cause fetching of new WAL records to be
> tried again.
> 
> All in all, that would be a win for the case reported on this thread.

I'm sorry to be late to reply, and thanks for another patch.

Honestly, I'm fine with either patch.  I like your simpler and cleaner one that 
has no performance impact.  But please note that the allocation attempt could 
amount to nearly 1 GB.  That can fail due to memory shortage, which leads to a 
FATAL error (ereport(ERROR) results in FATAL in the startup process), and 
causes the standby to shut down.

Regards
Takayuki Tsunakawa





RE: [bug fix] Cascaded standby cannot start after a clean shutdown

2018-02-26 Thread Tsunakawa, Takayuki
From: Michael Paquier [mailto:mich...@paquier.xyz]
> By the way, while I have my mind on it.  Another strategy would be
> to just make the checks in XLogReadRecord() a bit smarter if the whole record
> header is not on the page.  If we check at least for
> AllocSizeIsValid(total_len) then this code would not fail on an
> allocation as your user reported.  Still this misses the case where a record
> size is lower than 1GB but invalid, so you would run allocate_recordbuf
> for nothing :(

That was my first thought, and I gave it up.  As you say, XLogReadRecord() 
could allocate up to 1 GB of memory for garbage.  That allocation can fail 
due to memory shortage, which prevents the recovery from proceeding.


Regards
Takayuki Tsunakawa







Re: [bug fix] Cascaded standby cannot start after a clean shutdown

2018-02-26 Thread Michael Paquier
On Mon, Feb 26, 2018 at 05:08:49PM +0900, Michael Paquier wrote:
> This was mentioned back in 2001 by the way, but it did not matter much
> for the case discussed here:
> https://www.postgresql.org/message-id/24901.995381770%40sss.pgh.pa.us
> The issue here is that the streaming case makes it easier to hit the
> problem as it opens more easily access to not-completely written WAL
> pages depending on the message frequency during replication.  At the
> same time, we are discussing a very low-probability issue.  Note
> that if the XLOG reader is bumping into this problem, then at the next
> WAL receiver wake up, recovery would begin from the beginning of the
> last segment, and if the primary has produced some more WAL then the
> standby would be able to actually avoid the random junk.  It is also
> possible to bypass the problem by manually zeroing the areas in
> question, or to actually wait for the standby to receive more WAL so
> that the garbage is overwritten automatically.  And you really need to
> be very, very unlucky to have random garbage able to bypass the header
> validation checks.

By the way, while I have my mind on it.  Another strategy would be
to just make the checks in XLogReadRecord() a bit smarter if the whole
record header is not on the page.  If we check at least for
AllocSizeIsValid(total_len) then this code would not fail on an
allocation as your user reported.  Still this misses the case where a
record size is lower than 1GB but invalid, so you would run
allocate_recordbuf for nothing :(

At least this extra check is costless, and avoids all kinds of hard
failures.
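
In XLogReadRecord() that would look roughly like the following (untested
sketch, message wording approximate):

	total_len = record->xl_tot_len;

#ifndef FRONTEND
	/* Reject lengths which could never be allocated anyway. */
	if (!AllocSizeIsValid(total_len))
	{
		report_invalid_record(state,
							  "record length %u at %X/%X is not valid",
							  total_len,
							  (uint32) (RecPtr >> 32), (uint32) RecPtr);
		goto err;
	}
#endif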
--
Michael


signature.asc
Description: PGP signature


Re: [bug fix] Cascaded standby cannot start after a clean shutdown

2018-02-26 Thread Michael Paquier
On Mon, Feb 26, 2018 at 07:25:46AM +, Tsunakawa, Takayuki wrote:
> From: Michael Paquier [mailto:mich...@paquier.xyz]
>> The WAL receiver approach also has a drawback.  If WAL is streamed at full
>> speed, then the primary sends data with a maximum of 16 WAL pages.
>> When beginning streaming with a new segment, the WAL sent stops at a
>> page boundary.  But if you stop once in the middle of a page then you need
>> to zero-fill the page until the current segment is finished streaming.  So
>> if the workload generates spiky WAL then the WAL receiver would do a lot
>> of extra lseek() calls with the patch applied, while all the writes would
>> be sequential on HEAD, so that's not good performance-wise IMO.
> 
> Does even the non-cascading standby stop in the middle of a page?  I
> thought the master always sends whole WAL blocks without stopping in
> the middle of a page.

You even have problems on normal standbys.  I have a small script which
is able to reproduce that if you want (it needs a small rewrite as it is
adapted to my test framework), which introduces a garbage set of WAL
segments on a stopped standby.  With the small monitoring patch I
mentioned upthread you can then see the XLOG reader finding garbage data
as well before validating the record header.  With any fixes on the WAL
receiver, your first patch included, the garbage read goes away, and the
XLOG reader complains about a record with an incorrect length
(invalid record length at XX/YYY: wanted 24, got 0) instead of complaints
from the header validation part.  One key point is to cleanly stop the
primary, as it forces the standby's WAL receiver to write to its WAL
segment in the middle of a page.

>> Another idea I am thinking about would be to zero-fill the segments when
>> recycled instead of being just renamed when doing InstallXLogFileSegment().
>> This would also have the advantage of making the segments ahead more
>> compressible, which is a gain for custom backups, and the WAL receiver does
>> not need any tweaks as it would write the data on a clean file.  Zero-filling
>> the segments is done only when a new segment is created (see XLogFileInit).
> 
> Yes, I was (and am) inclined to take this approach; this is easy and
> clean, but not good for performance...  I hope there's something which
> justifies this approach. 

InstallXLogFileSegment uses a plain durable_link_or_rename() to recycle
the past segment which syncs the old segment before the rename anyway,
so the I/O effort will be there, no?

This was mentioned back in 2001 by the way, but it did not matter much
for the case discussed here:
https://www.postgresql.org/message-id/24901.995381770%40sss.pgh.pa.us
The issue here is that the streaming case makes it easier to hit the
problem as it opens more easily access to not-completely written WAL
pages depending on the message frequency during replication.  At the
same time, we are discussing a very low-probability issue.  Note
that if the XLOG reader is bumping into this problem, then at the next
WAL receiver wake up, recovery would begin from the beginning of the
last segment, and if the primary has produced some more WAL then the
standby would be able to actually avoid the random junk.  It is also
possible to bypass the problem by manually zeroing the areas in
question, or to actually wait for the standby to receive more WAL so
that the garbage is overwritten automatically.  And you really need to
be very, very unlucky to have random garbage able to bypass the header
validation checks.
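
To give an idea, the sanity checks of ValidXLogRecordHeader() that such
garbage would need to pass look roughly like this (paraphrased and
simplified, not the exact code):

	if (record->xl_tot_len < SizeOfXLogRecord)
		return false;	/* total length shorter than a record header */
	if (record->xl_rmid > RM_MAX_ID)
		return false;	/* invalid resource manager ID */
	if (record->xl_prev != PrevRecPtr)
		return false;	/* xl_prev does not link back to the previous
						 * record (the check differs for random access) */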
--
Michael


signature.asc
Description: PGP signature


RE: [bug fix] Cascaded standby cannot start after a clean shutdown

2018-02-25 Thread Tsunakawa, Takayuki
From: Michael Paquier [mailto:mich...@paquier.xyz]
> I have been playing more with that this morning, and trying to tweak the
> XLOG reader so as the fetched page is zeroed where necessary does not help
> much.  XLogReaderState->EndRecPtr is updated once the last record is set
> so it is possible to use it and zero the rest of the page which would prevent
> a lookup.  But this is actually not wise for performance:
> - The data after EndRecPtr could be perfectly valid, so you could zero data
> which could be reused later.
> - It is necessary to drop the quick-exit checks in PageReadInternal.

First of all, thanks for your many experiments and ideas.

Yes, the above method doesn't work.  The reason is not only performance but 
also correctness.  After you zero-fill the remaining portion of state->readBuf 
based on state->EndRecPtr, you need to read the same page upon the request of 
the next WAL record.  And that page read brings garbage again into 
state->readBuf.  After all, the callers of ReadPageInternal() have to face the 
garbage.


> The WAL receiver approach also has a drawback.  If WAL is streamed at full
> speed, then the primary sends data with a maximum of 16 WAL pages.
> When beginning streaming with a new segment, the WAL sent stops at a page
> boundary.  But if you stop once in the middle of a page then you need
> to zero-fill the page until the current segment is finished streaming.  So
> if the workload generates spiky WAL then the WAL receiver would do a lot
> of extra lseek() calls with the patch applied, while all the writes would
> be sequential on HEAD, so that's not good performance-wise IMO.

Does even the non-cascading standby stop in the middle of a page?  I thought 
the master always sends whole WAL blocks without stopping in the middle of a page.


> Another idea I am thinking about would be to zero-fill the segments when
> recycled instead of being just renamed when doing InstallXLogFileSegment().
> This would also have the advantage of making the segments ahead more
> compressible, which is a gain for custom backups, and the WAL receiver does
> not need any tweaks as it would write the data on a clean file.  Zero-filling
> the segments is done only when a new segment is created (see XLogFileInit).

Yes, I was (and am) inclined to take this approach; this is easy and clean, but 
not good for performance...  I hope there's something which justifies this 
approach.


Regards
Takayuki Tsunakawa





Re: [bug fix] Cascaded standby cannot start after a clean shutdown

2018-02-25 Thread Michael Paquier
On Fri, Feb 23, 2018 at 11:02:19PM +0900, Michael Paquier wrote:
> Tsunakawa-san has proposed upthread to fix the problem by zeroing the
> page read in the WAL receiver.  While I agree that zeroing the page is
> the way to go, doing so in the WAL receiver does not take care of
> problems with the frontends.  I don't have the time to test that yet as
> it is late here, but based on my code lookups tweaking
> ReadPageInternal() so that the page is zeroed correctly should do it for
> all the cases.  This way, the checks happening after a page read would
> fail because of the zeroed fields consistently instead of the garbage
> accumulated.

I have been playing more with that this morning, and trying to tweak the
XLOG reader so that the fetched page is zeroed where necessary does not
help much.  XLogReaderState->EndRecPtr is updated once the last record
is set so it is possible to use it and zero the rest of the page, which
would prevent a lookup.  But this is actually not wise for performance:
- The data after EndRecPtr could be perfectly valid, so you could zero
data which could be reused later.
- It is necessary to drop the quick-exit checks in ReadPageInternal.

The WAL receiver approach also has a drawback.  If WAL is streamed at
full speed, then the primary sends data with a maximum of 16 WAL pages.
When beginning streaming with a new segment, the WAL sent stops at a page
boundary.  But if you stop once in the middle of a page then you need to
zero-fill the page until the current segment is finished streaming.  So
if the workload generates spiky WAL then the WAL receiver would do a lot
of extra lseek() calls with the patch applied, while all the writes would
be sequential on HEAD, so that's not good performance-wise IMO.

Another idea I am thinking about would be to zero-fill the segments
when recycled instead of being just renamed when doing
InstallXLogFileSegment().  This would also have the advantage of making
the segments ahead more compressible, which is a gain for custom
backups, and the WAL receiver does not need any tweaks as it would write
the data on a clean file.  Zero-filling the segments is done only when a
new segment is created (see XLogFileInit).
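
As a rough sketch, the recycling path could reuse the same kind of
zero-fill loop as XLogFileInit() uses for a fresh segment (names as in
the back-branches, error handling abridged):

	char		zbuffer[XLOG_BLCKSZ];
	uint32		nbytes;

	memset(zbuffer, 0, sizeof(zbuffer));
	for (nbytes = 0; nbytes < XLogSegSize; nbytes += sizeof(zbuffer))
	{
		errno = 0;
		if (write(fd, zbuffer, sizeof(zbuffer)) != sizeof(zbuffer))
			ereport(ERROR,
					(errcode_for_file_access(),
					 errmsg("could not write to WAL segment: %m")));
	}
	if (pg_fsync(fd) != 0)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not fsync WAL segment: %m")));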

Each approach has its own cost. Thoughts?
--
Michael


signature.asc
Description: PGP signature


Re: [bug fix] Cascaded standby cannot start after a clean shutdown

2018-02-23 Thread Michael Paquier
On Fri, Feb 23, 2018 at 11:26:31AM +0900, Michael Paquier wrote:
> Another, evil, idea that I have on top of all those things is to
> directly hexedit the WAL segment of the standby just at the limit where
> it would receive a record from the primary and insert garbage data into
> it which would make the validation tests blow up in xlogreader.c
> at record allocation.

OK, I have been playing with such methods and finally I have been able
to check the theory of Tsunakawa-san here.  At first I played with
hexedit to pollute the data after the last record received by a standby,
a record which is not at the end of a WAL page.  This can easily happen
by stopping the primary, which will most likely only send data up to the
middle of a page.  Using that I have easily faced errors like this:
LOG: invalid resource manager ID 255 at 0/75353B8
And I have been able to see that garbage could be read as the length of
a record before the validation of the header is done.  With the patch
attached you can easily see a collection of garbage:
LOG: length of 77913214 fetched for record 0/30009D8
And this happens if a standby is reading a page with the last record in
the middle of it.

At this stage, the XLOG reader is dealing with random data, so this
would most likely fail in ValidXLogRecordHeader(), which is what
happened with the invalid rmgr for example.  However, if one is really
unlucky, and the probability of facing that is really low, the random
data being read *can* pass the validation checks of the header, which
would cause a set of problems:
- In the backend, this means a failure of palloc_extended if the garbage
length read is higher than 1GB.  This causes a standby to stop immediately
while it should normally continue to poll the primary by spawning a new
WAL receiver and begin streaming from the beginning of the segment where
it needs its data (assuming that there is no archive).
- In the frontend, this can cause problems for pg_waldump and
pg_rewind.  For pg_waldump, this is not actually a big deal because a
failure means the end of the data that can be decoded.  However the
problem is more serious for pg_rewind.  At the initial phase of the
tool, pg_rewind scans the WAL segments of the target server to look for
the modified blocks since the last checkpoint before promotion up to the
end of the stream.  If at the end of the stream it finds garbage data,
then there is a risk that it allocates a couple of GBs of data, likely
ending with the process being killed by the automatic OOM killer,
preventing the rewind from completing :(

After some thought, I have also come up with a simpler way to test
the stability of the XLOG reader:
1) Create a primary/standby cluster.
2) Stop the standby.
3) Add to the standby's pg_wal a set of junk WAL segments generated with
dd if=/dev/urandom of=junk_walseg bs=16MB count=1.  Note that recycled
WAL segments are simple copies of past ones, so when a standby writes a
new segment on top of a past existing one it writes data on top of some
garbage.  So it is possible to face this problem anyway; this just makes
it show up faster.
4) Start the standby again, and leave it alive.
5) Generate on the primary enough records, worth more or less 8kB, to
fill in a complete WAL page.
6) Restart the primary cleanly, sleep roughly 5s to give the standby
time to stream the new partial page, and return to 5).

When 5) and 6) loop, you will see the monitoring log of the attached
patch complain after a couple of restarts.

Tsunakawa-san has proposed upthread to fix the problem by zeroing the
page read in the WAL receiver.  While I agree that zeroing the page is
the way to go, doing so in the WAL receiver does not take care of
problems with the frontends.  I don't have the time to test that yet as
it is late here, but based on my code lookups tweaking
ReadPageInternal() so that the page is zeroed correctly should do it for
all the cases.  This way, the checks happening after a page read would
fail because of the zeroed fields consistently instead of the garbage
accumulated.
--
Michael
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 3a86f3497e..13b25e5236 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -303,6 +303,16 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
 	total_len = record->xl_tot_len;
 
+#ifndef FRONTEND
+	/*
+	 * XXX: remove from final patch, just here to check that garbage
+	 * data can be fetched from blocks read.
+	 */
+	if (total_len > 5 * XLOG_BLCKSZ)
+		elog(LOG, "length of %u fetched for record %X/%X", total_len,
+			 (uint32) (RecPtr >> 32), (uint32) RecPtr);
+#endif
+
 	/*
 	 * If the whole record header is on this page, validate it immediately.
 	 * Otherwise do just a basic sanity check on xl_tot_len, and validate the



Re: [bug fix] Cascaded standby cannot start after a clean shutdown

2018-02-22 Thread Michael Paquier
On Thu, Feb 22, 2018 at 04:55:38PM +0900, Michael Paquier wrote:
> I am definitely ready to buy that it can be possible to have garbage
> being read into the length field which can cause allocate_recordbuf to
> fail as that's the only code path in xlogreader.c which does such an
> allocation.  Still, it seems to me that we should first try to see if
> there are strange allocation patterns that happen and see if it is
> possible to have a reproducible test case or a pattern which gives us
> confidence that we are on the right track.  One idea I have is to
> monitor those allocations like the following:
> --- a/src/backend/access/transam/xlogreader.c
> +++ b/src/backend/access/transam/xlogreader.c
> @@ -162,6 +162,10 @@ allocate_recordbuf(XLogReaderState *state, uint32 reclength)
> newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
> newSize = Max(newSize, 5 * Max(BLCKSZ, XLOG_BLCKSZ));
> 
> +#ifndef FRONTEND
> +   elog(LOG, "Allocation for xlogreader increased to %u", newSize);
> +#endif

So, I have been playing a bit more with that and defined the following
strategy to see if it is possible to create inconsistencies:
- Use a primary and a standby.
- Set up max_wal_size and min_wal_size to a minimum of 80MB so that
segment recycling takes effect more quickly.
- Create a single table with a UUID column to increase the likelihood of
random data in INSERT records and FPWs, and insert enough data to
trigger a full WAL recycling.
- Every 5 seconds, insert a set of tuples into the table; 110 to 120
tuples generate enough data for a bit more than a full WAL page.
Then restart the primary.  This normally causes the standby to catch up
with a streamed page which is not completely initialized, as it fetches
the page in the middle.

With the monitoring mentioned in the comment block above, I have let the
whole thing run for a couple of hours, but I have not been able to catch
any problems, except the usual "invalid record length at 0/XXX: wanted
24, got 0".  The allocation for recordbuf did not get higher than 40960
bytes either, which matches 5 WAL pages.

Another, evil, idea that I have on top of all those things is to
directly hexedit the WAL segment of the standby just at the limit where
it would receive a record from the primary and insert garbage data into
it which would make the validation tests blow up in xlogreader.c
at record allocation.
--
Michael


signature.asc
Description: PGP signature


Re: [bug fix] Cascaded standby cannot start after a clean shutdown

2018-02-21 Thread Michael Paquier
On Mon, Feb 19, 2018 at 03:01:15AM +, Tsunakawa, Takayuki wrote:
> From: Michael Paquier [mailto:mich...@paquier.xyz]

Sorry for my late reply.  I was looking at this problem for the last
couple of days here and there, still thinking about it.

>> It seems to me that the consolidation of the page read should happen directly
>> in xlogreader.c and not even in one of its callbacks so that no garbage data
>> is presented back to the caller using its own XLogReader.
>> I think that you need to initialize XLogReaderState->readBuf directly in
>> ReadPageInternal() before reading a page and you should be good.  With your
>> patch only one portion of things is visibly addressed; what about other
>> code paths using xlogreader.c's APIs like pg_rewind, the 2PC code and such?
> 
> ReadPageInternal() doesn't know where the end of valid WAL is, so it
> cannot determine where to do memset(0).  For example, in non-streaming
> cases, it reads the entire WAL block into readBuf, including the
> garbage.

Why couldn't it know about it?  It would be perfectly fine to feed it
the end LSN position as well, and it is an internal API of xlogreader.c.
Note that both its callers, XLogFindNextRecord and XLogReadRecord, know
about that as well.

> 	/*
> 	 * If the current segment is being streamed from master, calculate how
> 	 * much of the current page we have received already. We know the
> 	 * requested record has been received, but this is for the benefit of
> 	 * future calls, to allow quick exit at the top of this function.
> 	 */
> 	if (readSource == XLOG_FROM_STREAM)
> 	{
> 		if (((targetPagePtr) / XLOG_BLCKSZ) != (receivedUpto / XLOG_BLCKSZ))
> 			readLen = XLOG_BLCKSZ;
> 		else
> 			readLen = receivedUpto % XLogSegSize - targetPageOff;
> 	}
> 	else
> 		readLen = XLOG_BLCKSZ;
> 
> So we have to zero-fill the empty space of a WAL block before writing
> it.  Currently, the master does that in AdvanceXLInsertBuffer(),
> StartupXLOG(), XLogFileCopy().  The cascading standby receives WAL
> from the master block by block, so it doesn't suffer from the garbage.
> pg_receivexlog zero-fills a new WAL file.

I cannot completely parse this statement.  Physical replication sends up
to 16 WAL pages per message, rounded down to the last page completed.
Even a cascading standby sends WAL following this protocol, using the
last flush position as a base for the maximum amount of data sent.

>> +if (zbuffer == NULL)
>> +zbuffer = palloc0(XLOG_BLCKSZ);
>> You could just use a static buffer which is MemSet'd with 0 only if 
>> necessary.
> 
> I wanted to allocate memory only when necessary.  Currently, only the
> cascaded standby needs this.  To that end, I moved the memory
> allocation code to the right place.  Thanks for letting me notice this.

Still, it seems to me that any problems are not related to the fact that
we are using cascading standbys.  As [1] says as well, similar problems
have been reported on a standby after restarting its primary.

I am definitely ready to buy that it can be possible to have garbage
being read into the length field which can cause allocate_recordbuf to
fail as that's the only code path in xlogreader.c which does such an
allocation.  Still, it seems to me that we should first try to see if
there are strange allocation patterns that happen and see if it is
possible to have a reproducible test case or a pattern which gives us
confidence that we are on the right track.  One idea I have is to
monitor those allocations like the following:
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -162,6 +162,10 @@ allocate_recordbuf(XLogReaderState *state, uint32 reclength)
newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
newSize = Max(newSize, 5 * Max(BLCKSZ, XLOG_BLCKSZ));

+#ifndef FRONTEND
+   elog(LOG, "Allocation for xlogreader increased to %u", newSize);
+#endif

This could be upgraded to a PANIC or such if it sees a larger
allocation; as pgbench generates records with known lengths, that would
be a good fit to see if there is some garbage being read.  No need to go
up to 1GB before seeing a failure.

[1]:
https://www.postgresql.org/message-id/CAE2gYzzVZNsGn%3D-E6grO4sVQs04J02zNKQofQEO8gu8%3DqCFR%2BQ%40mail.gmail.com
--
Michael


signature.asc
Description: PGP signature


RE: [bug fix] Cascaded standby cannot start after a clean shutdown

2018-02-18 Thread Tsunakawa, Takayuki
Thank you for reviewing.

From: Michael Paquier [mailto:mich...@paquier.xyz]
> It seems to me that the consolidation of the page read should happen directly
> in xlogreader.c and not even in one of its callbacks so that no garbage data
> is presented back to the caller using its own XLogReader.
> I think that you need to initialize XLogReaderState->readBuf directly in
> ReadPageInternal() before reading a page and you should be good.  With your
> patch only one portion of things is visibly addressed; what about other
> code paths using xlogreader.c's APIs like pg_rewind, the 2PC code and such?

ReadPageInternal() doesn't know where the end of valid WAL is, so it cannot 
determine where to do memset(0).  For example, in non-streaming cases, it reads 
the entire WAL block into readBuf, including the garbage.

/*
 * If the current segment is being streamed from master, calculate how
 * much of the current page we have received already. We know the
 * requested record has been received, but this is for the benefit of
 * future calls, to allow quick exit at the top of this function.
 */
	if (readSource == XLOG_FROM_STREAM)
	{
		if (((targetPagePtr) / XLOG_BLCKSZ) != (receivedUpto / XLOG_BLCKSZ))
			readLen = XLOG_BLCKSZ;
		else
			readLen = receivedUpto % XLogSegSize - targetPageOff;
	}
	else
		readLen = XLOG_BLCKSZ;


So we have to zero-fill the empty space of a WAL block before writing it.  
Currently, the master does that in AdvanceXLInsertBuffer(), StartupXLOG(), 
XLogFileCopy().  The cascading standby receives WAL from the master block by 
block, so it doesn't suffer from the garbage.  pg_receivexlog zero-fills a new 
WAL file.
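
To illustrate, the core of the zero-filling idea boils down to something
like this before a partial page is written out (sketch only, helper name
hypothetical, assuming the buffer is allocated up to the page boundary):

	/*
	 * Zero-fill from the end of the valid data up to the next page
	 * boundary, so that stale bytes from a recycled segment cannot
	 * follow the valid records.
	 */
	static void
	zero_fill_page_tail(char *page_buf, int valid_len)
	{
		int			tail = valid_len % XLOG_BLCKSZ;

		if (tail != 0)
			memset(page_buf + valid_len, 0, XLOG_BLCKSZ - tail);
	}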

> Please note that your patch has some unnecessary garbage in two places:

Ouch, cleaned up.


> + if (zbuffer == NULL)
> + zbuffer = palloc0(XLOG_BLCKSZ);
> You could just use a static buffer which is MemSet'd with 0 only if necessary.

I wanted to allocate memory only when necessary.  Currently, only the cascaded 
standby needs this.  To that end, I moved the memory allocation code to the 
right place.  Thanks for letting me notice this.

Regards
Takayuki Tsunakawa




zerofill_walblock_on_standby_v2.patch
Description: zerofill_walblock_on_standby_v2.patch


Re: [bug fix] Cascaded standby cannot start after a clean shutdown

2018-02-16 Thread Michael Paquier
On Fri, Feb 16, 2018 at 04:19:00PM +0900, Michael Paquier wrote:
> Wait a minute here, when recycled past WAL segments would be filled with
> zeros before being used.

Please feel free to ignore this part.  I pushed the "Send" button
without seeing it, and I was thinking under which circumstances segments
get zero-filled, which is moot for this thread.
--
Michael


signature.asc
Description: PGP signature


Re: [bug fix] Cascaded standby cannot start after a clean shutdown

2018-02-15 Thread Michael Paquier
On Wed, Feb 14, 2018 at 04:37:05AM +, Tsunakawa, Takayuki wrote:
> The PostgreSQL version is 9.5.  The cluster consists of a master, a
> cascading standby (SB1), and a cascaded standby (SB2).  The WAL flows
> like this: master -> SB1 -> SB2. 
> 
> The user shut down SB2 and tried to restart it, but failed with the
> following message: 
> 
> FATAL:  invalid memory alloc request size 3075129344
> 
> The master and SB1 continued to run.

This matches a couple of recent bug reports we have seen around, like
this one:
https://www.postgresql.org/message-id/CAE2gYzzVZNsGn%3D-E6grO4sVQs04J02zNKQofQEO8gu8%3DqCFR%2BQ%40mail.gmail.com
I recalled as well this one, but in this case the user shot himself in
the foot with the failover scenario he designed:
https://www.postgresql.org/message-id/CAAc9rOyKAyzipiq7ee1%3DVbcRy2fRqV_hRujLbC0mrBkL07%3D7wQ%40mail.gmail.com

> CAUSE
> ==
> 
> total_len in the code below was about 3GB, so palloc() rejected the
> memory allocation request.

Yes, it is limited to 1GB.

> [xlogreader.c]
>   record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
>   total_len = record->xl_tot_len;
> ...
>   /*
>* Enlarge readRecordBuf as needed.
>*/
>   if (total_len > state->readRecordBufSize &&
>   !allocate_recordbuf(state, total_len))
>   {
> 
> Why was XLogRecord->xl_tot_len so big?  That's because the XLOG reader
> read the garbage portion in a WAL file, which is immediately after the
> end of valid WAL records. 
> 
> Why was there the garbage?  Because the cascading standby sends just
> the valid WAL records, not the whole WAL blocks, to the cascaded
> standby.  When the cascaded standby receives those WAL records and
> write them in a recycled WAL file, the last WAL block in the file
> contains the mix of valid WAL records and the garbage left over.

Wait a minute here, when recycled past WAL segments would be filled with
zeros before being used.

> Why did the XLOG reader try to allocate memory for garbage?  Doesn't
> it stop reading the WAL?  As the following code shows, when the WAL
> record header crosses a WAL block boundary, the XLOG reader first
> allocates memory for the whole record, and then checks the validity of
> the record header as soon as it reads the entire header.
>
> [...]
> 
> FIX
> ==
> 
> One idea is to defer the memory allocation until the entire record
> header is read and ValidXLogRecordHeader() is called.  However,
> ValidXLogRecordHeader() might misjudge the validity if the garbage
> contains an xl_prev value that seems correct.

It seems to me that the consolidation of the page read should happen
directly in xlogreader.c and not even in one of its callbacks so that no
garbage data is presented back to the caller using its own XLogReader.
I think that you need to initialize XLogReaderState->readBuf directly in
ReadPageInternal() before reading a page and you should be good.  With
your patch only one portion of things is visibly addressed; what about
other code paths using xlogreader.c's APIs like pg_rewind, the 2PC code
and such?
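
Concretely, I am thinking about something like this at the top of the
page read in ReadPageInternal(), roughly (untested sketch):

	/*
	 * Start from a clean page buffer, so that no stale bytes from a
	 * previous, larger read can survive a short read.
	 */
	MemSet(state->readBuf, 0, XLOG_BLCKSZ);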

> FYI, the following unsolved problem may be solved, too.
> 
> https://www.postgresql.org/message-id/CAE2gYzzVZNsGn%3D-E6grO4sVQs04J02zNKQofQEO8gu8%3DqCFR%2BQ%40mail.gmail.com

Yeah, I noticed this one too before going through your message in
detail ;)

Please note that your patch has some unnecessary garbage in two places:
- * Portions Copyright (c) 2010-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2010-2017, PostgreSQL Global Development Group
[...]
-* Only superusers and members of pg_read_all_stats can see details.
-* Other users only get the pid value
+* Only superusers can see details. Other users only get the pid valu

You may want to indent your patch as well before sending it.

+   if (zbuffer == NULL)
+   zbuffer = palloc0(XLOG_BLCKSZ);
You could just use a static buffer which is MemSet'd with 0 only if
necessary.
--
Michael


signature.asc
Description: PGP signature