subject:"RE\: prevent immature WAL streaming"

Re: prevent immature WAL streaming

2022-03-23 Thread Alvaro Herrera

On 2021-Sep-03, Alvaro Herrera wrote:

> The last commit is something I noticed in pg_rewind ...

I had missed this one; it's pushed now.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"I can see support will not be a problem.  10 out of 10."(Simon Wittber)
  (http://archives.postgresql.org/pgsql-general/2004-12/msg00159.php)

Re: prevent immature WAL streaming

2021-11-26 Thread Alvaro Herrera

On 2021-Nov-26, Amul Sul wrote:

> On Fri, Nov 26, 2021 at 1:42 AM Tom Lane  wrote:
> >
> > Meh ... but given the simplicity of the write-side fix, maybe changing
> > it is appropriate.

Actually, fixing the other side is equally simple, and it is also more
correct.  What changed my mind is that upon completing a successful read
of a record, what we set as state->ReadRecPtr is the local variable
RecPtr -- so that is what becomes the true LSN of the record.  Using
state->currRecPtr is inconsistent with that definition.

> Unless I am missing something, I am not sure why need this adjustment
> if we are going to use state->currRecPtr value which doesn't seem to
> be changing at all. AFAICU, state->currRecPtr will be unchanged value
> whether going to set overwrittenRecPtr or abortedRecPtr. Do primary
> and standby see state->currRecPtr differently, I guess not, never?

You're right for the wrong reason.  We don't need the adjustment in the
verify routine.  The reason we don't is that we're not going to use
state->currRecPtr anymore, but rather RecPtr in both places.  You're
thinking that primary and standby would never "see state->currRecPtr
differently", but that's only if they are both running the same code.
If you had a primary running 14.1 and a standby running 14.2, with the
previously proposed fix (using state->currRecPtr), you would be in
trouble.  With this fix (using RecPtr) it works fine.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/

Re: prevent immature WAL streaming

2021-11-25 Thread Amul Sul

On Fri, Nov 26, 2021 at 1:42 AM Tom Lane  wrote:
>
> Alvaro Herrera  writes:
> > On 2021-Nov-25, Tom Lane wrote:
> >> Really?  AFAICS the WAL record contains the correct value, or at least
> >> we should define that one as being correct, for precisely this reason.
>
> > I don't know what is the correct value for a record that comes exactly
> > after the page header.  But here's a patch that fixes the problem; and
> > if a standby replays WAL written by an unpatched primary, it will be
> > able to read past instead of dying of FATAL.
>
> Meh ... but given the simplicity of the write-side fix, maybe changing
> it is appropriate.
>
> However, this seems too forgiving:
>
> +if (xlrec->overwritten_lsn != state->overwrittenRecPtr &&
> +xlrec->overwritten_lsn - SizeOfXLogShortPHD != 
> state->overwrittenRecPtr &&
> +xlrec->overwritten_lsn - SizeOfXLogLongPHD != 
> state->overwrittenRecPtr)
>

Unless I am missing something, I am not sure why need this adjustment
if we are going to use state->currRecPtr value which doesn't seem to
be changing at all. AFAICU, state->currRecPtr will be unchanged value
whether going to set overwrittenRecPtr or abortedRecPtr. Do primary
and standby see state->currRecPtr differently, I guess not, never?

Regards,
Amul

Re: prevent immature WAL streaming

2021-11-25 Thread Tom Lane

I wrote:
> However, this seems too forgiving:

... also, I don't know if you intended this already, but the
VerifyOverwriteContrecord change should only be applied in
back branches.  There's no need for it in HEAD.

regards, tom lane

Re: prevent immature WAL streaming

2021-11-25 Thread Tom Lane

Alvaro Herrera  writes:
> On 2021-Nov-25, Tom Lane wrote:
>> Really?  AFAICS the WAL record contains the correct value, or at least
>> we should define that one as being correct, for precisely this reason.

> I don't know what is the correct value for a record that comes exactly
> after the page header.  But here's a patch that fixes the problem; and
> if a standby replays WAL written by an unpatched primary, it will be
> able to read past instead of dying of FATAL.

Meh ... but given the simplicity of the write-side fix, maybe changing
it is appropriate.

However, this seems too forgiving:

+if (xlrec->overwritten_lsn != state->overwrittenRecPtr &&
+xlrec->overwritten_lsn - SizeOfXLogShortPHD != 
state->overwrittenRecPtr &&
+xlrec->overwritten_lsn - SizeOfXLogLongPHD != state->overwrittenRecPtr)

The latter two cases should only be accepted if overwrittenRecPtr is
exactly at a page boundary.

regards, tom lane

Re: prevent immature WAL streaming

2021-11-25 Thread Alvaro Herrera

On 2021-Nov-25, Tom Lane wrote:

> Alvaro Herrera  writes:
> 
> > The problem is that the bug occurs while writing the WAL record.  Fixed
> > servers won't produce such records, but if you run an unpatched server
> > and it happens to write one, without a mitigation you cannot get away
> > from FATAL during replay.
> 
> Really?  AFAICS the WAL record contains the correct value, or at least
> we should define that one as being correct, for precisely this reason.

I don't know what is the correct value for a record that comes exactly
after the page header.  But here's a patch that fixes the problem; and
if a standby replays WAL written by an unpatched primary, it will be
able to read past instead of dying of FATAL.

I originally wrote this to have a WARNING in VerifyOverwriteContrecord
(in the cases that are new), with the idea that it'd prompt people to
upgrade, but that's probably a waste of time.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
"El hombre nunca sabe de lo que es capaz hasta que lo intenta" (C. Dickens)
>From c28d96dfeb0884dff58b989fc0b869ee0f7806e5 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Thu, 25 Nov 2021 16:19:42 -0300
Subject: [PATCH] Fix LSN in OVERWRITTEN_CONTRECORD

If the record that was broken starts exactly at a page boundary, we
write the byte position at which the record data appears (skipping the
WAL page header), which isn't the actual LSN of the record.  Repair.

To avoid problems on production systems which might use the bogus code,
change the test so that LSNs at page boundaries are also considered
equal to the byte position for the record data.  This means that if an
unpatched server produces a bogus OVERWRITTEN_CONTRECORD, WAL replay can
continue after upgrading to a server containing this patch.
---
 src/backend/access/transam/xlog.c   | 9 -
 src/backend/access/transam/xlogreader.c | 2 +-
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5a8851a602..1f1ae23526 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -10706,7 +10706,14 @@ xlog_redo(XLogReaderState *record)
 static void
 VerifyOverwriteContrecord(xl_overwrite_contrecord *xlrec, XLogReaderState *state)
 {
-	if (xlrec->overwritten_lsn != state->overwrittenRecPtr)
+	/*
+	 * 14.1 and sibling server versions wrote XLOG_OVERWRITE_CONTRECORD with a
+	 * bogus recptr. See https://postgr.es/m/45597.1637694...@sss.pgh.pa.us for
+	 * details.
+	 */
+	if (xlrec->overwritten_lsn != state->overwrittenRecPtr &&
+		xlrec->overwritten_lsn - SizeOfXLogShortPHD != state->overwrittenRecPtr &&
+		xlrec->overwritten_lsn - SizeOfXLogLongPHD != state->overwrittenRecPtr)
 		elog(FATAL, "mismatching overwritten LSN %X/%X -> %X/%X",
 			 LSN_FORMAT_ARGS(xlrec->overwritten_lsn),
 			 LSN_FORMAT_ARGS(state->overwrittenRecPtr));
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index f39f8044a9..4772cd1664 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -586,7 +586,7 @@ err:
 		 * in turn signal downstream WAL consumers that the broken WAL record
 		 * is to be ignored.
 		 */
-		state->abortedRecPtr = RecPtr;
+		state->abortedRecPtr = state->currRecPtr;
 		state->missingContrecPtr = targetPagePtr;
 	}
 
-- 
2.30.2

Re: prevent immature WAL streaming

2021-11-25 Thread Tom Lane

Alvaro Herrera  writes:
> On 2021-Nov-25, Tom Lane wrote:
>> Uh, why?  The fix should remove the problem, and if it doesn't, we're
>> still looking at inconsistent WAL aren't we?

> The problem is that the bug occurs while writing the WAL record.  Fixed
> servers won't produce such records, but if you run an unpatched server
> and it happens to write one, without a mitigation you cannot get away
> from FATAL during replay.

Really?  AFAICS the WAL record contains the correct value, or at least
we should define that one as being correct, for precisely this reason.

regards, tom lane

Re: prevent immature WAL streaming

2021-11-25 Thread Alvaro Herrera

On 2021-Nov-25, Tom Lane wrote:

> Alvaro Herrera  writes:
> > Oh, but also I think I should push a mitigation in case a production
> > system hits this problem: maybe reduce the message from FATAL to WARNING
> > if the registered LSN is at a page boundary.
> 
> Uh, why?  The fix should remove the problem, and if it doesn't, we're
> still looking at inconsistent WAL aren't we?

The problem is that the bug occurs while writing the WAL record.  Fixed
servers won't produce such records, but if you run an unpatched server
and it happens to write one, without a mitigation you cannot get away
from FATAL during replay.

Since this bug exists in released minors, we should allow people to
upgrade to a newer version if they hit it.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
"Nadie está tan esclavizado como el que se cree libre no siéndolo" (Goethe)

Re: prevent immature WAL streaming

2021-11-25 Thread Tom Lane

Alvaro Herrera  writes:
> Oh, but also I think I should push a mitigation in case a production
> system hits this problem: maybe reduce the message from FATAL to WARNING
> if the registered LSN is at a page boundary.

Uh, why?  The fix should remove the problem, and if it doesn't, we're
still looking at inconsistent WAL aren't we?

regards, tom lane

Re: prevent immature WAL streaming

2021-11-25 Thread Alvaro Herrera

Oh, but also I think I should push a mitigation in case a production
system hits this problem: maybe reduce the message from FATAL to WARNING
if the registered LSN is at a page boundary.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
"Entristecido, Wutra (canción de Las Barreras)
echa a Freyr a rodar
y a nosotros al mar"

Re: prevent immature WAL streaming

2021-11-25 Thread Alvaro Herrera

On 2021-Nov-25, Amul Sul wrote:

> In XLogReadRecord(), both the variables being compared have
> inconsistency in the assignment -- one gets assigned from
> state->currRecPtr where other is from RecPtr.
> 
> .
> state->overwrittenRecPtr = state->currRecPtr;
> .
> state->abortedRecPtr = RecPtr;
> .
> 
> Before the place where assembled flag sets, there is a bunch of code
> that adjusts RecPtr. I think instead of  RecPtr, the latter assignment
> should use state->currRecPtr as well.

You're exactly right.  I managed to reproduce the problem shown by
buildfarm members, and indeed this fixes it.  And it makes sense: the
adjustment you refer to, is precisely to skip the page header when the
LSN is the start of the page, which is exactly the problem we're seeing
in the buildfarm ... except that on lapwing branch REL_11_STABLE, we're
seeing the LSN is off by 0x14 instead of 0x18.  That seems very strange.
I think the reason for this is that lapwing has MAXALIGN 4, so
MAXALIGN(sizeof(XLogPageHeaderData)) is 20, not 24 as is the case in the
other failing members.

... checks buildfarm ...

Yeah, all the others in Tom's list are x86-64.

I'm pushing the fix in a minute.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"¿Qué importan los años?  Lo que realmente importa es comprobar que
a fin de cuentas la mejor edad de la vida es estar vivo"  (Mafalda)

Re: prevent immature WAL streaming

2021-11-24 Thread Amul Sul

On Wed, Nov 24, 2021 at 2:10 AM Alvaro Herrera  wrote:
>
> On 2021-Nov-23, Tom Lane wrote:
>
> > We're *still* not out of the woods with 026_overwrite_contrecord.pl,
> > as we are continuing to see occasional "mismatching overwritten LSN"
> > failures, further down in the test where it tries to start up the
> > standby:
>
> Augh.
>
> > Looking at adjacent successful runs, it seems that the exact point
> > where the "missing contrecord" starts varies substantially, even after
> > our previous fix to disable autovacuum in this test.  How could that be?
>
> Well, there is intentionally some variability.  Maybe not as much as one
> would wish, but I expect that that should explain why that point is not
> always the same.
>
> > It's probably for the best though, because I think this is exposing
> > an actual bug that we would not have seen if the start point were
> > completely consistent.  I have not dug into the code, but it looks to
> > me like if the "consistent recovery state" is reached exactly at a
> > page boundary (0/1FFE000 in all these cases), then the standby expects
> > that to be what the OVERWRITE_CONTRECORD record will point at.  But
> > actually it points to the first WAL record on that page, resulting
> > in a bogus failure.
>
> So what is happening is that we set state->overwrittenRecPtr to the LSN
> of page start, ignoring the page header.  Is that the LSN of the first
> record in a page?  I'll see if I can reproduce the problem.
>

In XLogReadRecord(), both the variables being compared have
inconsistency in the assignment -- one gets assigned from
state->currRecPtr where other is from RecPtr.

.
state->overwrittenRecPtr = state->currRecPtr;
.
state->abortedRecPtr = RecPtr;
.

Before the place where assembled flag sets, there is a bunch of code
that adjusts RecPtr. I think instead of  RecPtr, the latter assignment
should use state->currRecPtr as well.

Regards,
Amul

Re: prevent immature WAL streaming

2021-11-23 Thread Alvaro Herrera

On 2021-Nov-23, Tom Lane wrote:

> We're *still* not out of the woods with 026_overwrite_contrecord.pl,
> as we are continuing to see occasional "mismatching overwritten LSN"
> failures, further down in the test where it tries to start up the
> standby:

Augh.

> Looking at adjacent successful runs, it seems that the exact point
> where the "missing contrecord" starts varies substantially, even after
> our previous fix to disable autovacuum in this test.  How could that be?

Well, there is intentionally some variability.  Maybe not as much as one
would wish, but I expect that that should explain why that point is not
always the same.

> It's probably for the best though, because I think this is exposing
> an actual bug that we would not have seen if the start point were
> completely consistent.  I have not dug into the code, but it looks to
> me like if the "consistent recovery state" is reached exactly at a
> page boundary (0/1FFE000 in all these cases), then the standby expects
> that to be what the OVERWRITE_CONTRECORD record will point at.  But
> actually it points to the first WAL record on that page, resulting
> in a bogus failure.

So what is happening is that we set state->overwrittenRecPtr to the LSN
of page start, ignoring the page header.  Is that the LSN of the first
record in a page?  I'll see if I can reproduce the problem.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"La persona que no quería pecar / estaba obligada a sentarse
 en duras y empinadas sillas/ desprovistas, por cierto
 de blandos atenuantes"  (Patricio Vogel)

Re: prevent immature WAL streaming

2021-11-23 Thread Tom Lane

We're *still* not out of the woods with 026_overwrite_contrecord.pl,
as we are continuing to see occasional "mismatching overwritten LSN"
failures, further down in the test where it tries to start up the
standby:

  sysname   |branch |  snapshot   | stage | 
l   
   
+---+-+---+
 spurfowl   | REL_13_STABLE | 2021-10-18 03:56:26 | recoveryCheck | 2021-10-18 
00:08:09.324 EDT [2455:6] FATAL:  mismatching overwritten LSN 0/1FFE018 -> 
0/1FFE000
 sidewinder | HEAD  | 2021-10-19 04:32:36 | recoveryCheck | 2021-10-19 
06:46:23.168 CEST [26393:6] FATAL:  mismatching overwritten LSN 0/1FFE018 -> 
0/1FFE000
 francolin  | REL9_6_STABLE | 2021-10-26 01:41:39 | recoveryCheck | 2021-10-26 
01:48:05.646 UTC [3417202][][1/0:0] FATAL:  mismatching overwritten LSN 
0/1FFE018 -> 0/1FFE000
 petalura   | HEAD  | 2021-11-05 00:20:03 | recoveryCheck | 2021-11-05 
02:58:12.146 CET [61848fb3.28d157:6] FATAL:  mismatching overwritten LSN 
0/1FFE018 -> 0/1FFE000
 lapwing| REL_11_STABLE | 2021-11-05 17:24:49 | recoveryCheck | 2021-11-05 
17:39:29.741 UTC [9831:6] FATAL:  mismatching overwritten LSN 0/1FFE014 -> 
0/1FFE000
 morepork   | HEAD  | 2021-11-10 02:51:12 | recoveryCheck | 2021-11-10 
04:03:33.576 CET [73561:6] FATAL:  mismatching overwritten LSN 0/1FFE018 -> 
0/1FFE000
 petalura   | HEAD  | 2021-11-16 15:20:03 | recoveryCheck | 2021-11-16 
18:16:47.875 CET [6193e77f.35b87f:6] FATAL:  mismatching overwritten LSN 
0/1FFE018 -> 0/1FFE000
 morepork   | HEAD  | 2021-11-17 03:45:36 | recoveryCheck | 2021-11-17 
04:57:04.359 CET [32089:6] FATAL:  mismatching overwritten LSN 0/1FFE018 -> 
0/1FFE000
 spurfowl   | REL_10_STABLE | 2021-11-22 22:21:03 | recoveryCheck | 2021-11-22 
17:29:35.520 EST [16011:6] FATAL:  mismatching overwritten LSN 0/1FFE018 -> 
0/1FFE000
(9 rows)

Looking at adjacent successful runs, it seems that the exact point
where the "missing contrecord" starts varies substantially, even after
our previous fix to disable autovacuum in this test.  How could that be?

It's probably for the best though, because I think this is exposing
an actual bug that we would not have seen if the start point were
completely consistent.  I have not dug into the code, but it looks to
me like if the "consistent recovery state" is reached exactly at a
page boundary (0/1FFE000 in all these cases), then the standby expects
that to be what the OVERWRITE_CONTRECORD record will point at.  But
actually it points to the first WAL record on that page, resulting
in a bogus failure.

regards, tom lane

Re: prevent immature WAL streaming

2021-11-10 Thread Dagfinn Ilmari Mannsåker

Tom Lane  writes:

> Also, I think we want
>
> -ok($initfile != $endfile, "$initfile differs from $endfile");
> +ok($initfile ne $endfile, "$initfile differs from $endfile");
>
> The existing coding works as long as all characters of these
> WAL segment names happen to be decimal digits, but ...

Even better style (IMO) would be:

  isnt($initfile, $endfile, "WAL file name has changed");

Or some other more descriptive message of _why_ it should have changed.

- ilmari

Re: prevent immature WAL streaming

2021-11-10 Thread Alvaro Herrera

On 2021-Nov-09, Tom Lane wrote:

> This is still happening off and on, which makes it look like a
> timing-sensitive problem.  Confirming that, I can make it fail
> every time by adding a long sleep just ahead of where
> 026_overwrite_contrecord.pl captures $initfile.  On reflection
> I think the problem is obvious: if autovacuum does anything
> concurrently with the test's startup portion, it will cause the
> carefully-established WAL insertion point to move into the
> next segment.  I propose to add "autovacuum = off" to the
> test's postmaster configuration.

Ooh, of course.

> Also, I think we want
> 
> -ok($initfile != $endfile, "$initfile differs from $endfile");
> +ok($initfile ne $endfile, "$initfile differs from $endfile");
> 
> The existing coding works as long as all characters of these
> WAL segment names happen to be decimal digits, but ...

Argh!

Thanks for taking care of these issues.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"La conclusión que podemos sacar de esos estudios es que
no podemos sacar ninguna conclusión de ellos" (Tanenbaum)

Re: prevent immature WAL streaming

2021-11-09 Thread Tom Lane

I wrote:
> Seems like this hasn't fixed the problem: skink still fails on
> this test occasionally.
> #   Failed test '00010002 differs from 
> 00010002'
> #   at t/026_overwrite_contrecord.pl line 61.

This is still happening off and on, which makes it look like a
timing-sensitive problem.  Confirming that, I can make it fail
every time by adding a long sleep just ahead of where
026_overwrite_contrecord.pl captures $initfile.  On reflection
I think the problem is obvious: if autovacuum does anything
concurrently with the test's startup portion, it will cause the
carefully-established WAL insertion point to move into the
next segment.  I propose to add "autovacuum = off" to the
test's postmaster configuration.

Also, I think we want

-ok($initfile != $endfile, "$initfile differs from $endfile");
+ok($initfile ne $endfile, "$initfile differs from $endfile");

The existing coding works as long as all characters of these
WAL segment names happen to be decimal digits, but ...

regards, tom lane

Re: prevent immature WAL streaming

2021-11-08 Thread Alvaro Herrera

On 2021-Nov-08, Alexander Lakhin wrote:

> Hello Alvaro,
> 14.10.2021 01:09, Alvaro Herrera wrote:
> >> Yea, let's go for your patch then. I've verified that at least locally it
> >> passes under valgrind.
> > Ah great, thanks.  Pushed then.
> >
> While translating messages I've noticed that the version of the patch
> ported to REL9_6_STABLE..REL_13_STABLE contains a typo "sucessfully".
> Please consider committing the fix.

Thanks, pushed.  I also modified the .po files that contained the typo
so as not to waste the translators' efforts.

I blamed the wrong commit in the 9.6 commit message.  Sigh.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
Syntax error: function hell() needs an argument.
Please choose what hell you want to involve.

Re: prevent immature WAL streaming

2021-11-07 Thread Alexander Lakhin

Hello Alvaro,
14.10.2021 01:09, Alvaro Herrera wrote:
>> Yea, let's go for your patch then. I've verified that at least locally it
>> passes under valgrind.
> Ah great, thanks.  Pushed then.
>
While translating messages I've noticed that the version of the patch
ported to REL9_6_STABLE..REL_13_STABLE contains a typo "sucessfully".
Please consider committing the fix.

Best regards,
Alexander
diff --git a/src/backend/access/transam/xlog.c 
b/src/backend/access/transam/xlog.c
index 6d37b82330d..3d76fad1282 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -10383,7 +10383,7 @@ VerifyOverwriteContrecord(xl_overwrite_contrecord 
*xlrec, XLogReaderState *state
 (uint32) state->overwrittenRecPtr);
 
ereport(LOG,
-   (errmsg("sucessfully skipped missing contrecord at 
%X/%X, overwritten at %s",
+   (errmsg("successfully skipped missing contrecord at 
%X/%X, overwritten at %s",
(uint32) (xlrec->overwritten_lsn >> 32),
(uint32) xlrec->overwritten_lsn,

timestamptz_to_str(xlrec->overwrite_time;
diff --git a/src/test/recovery/t/026_overwrite_contrecord.pl 
b/src/test/recovery/t/026_overwrite_contrecord.pl
index c35e3717276..4fac44e2617 100644
--- a/src/test/recovery/t/026_overwrite_contrecord.pl
+++ b/src/test/recovery/t/026_overwrite_contrecord.pl
@@ -94,7 +94,7 @@ ok($node_standby->safe_psql('postgres', 'select * from foo') 
eq 'hello',
 my $log = slurp_file($node_standby->logfile);
 like(
$log,
-   qr[sucessfully skipped missing contrecord at],
+   qr[successfully skipped missing contrecord at],
"found log line in standby");
 
 $node->stop;

Re: prevent immature WAL streaming

2021-11-02 Thread Tom Lane

Alvaro Herrera  writes:
> On 2021-Oct-13, Andres Freund wrote:
>> Yea, let's go for your patch then. I've verified that at least locally it
>> passes under valgrind.

> Ah great, thanks.  Pushed then.

Seems like this hasn't fixed the problem: skink still fails on
this test occasionally.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink=2021-10-22%2013%3A52%3A00
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink=2021-11-02%2000%3A28%3A30

Both of these look like

#   Failed test '00010002 differs from 00010002'
#   at t/026_overwrite_contrecord.pl line 61.
# Looks like you failed 1 test of 3.
t/026_overwrite_contrecord.pl  
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/3 subtests 

which looks like the same thing we were seeing before.
010e52337 seems to have just decreased the probability of failure.

regards, tom lane

Re: prevent immature WAL streaming

2021-10-25 Thread Kyotaro Horiguchi

At Mon, 25 Oct 2021 10:34:27 +0530, Amul Sul  wrote in 
> On Mon, Oct 25, 2021 at 7:02 AM Kyotaro Horiguchi
>  wrote:
> >
> > At Fri, 22 Oct 2021 18:43:52 +0530, Amul Sul  wrote in
> > > Any thoughts about the patch posted previously?
> >
> > Honestly, xlogreader looks fine with the current shape. The reason is
> > that it seems cleaner as an interface boundary since the caller of
> > xlogreader doesn't need to know about the details of xlogreader. The
> > current code nicely hides the end+1 confusion.
> >
> > Even if we want to get rid of global variables in xlog.c, I don't
> > understand why we remove only abortedRecPtr. That change makes things
> > more complex as a whole by letting xlog.c be more conscious of
> > xlogreader's internals.  I'm not sure I like that aspect of the patch.
> >
> 
> Because we have other ways to get abortedRecPtr without having a
> global variable, but we don't have such a way for missingContrecPtr,
> AFAICU.

That depends on the reason why you want to get rid of the glboal
variables. Since we restart WAL reading before reading the two
variables so we can not rely on the xlogreader's corresponding
members. So we need another set of variables to preserve the values
beyond the restart.

> I agree using global variables makes things a bit easier, but those
> are inefficient when you want to share those with other processes --
> that would add extra burden to shared memory.

We could simply add a new member in XLogCtlData. Or we can create
another struct for ReadRecord's (not XLogReader's) state then allocate
shared memory to it.  I don't think it is the right solution to infer
it from another variable using knowledge of xlogreader's internals.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: prevent immature WAL streaming

2021-10-24 Thread Amul Sul

On Mon, Oct 25, 2021 at 7:02 AM Kyotaro Horiguchi
 wrote:
>
> At Fri, 22 Oct 2021 18:43:52 +0530, Amul Sul  wrote in
> > Any thoughts about the patch posted previously?
>
> Honestly, xlogreader looks fine with the current shape. The reason is
> that it seems cleaner as an interface boundary since the caller of
> xlogreader doesn't need to know about the details of xlogreader. The
> current code nicely hides the end+1 confusion.
>
> Even if we want to get rid of global variables in xlog.c, I don't
> understand why we remove only abortedRecPtr. That change makes things
> more complex as a whole by letting xlog.c be more conscious of
> xlogreader's internals.  I'm not sure I like that aspect of the patch.
>

Because we have other ways to get abortedRecPtr without having a
global variable, but we don't have such a way for missingContrecPtr,
AFAICU.

I agree using global variables makes things a bit easier, but those
are inefficient when you want to share those with other processes --
that would add extra burden to shared memory.

Regards,
Amul

Re: prevent immature WAL streaming

2021-10-24 Thread Kyotaro Horiguchi

At Fri, 22 Oct 2021 18:43:52 +0530, Amul Sul  wrote in 
> Any thoughts about the patch posted previously?

Honestly, xlogreader looks fine with the current shape. The reason is
that it seems cleaner as an interface boundary since the caller of
xlogreader doesn't need to know about the details of xlogreader. The
current code nicely hides the end+1 confusion.

Even if we want to get rid of global variables in xlog.c, I don't
understand why we remove only abortedRecPtr. That change makes things
more complex as a whole by letting xlog.c be more conscious of
xlogreader's internals.  I'm not sure I like that aspect of the patch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: prevent immature WAL streaming

2021-10-22 Thread Amul Sul

On Thu, Oct 14, 2021 at 6:14 PM Amul Sul  wrote:
>
> On Wed, Oct 13, 2021 at 10:58 PM Alvaro Herrera  
> wrote:
> >
> > On 2021-Oct-13, Amul Sul wrote:
> >
> > > I have one more question, regarding the need for other global
> > > variables i.e. abortedRecPtr.  (Sorry for coming back after so long.)
> > >
> > > Instead of abortedRecPtr point, isn't enough to write
> > > overwrite-contrecord at XLogCtl->lastReplayedEndRecPtr?  I think both
> > > are pointing to the same location then can't we use
> > > lastReplayedEndRecPtr instead of abortedRecPtr to write
> > > overwrite-contrecord and remove need of extra global variable, like
> > > attached?
> >
> > I'm a bit fuzzy on the difference "the end+1" and "the start of the next
> > record".  Are they always the same?  We do have XLogRecPtrToBytePos()
> > and XLogBytePosToEndRecPtr() to convert unadorned XLogRecPtr values to
> > "usable byte positions", which suggests to me that the proposed patch
> > may fail if end+1 is a page or segment boundary.
> >
>
> Yes, you are correct, that could be a possible failure.
>
> How about calculating that from the lastReplayedEndRecPtr by
> converting it first to "usable byte positions" and then recalculating
> the record pointer from that, like attached?
>

Any thoughts about the patch posted previously?

Regards,
Amul

Re: prevent immature WAL streaming

2021-10-14 Thread Amul Sul

On Wed, Oct 13, 2021 at 10:58 PM Alvaro Herrera  wrote:
>
> On 2021-Oct-13, Amul Sul wrote:
>
> > I have one more question, regarding the need for other global
> > variables i.e. abortedRecPtr.  (Sorry for coming back after so long.)
> >
> > Instead of abortedRecPtr point, isn't enough to write
> > overwrite-contrecord at XLogCtl->lastReplayedEndRecPtr?  I think both
> > are pointing to the same location then can't we use
> > lastReplayedEndRecPtr instead of abortedRecPtr to write
> > overwrite-contrecord and remove need of extra global variable, like
> > attached?
>
> I'm a bit fuzzy on the difference "the end+1" and "the start of the next
> record".  Are they always the same?  We do have XLogRecPtrToBytePos()
> and XLogBytePosToEndRecPtr() to convert unadorned XLogRecPtr values to
> "usable byte positions", which suggests to me that the proposed patch
> may fail if end+1 is a page or segment boundary.
>

Yes, you are correct, that could be a possible failure.

How about calculating that from the lastReplayedEndRecPtr by
converting it first to "usable byte positions" and then recalculating
the record pointer from that, like attached?

> The other difference is that abortedRecPtr is set if we fail to read a
> record, but XLogCtl->lastReplayedEndRecPtr is set even if we read the
> record successfully.  So you'd have need a bool flag that the overwrite
> contrecord record needs to be written.  Your patch is using the fact
> that missingContrecPtr is non-invalid as such a flag ... I can't see
> anything wrong with that.  So maybe your patch is okay in this aspect.
>
> > You might wonder why I am so concerned about the global variable. The
> > reason is that I am working on another thread[1] where we are trying
> > to club all the WAL write operations that happen at the end of
> > StartupXLOG into a separate function. In the future, we might want to
> > allow executing this function from other processes (e.g.
> > Checkpointer). For that, we need to remove the dependency of those WAL
> > write operations having on the global variables which are mostly valid
> > in the startup process.
>
> Seems a fine goal.

Thanks for looking at the patch.

Regards,
Amul


v2_remove-abortedRecPtr-global-variable.patch
Description: Binary data

Re: prevent immature WAL streaming

2021-10-13 Thread Alvaro Herrera

On 2021-Oct-13, Andres Freund wrote:

> Hi,
> 
> On 2021-10-13 17:46:53 -0300, Alvaro Herrera wrote:
> > On 2021-Oct-13, Andres Freund wrote:
> > 
> > > Something very roughly like the attached. Perhaps that's going a bit 
> > > overboard
> > > though. But it seems like it might be something we could use in a few 
> > > tests?
> > 
> > Hah, our emails crossed.  If you want to turn this into a patch to the
> > 026 test file, please go ahead.  Failing that, I'd just push the patch I
> > just sent in my other reply.
> 
> Yea, let's go for your patch then. I've verified that at least locally it
> passes under valgrind.

Ah great, thanks.  Pushed then.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
"But static content is just dynamic content that isn't moving!"
http://smylers.hates-software.com/2007/08/15/fe244d0c.html

Re: prevent immature WAL streaming

2021-10-13 Thread Andres Freund

Hi,

On 2021-10-13 17:46:53 -0300, Alvaro Herrera wrote:
> On 2021-Oct-13, Andres Freund wrote:
> 
> > Something very roughly like the attached. Perhaps that's going a bit 
> > overboard
> > though. But it seems like it might be something we could use in a few tests?
> 
> Hah, our emails crossed.  If you want to turn this into a patch to the
> 026 test file, please go ahead.  Failing that, I'd just push the patch I
> just sent in my other reply.

Yea, let's go for your patch then. I've verified that at least locally it
passes under valgrind.

Greetings,

Andres Freund

Re: prevent immature WAL streaming

2021-10-13 Thread Alvaro Herrera

Hi

On 2021-Oct-13, Andres Freund wrote:

> Something very roughly like the attached. Perhaps that's going a bit overboard
> though. But it seems like it might be something we could use in a few tests?

Hah, our emails crossed.  If you want to turn this into a patch to the
026 test file, please go ahead.  Failing that, I'd just push the patch I
just sent in my other reply.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"You're _really_ hosed if the person doing the hiring doesn't understand
relational systems: you end up with a whole raft of programmers, none of
whom has had a Date with the clue stick."  (Andrew Sullivan)

Re: prevent immature WAL streaming

2021-10-13 Thread Alvaro Herrera

This works nicely with the TAP test:



-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Postgres is bloatware by design: it was built to house
 PhD theses." (Joey Hellerstein, SIGMOD annual conference 2002)
>From e9bcdb3b44f36ffa1a2377f252d73ff3d7210fd9 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 13 Oct 2021 17:41:31 -0300
Subject: [PATCH] do away with small tuple writing

---
 .../recovery/t/026_overwrite_contrecord.pl| 39 +++
 1 file changed, 22 insertions(+), 17 deletions(-)

diff --git a/src/test/recovery/t/026_overwrite_contrecord.pl b/src/test/recovery/t/026_overwrite_contrecord.pl
index 1725a97531..b7d3b6e03d 100644
--- a/src/test/recovery/t/026_overwrite_contrecord.pl
+++ b/src/test/recovery/t/026_overwrite_contrecord.pl
@@ -22,27 +22,32 @@ $node->init(allows_streaming => 1);
 $node->append_conf('postgresql.conf', 'wal_keep_size=1GB');
 $node->start;
 
-$node->safe_psql('postgres', 'create table filler (a int)');
-# First, measure how many bytes does the insertion of 1000 rows produce
-my $start_lsn =
-  $node->safe_psql('postgres', q{select pg_current_wal_insert_lsn() - '0/0'});
-$node->safe_psql('postgres',
-	'insert into filler select * from generate_series(1, 1000)');
-my $end_lsn =
-  $node->safe_psql('postgres', q{select pg_current_wal_insert_lsn() - '0/0'});
-my $rows_walsize = $end_lsn - $start_lsn;
+$node->safe_psql('postgres', 'create table filler (a int, b text)');
 
 # Now consume all remaining room in the current WAL segment, leaving
 # space enough only for the start of a largish record.
 $node->safe_psql(
-	'postgres', qq{
-WITH setting AS (
-  SELECT setting::int AS wal_segsize
-FROM pg_settings WHERE name = 'wal_segment_size'
-)
-INSERT INTO filler
-SELECT g FROM setting,
-  generate_series(1, 1000 * (wal_segsize - ((pg_current_wal_insert_lsn() - '0/0') % wal_segsize)) / $rows_walsize) g
+	'postgres', q{
+DO $$
+DECLARE
+wal_segsize int := setting::int FROM pg_settings WHERE name = 'wal_segment_size';
+remain int;
+iters  int := 0;
+BEGIN
+LOOP
+INSERT into filler
+select g, repeat(md5(g::text), (random() * 60 + 1)::int)
+from generate_series(1, 10) g;
+
+remain := wal_segsize - (pg_current_wal_insert_lsn() - '0/0') % wal_segsize;
+IF remain < 2 * setting::int from pg_settings where name = 'block_size' THEN
+RAISE log 'exiting after % iterations, % bytes to end of WAL segment', iters, remain;
+EXIT;
+END IF;
+iters := iters + 1;
+END LOOP;
+END
+$$;
 });
 
 my $initfile = $node->safe_psql('postgres',
-- 
2.30.2

Re: prevent immature WAL streaming

2021-10-13 Thread Andres Freund

Hi,

On 2021-10-13 12:13:45 -0700, Andres Freund wrote:
> On 2021-10-13 15:52:46 -0300, Alvaro Herrera wrote:
> > I think I realized partway through writing the test that I could use
> > emit_message instead of using a batched row insert ... so, yeah, we
> > can use it here also.
>
> Cool. Even if we want to use inserts, lets at least make the rows wide...
>
> I think it'd be good to have a bit of variance in record width. So perhaps
> adding a bit of random() in to influence record width would be a good idea?

Something very roughly like the attached. Perhaps that's going a bit overboard
though. But it seems like it might be something we could use in a few tests?


fill_seg.sql
Description: application/sql

Re: prevent immature WAL streaming

2021-10-13 Thread Alvaro Herrera

On 2021-Oct-13, Andres Freund wrote:

> > > Another thing: filling a segment by inserting lots of very tiny rows is 
> > > pretty
> > > expensive. Can't we use something a bit wider? Perhaps even emit_message?
> 
> FWIW, the count of inserted rows is something like 171985 ;)

This does ~1600 iterations to fill one segment, 10 rows per iteration,
row size is variable; exits when two BLCKSZ remain to complete the WAL
segment:

create table filler (a int, b text);
do $$
declare
wal_segsize int := setting::int from pg_settings where name = 
'wal_segment_size';
remain int;
iters int := 0;
begin
loop
insert into filler
select g, repeat(md5(g::text), (random() * 60 + 1)::int)
from generate_series(1, 10) g;

remain := wal_segsize - (pg_current_wal_insert_lsn() - '0/0') % 
wal_segsize;
raise notice '(%) remain: %', iters, remain;
if remain < 2 * setting::int from pg_settings where name = 
'block_size' then
exit;
end if;
iters := iters + 1;
end loop;
end
$$ ;

(Of course, I'm not proposing that the 'raise notice' be there in the
committed form.)

If I enlarge the 'repeat' count, it gets worse (more iterations
required) because a lot of the rows become toasted and thus subject to
compression.  If I do 20 rows per iteration rather than 10, the risk is
that we'll do too many near the end of the segment and we'll have to
continue running until completing the next one.

So, this seems good enough.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Once again, thank you and all of the developers for your hard work on
PostgreSQL.  This is by far the most pleasant management experience of
any database I've worked on." (Dan Harris)
http://archives.postgresql.org/pgsql-performance/2006-04/msg00247.php

Re: prevent immature WAL streaming

2021-10-13 Thread Andres Freund

Hi,

On 2021-10-13 15:52:46 -0300, Alvaro Herrera wrote:
> > Hm. I guess we can disable autovac. But that's not a great solution, there
> > might be WAL files due to catalog access etc too.
> 
> Well, we don't expect anything else to happen -- the cluster is
> otherwise idle.  I think we should do it regardless of any other
> changes, just to keep things steadier.

IDK, it seems good to have a bit of variance as well. But I don't have a
strong opinion on it.


> > Another thing: filling a segment by inserting lots of very tiny rows is 
> > pretty
> > expensive. Can't we use something a bit wider? Perhaps even emit_message?

FWIW, the count of inserted rows is something like 171985 ;)


> I think I realized partway through writing the test that I could use
> emit_message instead of using a batched row insert ... so, yeah, we
> can use it here also.

Cool. Even if we want to use inserts, lets at least make the rows wide...

I think it'd be good to have a bit of variance in record width. So perhaps
adding a bit of random() in to influence record width would be a good idea?

Greetings,

Andres Freund

Re: prevent immature WAL streaming

2021-10-13 Thread Alvaro Herrera

On 2021-Oct-13, Andres Freund wrote:

> I added LSNs to the error message:
> not ok 1 - 00010002 (0/2002350) differs from 
> 00010002 (0/2099600)
> 
> It appears that the problem is that inbetween the determination of
> rows_walsize the insert LSN moved to the next segment separately from the
> insertion, presumably due to autovacuum/analayze or such.
>  log_min_duration_statement=0>

Oh, of course.

> Hm. I guess we can disable autovac. But that's not a great solution, there
> might be WAL files due to catalog access etc too.

Well, we don't expect anything else to happen -- the cluster is
otherwise idle.  I think we should do it regardless of any other
changes, just to keep things steadier.

> Seems like it might be worth doing the "filling" of the segment with a loop in
> a DO block instead, where the end condition is to be within some distance to
> the end of the segment? With plenty headroom?

Eh, good idea ... didn't think of that, but it should keep things more
stable under strange conditions.

> Another thing: filling a segment by inserting lots of very tiny rows is pretty
> expensive. Can't we use something a bit wider? Perhaps even emit_message?

I think I realized partway through writing the test that I could use
emit_message instead of using a batched row insert ... so, yeah, we
can use it here also.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
"La verdad no siempre es bonita, pero el hambre de ella sí"

Re: prevent immature WAL streaming

2021-10-13 Thread Andres Freund

Hi,

On 2021-10-13 11:03:39 -0700, Andres Freund wrote:
> It seems that 026_overwrite_contrecord.pl test often fails under valgrind. I
> first thought that related failures on skink were due to me migrating the
> animal to a new host (and then running out of space due to a mistake in ccache
> config). But it happened again after I fixed those, and I just reproduced the
> issue locally.
> 
> It's a bit odd that we didn't start to see these failures immediately, but
> only in the last few days. I'd blame skink being migrated to a new home,
> except that I can see the issue locally.
> 
> FWIW, the way skink runs all postgres instances through valgrind is by
> replacing the postgres binary. Here's my local version of that:
> 
> andres@awork3:~/build/postgres/dev-assert/vpath$ cat 
> tmp_install/home/andres/build/postgres/dev-assert/install/bin/postgres
> #!/bin/bash
> 
> exec /usr/bin/valgrind \
> --quiet \
> --error-exitcode=128 \
> --suppressions=/home/andres/src/postgresql/src/tools/valgrind.supp \
> --trace-children=yes --track-origins=yes --read-var-info=no \
> --leak-check=no \
> --run-libc-freeres=no \
> --vgdb=no \
> --error-markers=VALGRINDERROR-BEGIN,VALGRINDERROR-END \
> 
> /home/andres/build/postgres/dev-assert/vpath/tmp_install/home/andres/build/postgres/dev-assert/install/bin/postgres.orig
>  \
> "$@"
> 
> make -C src/test/recovery/ check PROVE_FLAGS='-v' 
> PROVE_TESTS='t/026_overwrite_contrecord.pl' NO_TEMP_INSTALL=1
> ...
> 
> not ok 1 - 00010002 differs from 00010002
> 
> #   Failed test '00010002 differs from 
> 00010002'

I added LSNs to the error message:
not ok 1 - 00010002 (0/2002350) differs from 
00010002 (0/2099600)

It appears that the problem is that inbetween the determination of
rows_walsize the insert LSN moved to the next segment separately from the
insertion, presumably due to autovacuum/analayze or such.



2021-10-13 11:23:25.659 PDT [1491455] 026_overwrite_contrecord.pl LOG:  
statement: insert into filler select * from generate_series(1, 1000)
2021-10-13 11:23:26.467 PDT [1491455] 026_overwrite_contrecord.pl LOG:  
duration: 861.112 ms
2021-10-13 11:23:27.055 PDT [1491458] 026_overwrite_contrecord.pl LOG:  
statement: select pg_current_wal_insert_lsn() - '0/0'
2021-10-13 11:23:27.357 PDT [1491458] 026_overwrite_contrecord.pl LOG:  
duration: 347.888 ms
2021-10-13 11:23:27.980 PDT [1491461] 026_overwrite_contrecord.pl LOG:  
statement: WITH setting AS (
  SELECT setting::int AS wal_segsize
FROM pg_settings WHERE name = 'wal_segment_size'
)
INSERT INTO filler
SELECT g FROM setting,
  generate_series(1, 1000 * (wal_segsize - 
((pg_current_wal_insert_lsn() - '0/0') % wal_segsize)) / 64232) g
2021-10-13 11:24:25.173 PDT [1491550] LOG:  automatic analyze of table 
"postgres.public.filler"
avg read rate: 3.185 MB/s, avg write rate: 0.039 MB/s
buffer usage: 96 hits, 731 misses, 9 dirtied
system usage: CPU: user: 1.79 s, system: 0.00 s, elapsed: 1.81 s
2021-10-13 11:24:26.255 PDT [1491461] 026_overwrite_contrecord.pl LOG:  
duration: 58360.811 ms
2021-10-13 11:24:26.857 PDT [1491557] 026_overwrite_contrecord.pl LOG:  
statement: SELECT pg_current_wal_insert_lsn()
2021-10-13 11:24:27.120 PDT [1491557] 026_overwrite_contrecord.pl LOG:  
duration: 300.562 ms


Hm. I guess we can disable autovac. But that's not a great solution, there
might be WAL files due to catalog access etc too.

Seems like it might be worth doing the "filling" of the segment with a loop in
a DO block instead, where the end condition is to be within some distance to
the end of the segment? With plenty headroom?


Another thing: filling a segment by inserting lots of very tiny rows is pretty
expensive. Can't we use something a bit wider? Perhaps even emit_message?

Greetings,

Andres Freund

Re: prevent immature WAL streaming

2021-10-13 Thread Andres Freund

Hi,

It seems that 026_overwrite_contrecord.pl test often fails under valgrind. I
first thought that related failures on skink were due to me migrating the
animal to a new host (and then running out of space due to a mistake in ccache
config). But it happened again after I fixed those, and I just reproduced the
issue locally.

It's a bit odd that we didn't start to see these failures immediately, but
only in the last few days. I'd blame skink being migrated to a new home,
except that I can see the issue locally.

FWIW, the way skink runs all postgres instances through valgrind is by
replacing the postgres binary. Here's my local version of that:

andres@awork3:~/build/postgres/dev-assert/vpath$ cat 
tmp_install/home/andres/build/postgres/dev-assert/install/bin/postgres
#!/bin/bash

exec /usr/bin/valgrind \
--quiet \
--error-exitcode=128 \
--suppressions=/home/andres/src/postgresql/src/tools/valgrind.supp \
--trace-children=yes --track-origins=yes --read-var-info=no \
--leak-check=no \
--run-libc-freeres=no \
--vgdb=no \
--error-markers=VALGRINDERROR-BEGIN,VALGRINDERROR-END \

/home/andres/build/postgres/dev-assert/vpath/tmp_install/home/andres/build/postgres/dev-assert/install/bin/postgres.orig
 \
"$@"

make -C src/test/recovery/ check PROVE_FLAGS='-v' 
PROVE_TESTS='t/026_overwrite_contrecord.pl' NO_TEMP_INSTALL=1
...

not ok 1 - 00010002 differs from 00010002

#   Failed test '00010002 differs from 00010002'


Regards,

Andres

Re: prevent immature WAL streaming

2021-10-13 Thread Alvaro Herrera

On 2021-Oct-13, Robert Haas wrote:

> On Wed, Oct 13, 2021 at 2:01 AM Amul Sul  wrote:
> > Instead of abortedRecPtr point, isn't enough to write
> > overwrite-contrecord at XLogCtl->lastReplayedEndRecPtr?  I think both
> > are pointing to the same location then can't we use
> > lastReplayedEndRecPtr instead of abortedRecPtr to write
> > overwrite-contrecord and remove need of extra global variable, like
> > attached?
> 
> I think you mean missingContrecPtr, not abortedRecPtr. If I understand
> correctly, abortedRecPtr is going to be the location in some WAL
> segment which we replayed where a long record began, but
> missingContrecPtr seems like it would have to point to the beginning
> of the first segment we were unable to find to continue replay; and
> thus it ought to be the same as lastReplayedEndRecPtr.

So abortedRecPtr and missingContrecPtr both point to the same long
record: the former is the start of the record, and the latter is some
intermediate position where we failed to find the contrecord.
lastReplayedEndRecPtr is the end+1 of the record *prior* to the long
record.

> But the committed code doesn't seem to check that these are the same
> or verify the relationship between them in any way, so I'm worried
> there is some other case here.

Yeah, the only reason they are the same is that xlogreader sets both to
Invalid when reading a record, and then sets both when a read fails.

> The comments in XLogReadRecord also suggest this:
> 
>  * We get here when a record that spans multiple pages needs to be
>  * assembled, but something went wrong -- perhaps a contrecord piece
>  * was lost.  If caller is WAL replay, it will know where the aborted
> 
> Saying that "perhaps" a contrecord piece was lost seems to imply that
> other explanations are possible as well, but I'm not sure what.

Other explanations are possible.  Imagine cosmic rays alter one byte in
the last contrecord.  WAL replay will stop there, and the contrecord
will have been found all right, but CRC check would have failed to pass,
so we would set xlogreader->missingContrecPtr to the final contrecord of
that record (I didn't actually verify this particular scenario.)

In fact, anything that causes xlogreader.c:XLogReadRecord to return NULL
after setting "assembled=true" would set both abortedRecPtr and
missingContrecPtr -- except DecodeXLogRecord failure, which perhaps
should be handled in the same way.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Siempre hay que alimentar a los dioses, aunque la tierra esté seca" (Orual)

Re: prevent immature WAL streaming

2021-10-13 Thread Alvaro Herrera

On 2021-Oct-13, Amul Sul wrote:

> I have one more question, regarding the need for other global
> variables i.e. abortedRecPtr.  (Sorry for coming back after so long.)
> 
> Instead of abortedRecPtr point, isn't enough to write
> overwrite-contrecord at XLogCtl->lastReplayedEndRecPtr?  I think both
> are pointing to the same location then can't we use
> lastReplayedEndRecPtr instead of abortedRecPtr to write
> overwrite-contrecord and remove need of extra global variable, like
> attached?

I'm a bit fuzzy on the difference "the end+1" and "the start of the next
record".  Are they always the same?  We do have XLogRecPtrToBytePos()
and XLogBytePosToEndRecPtr() to convert unadorned XLogRecPtr values to
"usable byte positions", which suggests to me that the proposed patch
may fail if end+1 is a page or segment boundary.

The other difference is that abortedRecPtr is set if we fail to read a
record, but XLogCtl->lastReplayedEndRecPtr is set even if we read the
record successfully.  So you'd have need a bool flag that the overwrite
contrecord record needs to be written.  Your patch is using the fact
that missingContrecPtr is non-invalid as such a flag ... I can't see
anything wrong with that.  So maybe your patch is okay in this aspect.

> You might wonder why I am so concerned about the global variable. The
> reason is that I am working on another thread[1] where we are trying
> to club all the WAL write operations that happen at the end of
> StartupXLOG into a separate function. In the future, we might want to
> allow executing this function from other processes (e.g.
> Checkpointer). For that, we need to remove the dependency of those WAL
> write operations having on the global variables which are mostly valid
> in the startup process.

Seems a fine goal.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/

Re: prevent immature WAL streaming

2021-10-13 Thread Robert Haas

On Wed, Oct 13, 2021 at 2:01 AM Amul Sul  wrote:
> Instead of abortedRecPtr point, isn't enough to write
> overwrite-contrecord at XLogCtl->lastReplayedEndRecPtr?  I think both
> are pointing to the same location then can't we use
> lastReplayedEndRecPtr instead of abortedRecPtr to write
> overwrite-contrecord and remove need of extra global variable, like
> attached?

I think you mean missingContrecPtr, not abortedRecPtr. If I understand
correctly, abortedRecPtr is going to be the location in some WAL
segment which we replayed where a long record began, but
missingContrecPtr seems like it would have to point to the beginning
of the first segment we were unable to find to continue replay; and
thus it ought to be the same as lastReplayedEndRecPtr. But the
committed code doesn't seem to check that these are the same or verify
the relationship between them in any way, so I'm worried there is some
other case here. The comments in XLogReadRecord also suggest this:

 * We get here when a record that spans multiple pages needs to be
 * assembled, but something went wrong -- perhaps a contrecord piece
 * was lost.  If caller is WAL replay, it will know where the aborted

Saying that "perhaps" a contrecord piece was lost seems to imply that
other explanations are possible as well, but I'm not sure what.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

RE: prevent immature WAL streaming

2021-10-13 Thread Jakub Wartak

On 2021-Sep-25, Alvaro Herrera wrote:
>> On 2021-Sep-24, Alvaro Herrera wrote:
>> 
>> > Here's the set for all branches, which I think are really final, in
>> > case somebody wants to play and reproduce their respective problem
>> scenarios.
>> 
>> I forgot to mention that I'll wait until 14.0 is tagged before getting 
>> anything
>> pushed.

Hi Alvaro, sorry for being late to the party, but to add some reassurance that 
v2-commited-fix this really fixes solves the initial production problem, I've 
done limited test on it (just like with the v1-patch idea earlier/ with using 
wal_keep_segments, wal_init_zero=on, archive_mode=on and 
archive_command='/bin/true')

- On 12.8, I was able like last time to manually reproduce it on 3 out of 3 
tries and I've got: 2x "invalid contrecord length", 1x "there is no contrecord 
flag" on standby.

- On soon-to-be-become-12.9 REL_12_STABLE (with commit 
1df0a914d58f2bdb03c11dfcd2cb9cd01c286d59 ) on 4 out of 4 tries, I've got 
beautiful insight into what happened:
LOG:  started streaming WAL from primary at 1/EC00 on timeline 1
LOG:  sucessfully skipped missing contrecord at 1/EBF8, overwritten at 
2021-10-13 11:22:37.48305+00
CONTEXT:  WAL redo at 1/EC28 for XLOG/OVERWRITE_CONTRECORD: lsn 1/EBF8; 
time 2021-10-13 11:22:37.48305+00
...and slave was able to carry-on automatically. In 4th test, the cascade was 
tested too (m -> s1 -> s11) and both {s1,s11} did behave properly and log the 
above message. Also additional check proved that after simulating ENOSPC crash 
on master the data contents were identical everywhere (m1=s1=s11). 

Thank you Alvaro and also to everybody else who participated in solving this 
challenging and really edge-case nasty bug.

-J.

Re: prevent immature WAL streaming

2021-10-13 Thread Amul Sul

Hi,

On Thu, Oct 7, 2021 at 6:57 PM Alvaro Herrera  wrote:
>
> On 2021-Oct-07, Amul Sul wrote:
>
> > Make sense, thanks for the explanation.
>
> You're welcome.  Also, I forgot: thank you for taking the time to review
> the code.  Much appreciated.

:)

>
>

I have one more question, regarding the need for other global
variables i.e. abortedRecPtr.  (Sorry for coming back after so long.)

Instead of abortedRecPtr point, isn't enough to write
overwrite-contrecord at XLogCtl->lastReplayedEndRecPtr?  I think both
are pointing to the same location then can't we use
lastReplayedEndRecPtr instead of abortedRecPtr to write
overwrite-contrecord and remove need of extra global variable, like
attached?

You might wonder why I am so concerned about the global variable. The
reason is that I am working on another thread[1] where we are trying
to club all the WAL write operations that happen at the end of
StartupXLOG into a separate function. In the future, we might want to
allow executing this function from other processes (e.g.
Checkpointer). For that, we need to remove the dependency of those WAL
write operations having on the global variables which are mostly valid
in the startup process. The easiest way to do that is simply copy all
the global variables into shared memory but that will not be an
optimised solution, the goal is to try to see if we could leverage the existing
information available into shared memory. I would be grateful if you
could share your thoughts on the same, thanks.

Regards,
Amul

1] 
https://postgr.es/m/caaj_b97kzzdjsffwrk7w0xu5hnxkcgkgtr69t8cozztsyxj...@mail.gmail.com
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 26dcc00ac01..52b93cd80cc 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -200,12 +200,11 @@ static XLogRecPtr flushedUpto = 0;
 static TimeLineID receiveTLI = 0;
 
 /*
- * abortedRecPtr is the start pointer of a broken record at end of WAL when
- * recovery completes; missingContrecPtr is the location of the first
- * contrecord that went missing.  See CreateOverwriteContrecordRecord for
- * details.
+ * missingContrecPtr is the location of the first contrecord that went missing.
+ * XLogCtl->lastReplayedEndRecPtr will be the start pointer of a broken record
+ * at end of WAL when recovery completes. See CreateOverwriteContrecordRecord
+ * for details.
  */
-static XLogRecPtr abortedRecPtr;
 static XLogRecPtr missingContrecPtr;
 
 /*
@@ -4425,11 +4424,8 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			 * that portion is to be ignored.
 			 */
 			if (!StandbyMode &&
-!XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
-			{
-abortedRecPtr = xlogreader->abortedRecPtr;
+!XLogRecPtrIsInvalid(xlogreader->missingContrecPtr))
 missingContrecPtr = xlogreader->missingContrecPtr;
-			}
 
 			if (readFile >= 0)
 			{
@@ -7109,7 +7105,6 @@ StartupXLOG(void)
 	/*
 	 * Start recovery assuming that the final record isn't lost.
 	 */
-	abortedRecPtr = InvalidXLogRecPtr;
 	missingContrecPtr = InvalidXLogRecPtr;
 
 	/* REDO */
@@ -7876,10 +7871,7 @@ StartupXLOG(void)
 	 * we'll do as soon as we're open for writing new WAL.)
 	 */
 	if (!XLogRecPtrIsInvalid(missingContrecPtr))
-	{
-		Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
 		EndOfLog = missingContrecPtr;
-	}
 
 	/*
 	 * Prepare to write WAL starting at EndOfLog location, and init xlog
@@ -7936,11 +7928,10 @@ StartupXLOG(void)
 	LocalSetXLogInsertAllowed();
 
 	/* If necessary, write overwrite-contrecord before doing anything else */
-	if (!XLogRecPtrIsInvalid(abortedRecPtr))
+	if (!XLogRecPtrIsInvalid(missingContrecPtr))
 	{
-		Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
-		CreateOverwriteContrecordRecord(abortedRecPtr);
-		abortedRecPtr = InvalidXLogRecPtr;
+		Assert(!XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr));
+		CreateOverwriteContrecordRecord(XLogCtl->lastReplayedEndRecPtr);
 		missingContrecPtr = InvalidXLogRecPtr;
 	}
 
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 4b03577dccd..516124ecd5f 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -294,7 +294,6 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	state->errormsg_buf[0] = '\0';
 
 	ResetDecoder(state);
-	state->abortedRecPtr = InvalidXLogRecPtr;
 	state->missingContrecPtr = InvalidXLogRecPtr;
 
 	RecPtr = state->EndRecPtr;
@@ -586,7 +585,6 @@ err:
 		 * in turn signal downstream WAL consumers that the broken WAL record
 		 * is to be ignored.
 		 */
-		state->abortedRecPtr = RecPtr;
 		state->missingContrecPtr = targetPagePtr;
 	}
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index de6fd791fe6..d876e25de7f 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -176,11 +176,9 @@ struct XLogReaderState
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
 
 	/*
-	 * Set at the end

Re: prevent immature WAL streaming

2021-10-07 Thread Alvaro Herrera

On 2021-Oct-07, Amul Sul wrote:

> Make sense, thanks for the explanation.

You're welcome.  Also, I forgot: thank you for taking the time to review
the code.  Much appreciated.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/

Re: prevent immature WAL streaming

2021-10-07 Thread Amul Sul

On Thu, 7 Oct 2021 at 6:41 PM, Alvaro Herrera 
wrote:

> On 2021-Oct-07, Amul Sul wrote:
>
> > While reading this commit (ff9f111bce24), wondered can't we skip
> > missingContrecPtr global variable declaration and calculate that from
> > abortedRecPtr value whenever it needed. IIUC, missingContrecPtr is the
> > next page to the page that abortedRecPtr contain and that can be
> > calculated as "abortedRecPtr + (XLOG_BLCKSZ - (abortedRecPtr %
> > XLOG_BLCKSZ))", thoughts? Please correct me if I'm missing something,
> > thanks.
>
> I don't think that works -- what if the missing record is not on the
> next page but on some future one?  Imagine an enormously large record
> that starts on segment 1, covers all of segment 2 and ends in segment 3.
> We could have flushed segment 2 already, so with your idea we would skip
> ahead only to that position, but really we need to skip all the way to
> the first page of segment 3.
>
> This is easier to imagine if you set wal segment size to 1 MB, but it is
> possible with the default segment size too, since commit records can be
> arbitrarily large, and "logical message" records as well.


Make sense, thanks for the explanation.

Regards,
Amul
-- 
Regards,
Amul Sul
EDB: http://www.enterprisedb.com

Re: prevent immature WAL streaming

2021-10-07 Thread Alvaro Herrera

On 2021-Oct-07, Amul Sul wrote:

> While reading this commit (ff9f111bce24), wondered can't we skip
> missingContrecPtr global variable declaration and calculate that from
> abortedRecPtr value whenever it needed. IIUC, missingContrecPtr is the
> next page to the page that abortedRecPtr contain and that can be
> calculated as "abortedRecPtr + (XLOG_BLCKSZ - (abortedRecPtr %
> XLOG_BLCKSZ))", thoughts? Please correct me if I'm missing something,
> thanks.

I don't think that works -- what if the missing record is not on the
next page but on some future one?  Imagine an enormously large record
that starts on segment 1, covers all of segment 2 and ends in segment 3.
We could have flushed segment 2 already, so with your idea we would skip
ahead only to that position, but really we need to skip all the way to
the first page of segment 3.

This is easier to imagine if you set wal segment size to 1 MB, but it is
possible with the default segment size too, since commit records can be
arbitrarily large, and "logical message" records as well.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/

Re: prevent immature WAL streaming

2021-10-07 Thread Amul Sul

Hi,

On Wed, Sep 29, 2021 at 8:14 PM Alvaro Herrera  wrote:
>
> On 2021-Sep-24, Alvaro Herrera wrote:
>
> > Here's the set for all branches, which I think are really final, in case
> > somebody wants to play and reproduce their respective problem scenarios.
> > Nathan already confirmed that his reproducer no longer shows a problem,
> > and this version shouldn't affect that.
>
> Pushed.  Watching for buildfarm fireworks now.
>

While reading this commit (ff9f111bce24), wondered can't we skip
missingContrecPtr global variable declaration and calculate that from
abortedRecPtr value whenever it needed. IIUC, missingContrecPtr is the
next page to the page that abortedRecPtr contain and that can be
calculated as "abortedRecPtr + (XLOG_BLCKSZ - (abortedRecPtr %
XLOG_BLCKSZ))", thoughts? Please correct me if I'm missing something,
thanks.

Regards,
Amul

Re: prevent immature WAL streaming

2021-09-30 Thread Tom Lane

Alvaro Herrera  writes:
> Hmm.  Well, as I said, maybe this part of the test isn't worth much
> anyway.  Rather than spending time trying to figure out why isn't this
> triggering the WAL overwriting, I compared the coverage report for
> running only the first test to the coverage report of running only the
> second test.  It turns out that there's no relevant coverage increase in
> the second test.  So I propose just removing that part.

Seems reasonable.  We don't need to spend buildfarm cycles forevermore
on a test that's not adding useful coverage.

regards, tom lane

Re: prevent immature WAL streaming

2021-09-30 Thread Alvaro Herrera

On 2021-Sep-30, Tom Lane wrote:

> Just when you thought it was safe to go back in the water:
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog=2021-09-29%2022%3A05%3A44
> 
> which is complaining that the (misspelled, BTW)

Ah, the case of the missing juxtaposed consonants.  Les Luthiers have
something to say on that matter.
https://www.youtube.com/watch?v=ptorPqV7D5s

> log message 'sucessfully skipped missing contrecord at' doesn't show
> up.

Hmm.  Well, as I said, maybe this part of the test isn't worth much
anyway.  Rather than spending time trying to figure out why isn't this
triggering the WAL overwriting, I compared the coverage report for
running only the first test to the coverage report of running only the
second test.  It turns out that there's no relevant coverage increase in
the second test.  So I propose just removing that part.

(The reason I added that test in the first place was to try to reproduce
the problem without having to physically unlink a WAL file from the
primary's pg_wal subdir.  But maybe it's just make-work.)

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
>From 759846455920d1886a154b2b159363d353ba0def Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Thu, 30 Sep 2021 16:55:17 -0300
Subject: [PATCH] Remove test, fix typo

---
 src/backend/access/transam/xlog.c |   2 +-
 .../recovery/t/026_overwrite_contrecord.pl| 116 +-
 src/test/recovery/t/idiosyncratic_copy|  20 ---
 3 files changed, 3 insertions(+), 135 deletions(-)
 delete mode 100755 src/test/recovery/t/idiosyncratic_copy

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1388afdfb0..f8c714b7b7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -10598,7 +10598,7 @@ VerifyOverwriteContrecord(xl_overwrite_contrecord *xlrec, XLogReaderState *state
 			 LSN_FORMAT_ARGS(state->overwrittenRecPtr));
 
 	ereport(LOG,
-			(errmsg("sucessfully skipped missing contrecord at %X/%X, overwritten at %s",
+			(errmsg("successfully skipped missing contrecord at %X/%X, overwritten at %s",
 	LSN_FORMAT_ARGS(xlrec->overwritten_lsn),
 	timestamptz_to_str(xlrec->overwrite_time;
 
diff --git a/src/test/recovery/t/026_overwrite_contrecord.pl b/src/test/recovery/t/026_overwrite_contrecord.pl
index acf9bd08d1..1725a97531 100644
--- a/src/test/recovery/t/026_overwrite_contrecord.pl
+++ b/src/test/recovery/t/026_overwrite_contrecord.pl
@@ -10,7 +10,7 @@ use PostgresNode;
 use TestLib;
 use Test::More;
 
-plan tests => 5;
+plan tests => 3;
 
 # Test: Create a physical replica that's missing the last WAL file,
 # then restart the primary to create a divergent WAL file and observe
@@ -89,120 +89,8 @@ ok($node_standby->safe_psql('postgres', 'select * from foo') eq 'hello',
 my $log = slurp_file($node_standby->logfile);
 like(
 	$log,
-	qr[sucessfully skipped missing contrecord at],
+	qr[successfully skipped missing contrecord at],
 	"found log line in standby");
 
 $node->stop;
 $node_standby->stop;
-
-
-# Second test: a standby that receives WAL via archive/restore commands.
-$node = PostgresNode->new('primary2');
-$node->init(
-	has_archiving => 1,
-	extra => ['--wal-segsize=1']);
-$node->set_replication_conf;
-
-# Note: consistent use of forward slashes here avoids any escaping problems
-# that arise from use of backslashes. That means we need to double-quote all
-# the paths in the archive_command
-my $perlbin = TestLib::perl2host($^X);
-$perlbin =~ s!\\!/!g if $TestLib::windows_os;
-my $archivedir = $node->archive_dir;
-$archivedir =~ s!\\!/!g if $TestLib::windows_os;
-$node->append_conf(
-	'postgresql.conf',
-	qq{
-archive_command = '"$perlbin" "$FindBin::RealBin/idiosyncratic_copy" "%p" "$archivedir/%f"'
-wal_level = replica
-max_wal_senders = 2
-wal_keep_size = 1GB
-});
-# Make sure that Msys perl doesn't complain about difficulty in setting locale
-# when called from the archive_command.
-local $ENV{PERL_BADLANG} = 0;
-$node->start;
-$node->backup('backup');
-
-$node_standby = PostgresNode->new('standby2');
-$node_standby->init_from_backup($node, 'backup', has_restoring => 1);
-
-$node_standby->start;
-
-$node->safe_psql('postgres', 'create table filler (a int)');
-# First, measure how many bytes does the insertion of 1000 rows produce
-$start_lsn =
-  $node->safe_psql('postgres', q{select pg_current_wal_insert_lsn() - '0/0'});
-$node->safe_psql('postgres',
-	'insert into filler select * from generate_series(1, 1000)');
-$end_lsn =
-  $node->safe_psql('postgres', q{select pg_current_wal_insert_lsn() - '0/0'});
-$rows_walsize = $end_lsn - $start_lsn;
-
-# Now consume all remaining room in the current WAL segment, leaving
-# space enough only for the start of a largish record.
-$node->safe_psql(
-	'postgres', qq{
-WITH setting AS (
-  SELECT setting::int AS wal_segsize
-FROM pg_settings WHERE name = 'wal_segment_size'
-)
-INSERT INTO filler

Re: prevent immature WAL streaming

2021-09-30 Thread Tom Lane

Just when you thought it was safe to go back in the water:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog=2021-09-29%2022%3A05%3A44

which is complaining that the (misspelled, BTW) log message
'sucessfully skipped missing contrecord at' doesn't show up.

This machine is old, slow, and 32-bit bigendian.  I first thought
the problem might be "didn't wait long enough", but it seems like
waiting for replay ought to be sufficient.  What I'm now guessing
is that the test case is making unwarranted assumptions about how
much WAL will be generated, such that no page-crossing contrecord
actually appears.

Also, digging around, I see hornet showed the same problem:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet=2021-09-29%2018%3A19%3A55

hornet is 64-bit bigendian ... so maybe this actually reduces to
an endianness question?

regards, tom lane

Re: prevent immature WAL streaming

2021-09-30 Thread Tom Lane

Alvaro Herrera  writes:
> On 2021-Sep-30, Tom Lane wrote:
>> There's still the issue of these tests overriding postgresql.conf
>> entries made by init(), but maybe we can live with that?

> I vote to at least have has_archiving=>1 set wal_level=replica, and
> potentially max_wal_senders=2 too (allows pg_basebackup).

I think this requires a bit more investigation.  I looked quickly
through the pre-existing tests that set has_archiving=>1, and every
single one of them also sets allows_streaming=>1, which no doubt
explains why this issue hasn't come up before.  So now I'm wondering
if any of those other tests is setting allows_streaming only because
of this configuration issue.

More to the point, since we've not previously used the combination
of has_archiving without allows_streaming, I wonder exactly how
we want to define it to work.  I'm not really convinced that it
should be defined as "allows basebackup even though replication
is supposed to be off".

Perhaps a compromise could be to invent a third option
"allows_basebackup", so that init() actually knows what's going on?

regards, tom lane

Re: prevent immature WAL streaming

2021-09-30 Thread Alvaro Herrera

On 2021-Sep-30, Tom Lane wrote:

> Andrew Dunstan  writes:
> > Regardless of this problem, I think we should simply call
> > set_replication_conf unconditionally in init().  Replication connections
> > are now allowed by default on Unix, this would just bring Windows nodes
> > into line with that.
> 
> Yeah, I was thinking along the same lines yesterday.  The fact that
> pre-commit testing failed to note the problem is exactly because we
> have this random difference between what works by default on Unix
> and what works by default on Windows.  Let's close that gap before
> it bites us again.

+1

> There's still the issue of these tests overriding postgresql.conf
> entries made by init(), but maybe we can live with that?

I vote to at least have has_archiving=>1 set wal_level=replica, and
potentially max_wal_senders=2 too (allows pg_basebackup).

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/

Re: prevent immature WAL streaming

2021-09-30 Thread Tom Lane

Andrew Dunstan  writes:
> Regardless of this problem, I think we should simply call
> set_replication_conf unconditionally in init().  Replication connections
> are now allowed by default on Unix, this would just bring Windows nodes
> into line with that.

Yeah, I was thinking along the same lines yesterday.  The fact that
pre-commit testing failed to note the problem is exactly because we
have this random difference between what works by default on Unix
and what works by default on Windows.  Let's close that gap before
it bites us again.

There's still the issue of these tests overriding postgresql.conf
entries made by init(), but maybe we can live with that?

regards, tom lane

Re: prevent immature WAL streaming

2021-09-30 Thread Andrew Dunstan



On 9/29/21 5:29 PM, Tom Lane wrote:
> Andrew Dunstan  writes:
>> On 9/29/21 4:33 PM, Tom Lane wrote:
>>> which looks like a pretty straightforward bogus-connection-configuration
>>> problem, except why wouldn't other BF members show it?
>> This:
>> ...
>> doesn't have "allows_streaming => 1".
> Oh, and that only breaks things on Windows, cf set_replication_conf.
>
> ... although I wonder why the fact that sub init otherwise sets
> wal_level = minimal doesn't cause a problem for this test.
> Maybe the test script is cowboy-ishly overriding that?
>
>   



Regardless of this problem, I think we should simply call
set_replication_conf unconditionally in init().  Replication connections
are now allowed by default on Unix, this would just bring Windows nodes
into line with that.


The function does have this:

$self->host eq $test_pghost
   or die "set_replication_conf only works with the default host";

I'm not sure when that wouldn't be true.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: prevent immature WAL streaming

2021-09-30 Thread Alvaro Herrera

Hello

On 2021-Sep-30, Andres Freund wrote:

> FWIW, with that fixed I see the test hanging (e.g. [1]):
> 
> can't unlink 
> c:/cirrus/src/test/recovery/tmp_check/t_026_overwrite_contrecord_primary2_data/archives/00010008.fail:
>  Permission denied at t/026_overwrite_contrecord.pl line 189.
> ### Stopping node "primary2" using mode immediate
> # Running: pg_ctl -D 
> c:/cirrus/src/test/recovery/tmp_check/t_026_overwrite_contrecord_primary2_data/pgdata
>  -m immediate stop
> waiting for server to shut 
> down...
>  failed
> pg_ctl: server does not shut down
> Bail out!  command "pg_ctl -D 
> c:/cirrus/src/test/recovery/tmp_check/t_026_overwrite_contrecord_primary2_data/pgdata
>  -m immediate stop" exited with value 1
> Warning: unable to close filehandle GEN6 properly: Bad file descriptor during 
> global destruction.
> 
> The hang seems to be fixed by uncommenting the $h->finish(). Was there a
> reason you commented that out?

Hmm, no -- I was experimenting to see what effect it had, and because I
saw none, I left it like that.  Let me fix both things and see what
happens next.

> THe test still fails, but at least it doesn't hang anymore.

I'll try and get it fixed, but ultimately we can always "fix" this test
by removing it.  It is only in 14 and master; the earlier branches only
have the first part of this test file.  (That's because I ran out of
patience trying to port it to versions lacking LSN operators and such.)
I think the second half doesn't provide all that much additional
coverage anyway.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
"Ellos andaban todos desnudos como su madre los parió, y también las mujeres,
aunque no vi más que una, harto moza, y todos los que yo vi eran todos
mancebos, que ninguno vi de edad de más de XXX años" (Cristóbal Colón)

Re: prevent immature WAL streaming

2021-09-30 Thread Andres Freund

Hi,

On 2021-09-29 17:04:48 -0400, Andrew Dunstan wrote:
> On 9/29/21 4:33 PM, Tom Lane wrote:
> # Second test: a standby that receives WAL via archive/restore commands.
> $node = PostgresNode->new('primary2');
> $node->init(
>     has_archiving => 1,
>     extra => ['--wal-segsize=1']);
> 
> 
> doesn't have "allows_streaming => 1".

FWIW, with that fixed I see the test hanging (e.g. [1]):

can't unlink 
c:/cirrus/src/test/recovery/tmp_check/t_026_overwrite_contrecord_primary2_data/archives/00010008.fail:
 Permission denied at t/026_overwrite_contrecord.pl line 189.
### Stopping node "primary2" using mode immediate
# Running: pg_ctl -D 
c:/cirrus/src/test/recovery/tmp_check/t_026_overwrite_contrecord_primary2_data/pgdata
 -m immediate stop
waiting for server to shut 
down...
 failed
pg_ctl: server does not shut down
Bail out!  command "pg_ctl -D 
c:/cirrus/src/test/recovery/tmp_check/t_026_overwrite_contrecord_primary2_data/pgdata
 -m immediate stop" exited with value 1
Warning: unable to close filehandle GEN6 properly: Bad file descriptor during 
global destruction.


The hang seems to be fixed by uncommenting the $h->finish(). Was there a
reason you commented that out? THe test still fails, but at least it doesn't
hang anymore.

Greetings,

Andres Freund

[1] 
https://api.cirrus-ci.com/v1/artifact/task/6204050896060416/tap/src/test/recovery/tmp_check/log/regress_log_026_overwrite_contrecord

Re: prevent immature WAL streaming

2021-09-29 Thread Tom Lane

Andrew Dunstan  writes:
> On 9/29/21 5:27 PM, Alvaro Herrera wrote:
>> So, do we take the stance that we have no right to expect pg_basebackup
>> to work if we didn't pass allow_streaming => 1?  If so, the fix is to
>> add it.  But my preferred fix would be to call set_replication_conf if
>> either allows_streaming or has_archiving are given.
>> 
>> Another easy fix would be to call $primary2->set_replication_conf in the
>> test file, but then you'd complain that that's supposed to be an
>> internal method :-)

> It claims that but it's also used here:
> src/bin/pg_basebackup/t/010_pg_basebackup.pl

Given that precedent, it seems like calling set_replication_conf
is a good quick-fix for getting the buildfarm green again.
But +1 for then refactoring this to get rid of these hacks (both
with respect to pg_hba.conf and the postgresql.conf parameters)
in both of these tests.

regards, tom lane

Re: prevent immature WAL streaming

2021-09-29 Thread Andrew Dunstan



On 9/29/21 5:27 PM, Alvaro Herrera wrote:
> On 2021-Sep-29, Andrew Dunstan wrote:
>
>>> The relevant info seems to be
>>>
>>> # Running: pg_basebackup -D 
>>> /home/pgrunner/bf/root/REL_14_STABLE/pgsql.build/src/test/recovery/tmp_check/t_026_overwrite_contrecord_primary2_data/backup/backup
>>>  -h 127.0.0.1 -p 59502 --checkpoint fast --no-sync
>>> pg_basebackup: error: connection to server at "127.0.0.1", port 59502 
>>> failed: FATAL:  no pg_hba.conf entry for replication connection from host 
>>> "127.0.0.1", user "pgrunner", no encryption
>>> Bail out!  system pg_basebackup failed
>>>
>>> which looks like a pretty straightforward bogus-connection-configuration
>>> problem, except why wouldn't other BF members show it?
>> This:
>>
>> # Second test: a standby that receives WAL via archive/restore commands.
>> $node = PostgresNode->new('primary2');
>> $node->init(
>>     has_archiving => 1,
>>     extra => ['--wal-segsize=1']);
>>
>> doesn't have "allows_streaming => 1".
> Hmm, but I omitted allows_streaming on purpose -- I only wanted
> archiving, not streaming.  I understand that your point is that
> set_replication_conf is not called unless allows_streaming is set.
>
> So, do we take the stance that we have no right to expect pg_basebackup
> to work if we didn't pass allow_streaming => 1?  If so, the fix is to
> add it.  But my preferred fix would be to call set_replication_conf if
> either allows_streaming or has_archiving are given.
>
>
> Another easy fix would be to call $primary2->set_replication_conf in the
> test file, but then you'd complain that that's supposed to be an
> internal method :-)


It claims that but it's also used here:


src/bin/pg_basebackup/t/010_pg_basebackup.pl


(Also, good perl style would start purely internal method names with an
underscore.)


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: prevent immature WAL streaming

2021-09-29 Thread Alvaro Herrera

On 2021-Sep-29, Tom Lane wrote:

> ... although I wonder why the fact that sub init otherwise sets
> wal_level = minimal doesn't cause a problem for this test.
> Maybe the test script is cowboy-ishly overriding that?

It is.  (I claim that it should be set otherwise when has_archiving=>1).

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"El Maquinismo fue proscrito so pena de cosquilleo hasta la muerte"
(Ijon Tichy en Viajes, Stanislaw Lem)

Re: prevent immature WAL streaming

2021-09-29 Thread Tom Lane

Andrew Dunstan  writes:
> On 9/29/21 4:33 PM, Tom Lane wrote:
>> which looks like a pretty straightforward bogus-connection-configuration
>> problem, except why wouldn't other BF members show it?

> This:
> ...
> doesn't have "allows_streaming => 1".

Oh, and that only breaks things on Windows, cf set_replication_conf.

... although I wonder why the fact that sub init otherwise sets
wal_level = minimal doesn't cause a problem for this test.
Maybe the test script is cowboy-ishly overriding that?

regards, tom lane

Re: prevent immature WAL streaming

2021-09-29 Thread Alvaro Herrera

On 2021-Sep-29, Andrew Dunstan wrote:

> > The relevant info seems to be
> >
> > # Running: pg_basebackup -D 
> > /home/pgrunner/bf/root/REL_14_STABLE/pgsql.build/src/test/recovery/tmp_check/t_026_overwrite_contrecord_primary2_data/backup/backup
> >  -h 127.0.0.1 -p 59502 --checkpoint fast --no-sync
> > pg_basebackup: error: connection to server at "127.0.0.1", port 59502 
> > failed: FATAL:  no pg_hba.conf entry for replication connection from host 
> > "127.0.0.1", user "pgrunner", no encryption
> > Bail out!  system pg_basebackup failed
> >
> > which looks like a pretty straightforward bogus-connection-configuration
> > problem, except why wouldn't other BF members show it?
> 
> This:
> 
> # Second test: a standby that receives WAL via archive/restore commands.
> $node = PostgresNode->new('primary2');
> $node->init(
>     has_archiving => 1,
>     extra => ['--wal-segsize=1']);
> 
> doesn't have "allows_streaming => 1".

Hmm, but I omitted allows_streaming on purpose -- I only wanted
archiving, not streaming.  I understand that your point is that
set_replication_conf is not called unless allows_streaming is set.

So, do we take the stance that we have no right to expect pg_basebackup
to work if we didn't pass allow_streaming => 1?  If so, the fix is to
add it.  But my preferred fix would be to call set_replication_conf if
either allows_streaming or has_archiving are given.

Another easy fix would be to call $primary2->set_replication_conf in the
test file, but then you'd complain that that's supposed to be an
internal method :-)

(This reminds me that I had to add something that seemed like it should
have been unnecessary: wal_level=replica should become set if I request
archiving, right?  Otherwise the WAL archive is useless.  I also had to
add max_wal_senders=2 so that pg_basebackup would work, but I'm on the
fence about setting that automatically if has_archiving is given.)

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
"At least to kernel hackers, who really are human, despite occasional
rumors to the contrary" (LWN.net)

Re: prevent immature WAL streaming

2021-09-29 Thread Andrew Dunstan



On 9/29/21 4:33 PM, Tom Lane wrote:
> Alvaro Herrera  writes:
>> Pushed.  Watching for buildfarm fireworks now.
> jacana didn't like it:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana=2021-09-29%2018%3A25%3A53
>
> The relevant info seems to be
>
> # Running: pg_basebackup -D 
> /home/pgrunner/bf/root/REL_14_STABLE/pgsql.build/src/test/recovery/tmp_check/t_026_overwrite_contrecord_primary2_data/backup/backup
>  -h 127.0.0.1 -p 59502 --checkpoint fast --no-sync
> pg_basebackup: error: connection to server at "127.0.0.1", port 59502 failed: 
> FATAL:  no pg_hba.conf entry for replication connection from host 
> "127.0.0.1", user "pgrunner", no encryption
> Bail out!  system pg_basebackup failed
>
> which looks like a pretty straightforward bogus-connection-configuration
> problem, except why wouldn't other BF members show it?
>
>   


This:

# Second test: a standby that receives WAL via archive/restore commands.
$node = PostgresNode->new('primary2');
$node->init(
    has_archiving => 1,
    extra => ['--wal-segsize=1']);


doesn't have "allows_streaming => 1".


cheers


andrew


-- 

Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: prevent immature WAL streaming

2021-09-29 Thread Tom Lane

Alvaro Herrera  writes:
> Pushed.  Watching for buildfarm fireworks now.

jacana didn't like it:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana=2021-09-29%2018%3A25%3A53

The relevant info seems to be

# Running: pg_basebackup -D 
/home/pgrunner/bf/root/REL_14_STABLE/pgsql.build/src/test/recovery/tmp_check/t_026_overwrite_contrecord_primary2_data/backup/backup
 -h 127.0.0.1 -p 59502 --checkpoint fast --no-sync
pg_basebackup: error: connection to server at "127.0.0.1", port 59502 failed: 
FATAL:  no pg_hba.conf entry for replication connection from host "127.0.0.1", 
user "pgrunner", no encryption
Bail out!  system pg_basebackup failed

which looks like a pretty straightforward bogus-connection-configuration
problem, except why wouldn't other BF members show it?

regards, tom lane

Re: prevent immature WAL streaming

2021-09-29 Thread Alvaro Herrera

On 2021-Sep-24, Hannu Krosing wrote:

Hi Hannu

> In the first email you write
> 
> > As mentioned in the course of thread [1], we're missing a fix for
> streaming replication to avoid sending records that the primary hasn't
> fully flushed yet.  This patch is a first attempt at fixing that problem
> by retreating the LSN reported as FlushPtr whenever a segment is
> registered, based on the understanding that if no registration exists
> then the LogwrtResult.Flush pointer can be taken at face value; but if a
> registration exists, then we have to stream only till the start LSN of
> that registered entry.
> 
> So did we end up holding back the wal_sender to not send anything that
> is not confirmed as flushed on master

No.  We eventually realized that that approach was a dead end, so I
abandoned the whole thing and attacked the problem differently.  So your
other questions don't apply.  I tried to make the commit message explain
both the problem and the solution in as much detail as possible; please
have a look at that and let me know if something is unclear.

Thanks

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"El miedo atento y previsor es la madre de la seguridad" (E. Burke)

Re: prevent immature WAL streaming

2021-09-29 Thread Alvaro Herrera

On 2021-Sep-24, Alvaro Herrera wrote:

> Here's the set for all branches, which I think are really final, in case
> somebody wants to play and reproduce their respective problem scenarios.
> Nathan already confirmed that his reproducer no longer shows a problem,
> and this version shouldn't affect that.

Pushed.  Watching for buildfarm fireworks now.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/

Re: prevent immature WAL streaming

2021-09-25 Thread Alvaro Herrera

On 2021-Sep-24, Alvaro Herrera wrote:

> Here's the set for all branches, which I think are really final, in case
> somebody wants to play and reproduce their respective problem scenarios.

I forgot to mention that I'll wait until 14.0 is tagged before getting
anything pushed.


-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
Thou shalt check the array bounds of all strings (indeed, all arrays), for
surely where thou typest "foo" someone someday shall type
"supercalifragilisticexpialidocious" (5th Commandment for C programmers)

Re: prevent immature WAL streaming

2021-09-24 Thread Hannu Krosing

Hi Alvaro,

I just started reading this thread, but maybe you can confirm or
refute my understanding of what was done.

In the first email you write

> As mentioned in the course of thread [1], we're missing a fix for
streaming replication to avoid sending records that the primary hasn't
fully flushed yet.  This patch is a first attempt at fixing that problem
by retreating the LSN reported as FlushPtr whenever a segment is
registered, based on the understanding that if no registration exists
then the LogwrtResult.Flush pointer can be taken at face value; but if a
registration exists, then we have to stream only till the start LSN of
that registered entry.

So did we end up holding back the wal_sender to not send anything that
is not confirmed as flushed on master

Are there measurements on how much this slows down replication
compared to allowing sending the moment it is written in buffers but
not necessarily flushed locally ?

Did we investigate possibility of sending as fast as possible and
controlling the flush synchronisation by sending separate flush
pointers *both* ways ?

And maybe there was even an alternative considered where we are
looking at a more general Durability, for example 2-out-of-3 where
primary is one of the 3 and not necessarily the most durable one?

-
Hannu Krosing
Google Cloud - We have a long list of planned contributions and we are hiring.
Contact me if interested.

On Fri, Sep 24, 2021 at 4:33 AM Alvaro Herrera  wrote:
>
> On 2021-Sep-23, Alvaro Herrera wrote:
>
> > However, I notice now that the pg_rewind tests reproducibly fail in
> > branch 14 for reasons I haven't yet understood.  It's strange that no
> > other branch fails, even when run quite a few times.
>
> Turns out that this is a real bug (setting EndOfLog seems insufficient).
> I'm looking into it.
>
> --
> Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
> "No necesitamos banderas
>  No reconocemos fronteras"  (Jorge González)
>
>

Re: prevent immature WAL streaming

2021-09-23 Thread Alvaro Herrera

On 2021-Sep-23, Alvaro Herrera wrote:

> However, I notice now that the pg_rewind tests reproducibly fail in
> branch 14 for reasons I haven't yet understood.  It's strange that no
> other branch fails, even when run quite a few times.

Turns out that this is a real bug (setting EndOfLog seems insufficient).
I'm looking into it.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
"No necesitamos banderas
 No reconocemos fronteras"  (Jorge González)

Re: prevent immature WAL streaming

2021-09-23 Thread Alvaro Herrera

This took some time because backpatching the tests was more work than I
anticipated -- function name changes, operators that don't exist,
definition of the WAL segment size in pg_settings.  I had to remove the
second test in branches 13 and earlier due to lack of LSN+bytes
operator.  Fortunately, the first test (which is not as clean) still
works all the way back.

However, I notice now that the pg_rewind tests reproducibly fail in
branch 14 for reasons I haven't yet understood.  It's strange that no
other branch fails, even when run quite a few times.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"Siempre hay que alimentar a los dioses, aunque la tierra esté seca" (Orual)

Re: prevent immature WAL streaming

2021-09-21 Thread Alvaro Herrera

I made one final pass over the whole thing to be sure it's all commented
as thoroughly as possible, and changed the names of things to make them
all consistent.  So here's the last version which I intend to push to
all branches soon.  (The only difference in back-branches is that
the xlogreader struct needs to be adjusted to have the new fields at the
bottom.)

One thing to note is that this is an un-downgradeable minor; and of
course people should upgrade standbys before primaries.

On 2021-Sep-17, Alvaro Herrera wrote:

> On 2021-Sep-17, Bossart, Nathan wrote:

> > I see.  IMO feels a bit counterintuitive to validate a partial record
> > that you are ignoring anyway, but I suppose it's still valuable to
> > know when the WAL is badly broken.  It's not expensive, and it doesn't
> > add a ton of complexity, either.
> 
> Yeah, we don't have any WAL record history validation other than the
> verifying the LSN of the previous record; I suppose in this particular
> case you could argue that we shouldn't bother with any validation
> either.  But it seems safer to do it.  It doesn't hurt anything anyway.

BTW we do validate the CRC for all records, but we don't have any means
to validate the CRC of a partial record; so in theory if we don't add
CRC validation of the ignored contents, there would be no validation at
all for that record.  I don't think we care, but it's true that there
would be a big blob in WAL that we don't really know anything about.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
Thou shalt study thy libraries and strive not to reinvent them without
cause, that thy code may be short and readable and thy days pleasant
and productive. (7th Commandment for C Programmers)
>From 8893e2073365c88449a118bd340a78241afe8f97 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 21 Sep 2021 12:43:15 -0300
Subject: [PATCH v8 1/3] Document XLOG_INCLUDE_XID a little better
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

I noticed that commit 0bead9af484c left this flag undocumented in
XLogSetRecordFlags, which led me to discover that the flag doesn't
actually do what the one comment on it said it does.  Improve the
situation by adding some more comments.

Backpatch to 14, where the aforementioned commit appears.

Author: Álvaro Herrera 
---
 src/backend/access/transam/xloginsert.c | 2 ++
 src/include/access/xlog.h   | 2 +-
 src/include/access/xlogrecord.h | 5 +++--
 3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index e596a0470a..e329ae5c2a 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -409,6 +409,8 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
  * - XLOG_MARK_UNIMPORTANT, to signal that the record is not important for
  *	 durability, which allows to avoid triggering WAL archiving and other
  *	 background activity.
+ * - XLOG_INCLUDE_XID, a message-passing hack between XLogRecordAssemble
+ *   and XLogResetInsertion.
  */
 void
 XLogSetRecordFlags(uint8 flags)
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 0a8ede700d..5e2c94a05f 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -212,7 +212,7 @@ extern bool XLOG_DEBUG;
  */
 #define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
 #define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
-#define XLOG_INCLUDE_XID		0x04	/* include XID of top-level xact */
+#define XLOG_INCLUDE_XID		0x04	/* WAL-internal message-passing hack */
 
 
 /* Checkpoint statistics */
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index e06ee92a5e..8d1305eae8 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -215,8 +215,9 @@ typedef struct XLogRecordDataHeaderLong
  * Block IDs used to distinguish different kinds of record fragments. Block
  * references are numbered from 0 to XLR_MAX_BLOCK_ID. A rmgr is free to use
  * any ID number in that range (although you should stick to small numbers,
- * because the WAL machinery is optimized for that case). A couple of ID
- * numbers are reserved to denote the "main" data portion of the record.
+ * because the WAL machinery is optimized for that case). A few ID
+ * numbers are reserved to denote the "main" data portion of the record,
+ * as well as replication-supporting transaction metadata.
  *
  * The maximum is currently set at 32, quite arbitrarily. Most records only
  * need a handful of block references, but there are a few exceptions that
-- 
2.20.1

>From 2daa93570c801f21f5abdacc6d8f77bd62bf5d53 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Thu, 2 Sep 2021 17:21:46 -0400
Subject: [PATCH v8 2/3] Implement FIRST_IS_ABORTED_CONTRECORD

---
 src/backend/access/rmgrdesc/xlogdesc.c  |  12 ++

Re: prevent immature WAL streaming

2021-09-17 Thread Alvaro Herrera

On 2021-Sep-17, Bossart, Nathan wrote:

> > That was the first implementation, a few versions of the patch ago.  An
> > added benefit of a separate WAL record is that you can carry additional
> > data for validation, such as -- as suggested by Andres -- the CRC of the
> > partial data contained in the message that we're skipping.  I didn't
> > implement that, but it should be trivial to add it.
> 
> I see.  IMO feels a bit counterintuitive to validate a partial record
> that you are ignoring anyway, but I suppose it's still valuable to
> know when the WAL is badly broken.  It's not expensive, and it doesn't
> add a ton of complexity, either.

Yeah, we don't have any WAL record history validation other than the
verifying the LSN of the previous record; I suppose in this particular
case you could argue that we shouldn't bother with any validation
either.  But it seems safer to do it.  It doesn't hurt anything anyway.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/

Re: prevent immature WAL streaming

2021-09-17 Thread Bossart, Nathan

On 9/17/21, 1:32 PM, "Alvaro Herrera"  wrote:
> On 2021-Sep-17, Bossart, Nathan wrote:
>
>> I gave the patch a read-through.  I'm wondering if the
>> XLOG_OVERWRITE_CONTRECORD records are actually necessary.  IIUC we
>> will set XLP_FIRST_IS_ABORTED_PARTIAL on the next page, and
>> xlp_pageaddr on that page will already be validated in
>> XLogReaderValidatePageHeader().  Does adding this new record also help
>> ensure the page header is filled in and flushed to disk?
>
> That was the first implementation, a few versions of the patch ago.  An
> added benefit of a separate WAL record is that you can carry additional
> data for validation, such as -- as suggested by Andres -- the CRC of the
> partial data contained in the message that we're skipping.  I didn't
> implement that, but it should be trivial to add it.

I see.  IMO feels a bit counterintuitive to validate a partial record
that you are ignoring anyway, but I suppose it's still valuable to
know when the WAL is badly broken.  It's not expensive, and it doesn't
add a ton of complexity, either.

Nathan

Re: prevent immature WAL streaming

2021-09-17 Thread Alvaro Herrera

On 2021-Sep-17, Bossart, Nathan wrote:

> I gave the patch a read-through.  I'm wondering if the
> XLOG_OVERWRITE_CONTRECORD records are actually necessary.  IIUC we
> will set XLP_FIRST_IS_ABORTED_PARTIAL on the next page, and
> xlp_pageaddr on that page will already be validated in
> XLogReaderValidatePageHeader().  Does adding this new record also help
> ensure the page header is filled in and flushed to disk?

That was the first implementation, a few versions of the patch ago.  An
added benefit of a separate WAL record is that you can carry additional
data for validation, such as -- as suggested by Andres -- the CRC of the
partial data contained in the message that we're skipping.  I didn't
implement that, but it should be trivial to add it.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"How strange it is to find the words "Perl" and "saner" in such close
proximity, with no apparent sense of irony. I doubt that Larry himself
could have managed it." (ncm, http://lwn.net/Articles/174769/)

Re: prevent immature WAL streaming

2021-09-17 Thread Bossart, Nathan

On 9/17/21, 10:35 AM, "Alvaro Herrera"  wrote:
> On 2021-Sep-17, Bossart, Nathan wrote:
>
>> On 9/17/21, 9:37 AM, "Alvaro Herrera"  wrote:
>
>> > If you have some time to try your reproducers with this new proposed
>> > fix, I would appreciate it.
>>
>> I haven't had a chance to look at the patch yet, but it appears to fix
>> things with my original reproduction steps for the archive_status
>> stuff [0].
>
> Thank you, this is good to hear.

I gave the patch a read-through.  I'm wondering if the
XLOG_OVERWRITE_CONTRECORD records are actually necessary.  IIUC we
will set XLP_FIRST_IS_ABORTED_PARTIAL on the next page, and
xlp_pageaddr on that page will already be validated in
XLogReaderValidatePageHeader().  Does adding this new record also help
ensure the page header is filled in and flushed to disk?

Nathan

Re: prevent immature WAL streaming

2021-09-17 Thread Alvaro Herrera

On 2021-Sep-17, Bossart, Nathan wrote:

> On 9/17/21, 9:37 AM, "Alvaro Herrera"  wrote:

> > If you have some time to try your reproducers with this new proposed
> > fix, I would appreciate it.
> 
> I haven't had a chance to look at the patch yet, but it appears to fix
> things with my original reproduction steps for the archive_status
> stuff [0].

Thank you, this is good to hear.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Ed is the standard text editor."
  http://groups.google.com/group/alt.religion.emacs/msg/8d94ddab6a9b0ad3

Re: prevent immature WAL streaming

2021-09-17 Thread Alvaro Herrera

On 2021-Sep-17, Alvaro Herrera wrote:

> Added Matsumura-san to CC, because he was interested in this topic too
> per the earlier thread.

I failed to do this, so hopefully this serves as a ping.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/

Re: prevent immature WAL streaming

2021-09-17 Thread Bossart, Nathan

On 9/17/21, 9:37 AM, "Alvaro Herrera"  wrote:
> OK, this version is much more palatable, because here we verify that the
> OVERWRITE_CONTRECORD we replay matches the record that was lost.  Also,
> I wrote a test script that creates such a broken record (by the simple
> expedient of deleting the WAL file containing the second half while the
> server is down); we then create a standby and we can observe that it
> replays the sequence correctly.
>
> If you have some time to try your reproducers with this new proposed
> fix, I would appreciate it.

I haven't had a chance to look at the patch yet, but it appears to fix
things with my original reproduction steps for the archive_status
stuff [0].

Nathan

[0] https://postgr.es/m/CBDDFA01-6E40-46BB-9F98-9340F4379505%40amazon.com

Re: prevent immature WAL streaming

2021-09-17 Thread Alvaro Herrera

OK, this version is much more palatable, because here we verify that the
OVERWRITE_CONTRECORD we replay matches the record that was lost.  Also,
I wrote a test script that creates such a broken record (by the simple
expedient of deleting the WAL file containing the second half while the
server is down); we then create a standby and we can observe that it
replays the sequence correctly.

If you have some time to try your reproducers with this new proposed
fix, I would appreciate it.


Added Matsumura-san to CC, because he was interested in this topic too
per the earlier thread.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
>From 0404d47ca1456771a1be4c8afb2c1bd1e0fa0f08 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Thu, 2 Sep 2021 17:21:46 -0400
Subject: [PATCH v7 1/2] Implement FIRST_IS_ABORTED_CONTRECORD

---
 src/backend/access/rmgrdesc/xlogdesc.c  |  11 ++
 src/backend/access/transam/xlog.c   | 153 ++--
 src/backend/access/transam/xloginsert.c |   1 +
 src/backend/access/transam/xlogreader.c |  54 -
 src/include/access/xlog_internal.h  |  20 +++-
 src/include/access/xlogreader.h |  14 +++
 src/include/catalog/pg_control.h|   1 +
 7 files changed, 244 insertions(+), 10 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index e6090a9dad..0be382f8a5 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -139,6 +139,14 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		 xlrec.ThisTimeLineID, xlrec.PrevTimeLineID,
 		 timestamptz_to_str(xlrec.end_time));
 	}
+	else if (info == XLOG_OVERWRITE_CONTRECORD)
+	{
+		xl_overwrite_contrecord xlrec;
+
+		memcpy(, rec, sizeof(xl_overwrite_contrecord));
+		appendStringInfo(buf, "lsn %X/%X",
+		 LSN_FORMAT_ARGS(xlrec.overwritten_lsn));
+	}
 }
 
 const char *
@@ -178,6 +186,9 @@ xlog_identify(uint8 info)
 		case XLOG_END_OF_RECOVERY:
 			id = "END_OF_RECOVERY";
 			break;
+		case XLOG_OVERWRITE_CONTRECORD:
+			id = "OVERWRITE_CONTRECORD";
+			break;
 		case XLOG_FPI:
 			id = "FPI";
 			break;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e51a7a749d..b6af3eb258 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -199,6 +199,15 @@ static XLogRecPtr LastRec;
 static XLogRecPtr flushedUpto = 0;
 static TimeLineID receiveTLI = 0;
 
+/*
+ * abortedRecPtr is the start pointer of a broken record at end of WAL when
+ * recovery completes; missingContrecPtr is the location of the first
+ * contrecord that went missing.  See CreateOverwriteContrecordRecord for
+ * details.
+ */
+static XLogRecPtr abortedRecPtr;
+static XLogRecPtr missingContrecPtr;
+
 /*
  * During recovery, lastFullPageWrites keeps track of full_page_writes that
  * the replayed WAL records indicate. It's initialized with full_page_writes
@@ -894,6 +903,7 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI,
 TimeLineID prevTLI);
 static void LocalSetXLogInsertAllowed(void);
 static void CreateEndOfRecoveryRecord(void);
+static XLogRecPtr CreateOverwriteContrecordRecord(XLogRecPtr aborted_lsn);
 static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags);
 static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo);
 static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void);
@@ -927,6 +937,7 @@ static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
 static void CheckRecoveryConsistency(void);
 static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
 		XLogRecPtr RecPtr, int whichChkpt, bool report);
+static void VerifyOverwrittenRecord(XLogReaderState *record);
 static bool rescanLatestTimeLine(void);
 static void InitControlFile(uint64 sysidentifier);
 static void WriteControlFile(void);
@@ -1120,8 +1131,8 @@ XLogInsertRecord(XLogRecData *rdata,
 		 * All the record data, including the header, is now ready to be
 		 * inserted. Copy the record in the space reserved.
 		 */
-		CopyXLogRecordToWAL(rechdr->xl_tot_len, isLogSwitch, rdata,
-			StartPos, EndPos);
+		CopyXLogRecordToWAL(rechdr->xl_tot_len, isLogSwitch,
+			rdata, StartPos, EndPos);
 
 		/*
 		 * Unless record is flagged as not important, update LSN of last
@@ -1504,7 +1515,8 @@ checkXLogConsistency(XLogReaderState *record)
  * area in the WAL.
  */
 static void
-CopyXLogRecordToWAL(int write_len, bool isLogSwitch, XLogRecData *rdata,
+CopyXLogRecordToWAL(int write_len, bool isLogSwitch,
+	XLogRecData *rdata,
 	XLogRecPtr StartPos, XLogRecPtr EndPos)
 {
 	char	   *currpos;
@@ -2246,6 +2258,19 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 		if (!Insert->forcePageWrites)
 			NewPage->xlp_info |= XLP_BKP_REMOVABLE;
 
+		/*
+		 * If a record was found to be broken at the end of recovery, and
+		 * we're going to write on the page where its first contrecord was
+		 * lost, set the

Re: prevent immature WAL streaming

2021-09-15 Thread Alvaro Herrera

On 2021-Sep-15, Kyotaro Horiguchi wrote:

> + CopyXLogRecordToWAL(rechdr->xl_tot_len, isLogSwitch,
> + flags & 
> XLOG_SET_ABORTED_PARTIAL,
> + rdata, StartPos, 
> EndPos);
> 
> The new xlog flag XLOG_SET_ABORTED_PARTIAL is used only by
> RM_XLOG_ID/XLOG_OVERWRITE_CONTRECORD records, so the flag value is the
> equivalent of the record type.

In the new version I removed all this; it was wrong.

> + if (set_aborted_partial)
> + pagehdr->xlp_info |= 
> XLP_FIRST_IS_ABORTED_PARTIAL;
> +
> 
> I'm not sure about the reason for the change from the previous patch
> (I might be missing something), this sets the flag on the *next* page
> of the page where the record starts.  So in the first place we
> shouldn't set the flag there.

You're right, this code is wrong.  And in fact I had already noticed it
yesterday, but much to my embarrasment I forgot to fix it.  Here is a
fixed version, where I moved the flag set back to AdvanceXLInsertBuffer.
I think doing it anywhere else is going to be very painful.  AFAICT we
do the right thing now, but amusingly we don't have any tooling to
verify that the XLP flag is set in the page where we want it.

With this patch we now have two recptrs: the LSN of the broken record,
and the LSN of the missing contrecord.  The latter is where to start
writing WAL after recovery is done, and the former is currently unused
but we could use it to double-check that we're aborting (forgetting) the
correct record.  I didn't try to implement that, but IIUC it is
xlogreader that would have to do that.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"Si quieres ser creativo, aprende el arte de perder el tiempo"
>From f468f058d5b480a9f27f9d1830f3959297accd4d Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Thu, 2 Sep 2021 17:21:46 -0400
Subject: [PATCH v6] Implement FIRST_IS_ABORTED_CONTRECORD

---
 src/backend/access/rmgrdesc/xlogdesc.c  |  11 +++
 src/backend/access/transam/xlog.c   | 112 ++--
 src/backend/access/transam/xloginsert.c |   1 +
 src/backend/access/transam/xlogreader.c |  38 +++-
 src/include/access/xlog_internal.h  |  20 -
 src/include/access/xlogreader.h |   7 ++
 src/include/catalog/pg_control.h|   1 +
 7 files changed, 180 insertions(+), 10 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index e6090a9dad..0be382f8a5 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -139,6 +139,14 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		 xlrec.ThisTimeLineID, xlrec.PrevTimeLineID,
 		 timestamptz_to_str(xlrec.end_time));
 	}
+	else if (info == XLOG_OVERWRITE_CONTRECORD)
+	{
+		xl_overwrite_contrecord xlrec;
+
+		memcpy(, rec, sizeof(xl_overwrite_contrecord));
+		appendStringInfo(buf, "lsn %X/%X",
+		 LSN_FORMAT_ARGS(xlrec.overwritten_lsn));
+	}
 }
 
 const char *
@@ -178,6 +186,9 @@ xlog_identify(uint8 info)
 		case XLOG_END_OF_RECOVERY:
 			id = "END_OF_RECOVERY";
 			break;
+		case XLOG_OVERWRITE_CONTRECORD:
+			id = "OVERWRITE_CONTRECORD";
+			break;
 		case XLOG_FPI:
 			id = "FPI";
 			break;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e51a7a749d..a8e587b5c6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -199,6 +199,15 @@ static XLogRecPtr LastRec;
 static XLogRecPtr flushedUpto = 0;
 static TimeLineID receiveTLI = 0;
 
+/*
+ * abortedRecPtr is the start pointer of a broken record at end of WAL when
+ * recovery completes; missingContrecPtr is the location of the first
+ * contrecord that went missing.  See CreateOverwriteContrecordRecord for
+ * details.
+ */
+static XLogRecPtr abortedRecPtr;
+static XLogRecPtr missingContrecPtr;
+
 /*
  * During recovery, lastFullPageWrites keeps track of full_page_writes that
  * the replayed WAL records indicate. It's initialized with full_page_writes
@@ -894,6 +903,7 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI,
 TimeLineID prevTLI);
 static void LocalSetXLogInsertAllowed(void);
 static void CreateEndOfRecoveryRecord(void);
+static XLogRecPtr CreateOverwriteContrecordRecord(XLogRecPtr aborted_lsn);
 static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags);
 static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo);
 static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void);
@@ -1120,8 +1130,8 @@ XLogInsertRecord(XLogRecData *rdata,
 		 * All the record data, including the header, is now ready to be
 		 * inserted. Copy the record in the space reserved.
 		 */
-		CopyXLogRecordToWAL(rechdr->xl_tot_len, isLogSwitch, rdata,
-			StartPos, EndPos);
+		CopyXLogRecordToWAL(rechdr->xl_tot_len, isLogSwitch,
+			rdata, StartPos, EndPos);

Re: prevent immature WAL streaming

2021-09-14 Thread Kyotaro Horiguchi

At Tue, 14 Sep 2021 22:32:04 -0300, Alvaro Herrera  
wrote in 
> On 2021-Sep-14, Alvaro Herrera wrote:
> 
> > On 2021-Sep-08, Kyotaro Horiguchi wrote:
> > 
> > > Thanks!  As my understanding the new record add the ability to
> > > cross-check between a teard-off contrecord and the new record inserted
> > > after the teard-off record.  I didn't test the version by myself but
> > > the previous version implemented the essential machinery and that
> > > won't change fundamentally by the new record.
> > > 
> > > So I think the current patch deserves to see the algorithm actually
> > > works against the problem.
> > 
> > Here's a version with the new record type.  It passes check-world, and
> > it seems to work correctly to prevent overwrite of the tail end of a
> > segment containing a broken record.  This is very much WIP still;
> > comments are missing and I haven't tried to implement any sort of
> > verification that the record being aborted is the right one.
> 
> Here's the attachment I forgot earlier.

(I missed the chance to complain about that:p)

Tnaks for the patch!

-   CopyXLogRecordToWAL(rechdr->xl_tot_len, isLogSwitch, rdata,
-   StartPos, EndPos);
+   CopyXLogRecordToWAL(rechdr->xl_tot_len, isLogSwitch,
+   flags & 
XLOG_SET_ABORTED_PARTIAL,
+   rdata, StartPos, 
EndPos);

The new xlog flag XLOG_SET_ABORTED_PARTIAL is used only by
RM_XLOG_ID/XLOG_OVERWRITE_CONTRECORD records, so the flag value is the
equivalent of the record type.  We might instead want a new flag
XLOG_SPECIAL_TREATED_RECORD or something to quickly distinguish
records that need a special treat like XLOG_SWITCH.

  if (flags & XLOG_SPECIAL_TREATED_RECORD)
  {
if (rechdr->xl_rmid == RM_XLOG_ID)
{
   if (info ==  XLOG_SWITCH)
  isLogSwitch = true;
   if (info == XLOG_OVERWRITE_CONTRECORD)
  isOverwrite = true;
}
  }
  ..
  CopyXLogRecrodToWAL(.., isLogSwitch, isOverwrite, rdata, StartPos, EndPos);

+   /* XXX can only happen once in the loop. Verify? */
+   if (set_aborted_partial)
+   pagehdr->xlp_info |= 
XLP_FIRST_IS_ABORTED_PARTIAL;
+

I'm not sure about the reason for the change from the previous patch
(I might be missing something), this sets the flag on the *next* page
of the page where the record starts.  So in the first place we
shouldn't set the flag there.  The page header flags of the first page
is set by AdvanceXLInsertBuffer. If we want to set the flag in the
function, we need to find the page header for the beginning of the
record and make sure that the record is placed at the beginning of the
page.  (It is the reason that I did that in AdvanceXLInsertBuffer..)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: prevent immature WAL streaming

2021-09-14 Thread Alvaro Herrera

On 2021-Sep-14, Alvaro Herrera wrote:

> On 2021-Sep-08, Kyotaro Horiguchi wrote:
> 
> > Thanks!  As my understanding the new record add the ability to
> > cross-check between a teard-off contrecord and the new record inserted
> > after the teard-off record.  I didn't test the version by myself but
> > the previous version implemented the essential machinery and that
> > won't change fundamentally by the new record.
> > 
> > So I think the current patch deserves to see the algorithm actually
> > works against the problem.
> 
> Here's a version with the new record type.  It passes check-world, and
> it seems to work correctly to prevent overwrite of the tail end of a
> segment containing a broken record.  This is very much WIP still;
> comments are missing and I haven't tried to implement any sort of
> verification that the record being aborted is the right one.

Here's the attachment I forgot earlier.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"[PostgreSQL] is a great group; in my opinion it is THE best open source
development communities in existence anywhere."(Lamar Owen)
>From 3de9660b2e570604412d435023dcbd13355fc123 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Thu, 2 Sep 2021 17:21:46 -0400
Subject: [PATCH v5] Implement FIRST_IS_ABORTED_CONTRECORD

---
 src/backend/access/rmgrdesc/xlogdesc.c  | 11 +++
 src/backend/access/transam/xlog.c   | 91 +++--
 src/backend/access/transam/xloginsert.c |  3 +
 src/backend/access/transam/xlogreader.c | 39 ++-
 src/include/access/xlog.h   |  1 +
 src/include/access/xlog_internal.h  | 20 +-
 src/include/access/xlogreader.h |  3 +
 src/include/catalog/pg_control.h|  1 +
 8 files changed, 159 insertions(+), 10 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index e6090a9dad..0be382f8a5 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -139,6 +139,14 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		 xlrec.ThisTimeLineID, xlrec.PrevTimeLineID,
 		 timestamptz_to_str(xlrec.end_time));
 	}
+	else if (info == XLOG_OVERWRITE_CONTRECORD)
+	{
+		xl_overwrite_contrecord xlrec;
+
+		memcpy(, rec, sizeof(xl_overwrite_contrecord));
+		appendStringInfo(buf, "lsn %X/%X",
+		 LSN_FORMAT_ARGS(xlrec.overwritten_lsn));
+	}
 }
 
 const char *
@@ -178,6 +186,9 @@ xlog_identify(uint8 info)
 		case XLOG_END_OF_RECOVERY:
 			id = "END_OF_RECOVERY";
 			break;
+		case XLOG_OVERWRITE_CONTRECORD:
+			id = "OVERWRITE_CONTRECORD";
+			break;
 		case XLOG_FPI:
 			id = "FPI";
 			break;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e51a7a749d..73cfd64125 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -586,6 +586,8 @@ typedef struct XLogCtlData
 	XLogRecPtr	replicationSlotMinLSN;	/* oldest LSN needed by any slot */
 
 	XLogSegNo	lastRemovedSegNo;	/* latest removed/recycled XLOG segment */
+	XLogRecPtr	abortedContrecordPtr; /* LSN of incomplete record at end of
+	   * WAL */
 
 	/* Fake LSN counter, for unlogged relations. Protected by ulsn_lck. */
 	XLogRecPtr	unloggedLSN;
@@ -894,6 +896,7 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI,
 TimeLineID prevTLI);
 static void LocalSetXLogInsertAllowed(void);
 static void CreateEndOfRecoveryRecord(void);
+static XLogRecPtr CreateOverwriteContrecordRecord(XLogRecPtr aborted_lsn);
 static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags);
 static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo);
 static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void);
@@ -950,6 +953,7 @@ static void rm_redo_error_callback(void *arg);
 static int	get_sync_bit(int method);
 
 static void CopyXLogRecordToWAL(int write_len, bool isLogSwitch,
+bool set_aborted_partial,
 XLogRecData *rdata,
 XLogRecPtr StartPos, XLogRecPtr EndPos);
 static void ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos,
@@ -1120,8 +1124,9 @@ XLogInsertRecord(XLogRecData *rdata,
 		 * All the record data, including the header, is now ready to be
 		 * inserted. Copy the record in the space reserved.
 		 */
-		CopyXLogRecordToWAL(rechdr->xl_tot_len, isLogSwitch, rdata,
-			StartPos, EndPos);
+		CopyXLogRecordToWAL(rechdr->xl_tot_len, isLogSwitch,
+			flags & XLOG_SET_ABORTED_PARTIAL,
+			rdata, StartPos, EndPos);
 
 		/*
 		 * Unless record is flagged as not important, update LSN of last
@@ -1504,7 +1509,8 @@ checkXLogConsistency(XLogReaderState *record)
  * area in the WAL.
  */
 static void
-CopyXLogRecordToWAL(int write_len, bool isLogSwitch, XLogRecData *rdata,
+CopyXLogRecordToWAL(int write_len, bool isLogSwitch,
+	bool set_aborted_partial, XLogRecData *rdata,
 	XLogRecPtr StartPos, XLogRecPtr EndPos)
 {
 	char	   *currpos;
@@ -1560,6 +1566,10 @@

Re: prevent immature WAL streaming

2021-09-14 Thread Alvaro Herrera

On 2021-Sep-08, Kyotaro Horiguchi wrote:

> Thanks!  As my understanding the new record add the ability to
> cross-check between a teard-off contrecord and the new record inserted
> after the teard-off record.  I didn't test the version by myself but
> the previous version implemented the essential machinery and that
> won't change fundamentally by the new record.
> 
> So I think the current patch deserves to see the algorithm actually
> works against the problem.

Here's a version with the new record type.  It passes check-world, and
it seems to work correctly to prevent overwrite of the tail end of a
segment containing a broken record.  This is very much WIP still;
comments are missing and I haven't tried to implement any sort of
verification that the record being aborted is the right one.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"XML!" Exclaimed C++.  "What are you doing here? You're not a programming
language."
"Tell that to the people who use me," said XML.
https://burningbird.net/the-parable-of-the-languages/

Re: prevent immature WAL streaming

2021-09-08 Thread Kyotaro Horiguchi

At Tue, 7 Sep 2021 18:41:57 +, "Bossart, Nathan"  
wrote in 
> On 9/4/21, 10:26 AM, "Alvaro Herrera"  wrote:
> > Attached are the same patches as last night, except I added a test for
> > XLOG_DEBUG where pertinent.  (The elog(PANIC) is not made conditional on
> > that, since it's a cross-check rather than informative.)  Also fixed the
> > silly pg_rewind mistake I made.
> >
> > I'll work on the new xlog record early next week.
> 
> Are these patches in a good state for some preliminary testing?  I'd
> like to try them out, but I'll hold off if they're not quite ready
> yet.

Thanks!  As my understanding the new record add the ability to
cross-check between a teard-off contrecord and the new record inserted
after the teard-off record.  I didn't test the version by myself but
the previous version implemented the essential machinery and that
won't change fundamentally by the new record.

So I think the current patch deserves to see the algorithm actually
works against the problem.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: prevent immature WAL streaming

2021-09-07 Thread Bossart, Nathan

On 9/4/21, 10:26 AM, "Alvaro Herrera"  wrote:
> Attached are the same patches as last night, except I added a test for
> XLOG_DEBUG where pertinent.  (The elog(PANIC) is not made conditional on
> that, since it's a cross-check rather than informative.)  Also fixed the
> silly pg_rewind mistake I made.
>
> I'll work on the new xlog record early next week.

Are these patches in a good state for some preliminary testing?  I'd
like to try them out, but I'll hold off if they're not quite ready
yet.

Nathan

Re: prevent immature WAL streaming

2021-09-04 Thread Alvaro Herrera

On 2021-Sep-03, Andres Freund wrote:

> > +#ifdef WAL_DEBUG
> > +   ereport(LOG,
> > +   (errmsg_internal("recovery overwriting broken 
> > contrecord at %X/%X (EndRecPtr: %X/%X)",
> > +
> > LSN_FORMAT_ARGS(abortedContrecordPtr),
> > +
> > LSN_FORMAT_ARGS(EndRecPtr;
> > +#endif
> 
> "broken" sounds a bit off. But then, it's just WAL_DEBUG. Which made me
> realize, isn't this missing a
> if (XLOG_DEBUG)?

Attached are the same patches as last night, except I added a test for
XLOG_DEBUG where pertinent.  (The elog(PANIC) is not made conditional on
that, since it's a cross-check rather than informative.)  Also fixed the
silly pg_rewind mistake I made.

I'll work on the new xlog record early next week.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
"I can see support will not be a problem.  10 out of 10."(Simon Wittber)
  (http://archives.postgresql.org/pgsql-general/2004-12/msg00159.php)
>From 4173fc5000c9101604e6c64a795322771cb0687a Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Thu, 2 Sep 2021 17:21:46 -0400
Subject: [PATCH v4 1/4] Implement FIRST_IS_ABORTED_CONTRECORD

---
 src/backend/access/transam/xlog.c   | 55 +++--
 src/backend/access/transam/xlogreader.c | 39 +-
 src/include/access/xlog_internal.h  | 14 ++-
 src/include/access/xlogreader.h |  3 ++
 4 files changed, 105 insertions(+), 6 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e51a7a749d..49912483d5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -586,6 +586,8 @@ typedef struct XLogCtlData
 	XLogRecPtr	replicationSlotMinLSN;	/* oldest LSN needed by any slot */
 
 	XLogSegNo	lastRemovedSegNo;	/* latest removed/recycled XLOG segment */
+	XLogRecPtr	abortedContrecordPtr; /* LSN of incomplete record at end of
+	   * WAL */
 
 	/* Fake LSN counter, for unlogged relations. Protected by ulsn_lck. */
 	XLogRecPtr	unloggedLSN;
@@ -848,6 +850,7 @@ static XLogSource XLogReceiptSource = XLOG_FROM_ANY;
 /* State information for XLOG reading */
 static XLogRecPtr ReadRecPtr;	/* start of last record read */
 static XLogRecPtr EndRecPtr;	/* end+1 of last record read */
+static XLogRecPtr abortedContrecordPtr;	/* end+1 of incomplete record */
 
 /*
  * Local copies of equivalent fields in the control file.  When running
@@ -2246,6 +2249,31 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 		if (!Insert->forcePageWrites)
 			NewPage->xlp_info |= XLP_BKP_REMOVABLE;
 
+		/*
+		 * If the last page ended with an aborted partial continuation record,
+		 * mark the new page to indicate that the partial record can be
+		 * omitted.
+		 *
+		 * This happens only once at the end of recovery, so there's no race
+		 * condition here.
+		 */
+		if (XLogCtl->abortedContrecordPtr >= NewPageBeginPtr)
+		{
+#ifdef WAL_DEBUG
+			if (XLogCtl->abortedContrecordPtr != NewPageBeginPtr)
+elog(PANIC, "inconsistent aborted contrecord location %X/%X, expected %X/%X",
+	 LSN_FORMAT_ARGS(XLogCtl->abortedContrecordPtr),
+	 LSN_FORMAT_ARGS(NewPageBeginPtr));
+			if (XLOG_DEBUG)
+ereport(LOG,
+		(errmsg_internal("setting XLP_FIRST_IS_ABORTED_PARTIAL flag at %X/%X",
+		 LSN_FORMAT_ARGS(NewPageBeginPtr;
+#endif
+			NewPage->xlp_info |= XLP_FIRST_IS_ABORTED_PARTIAL;
+
+			XLogCtl->abortedContrecordPtr = InvalidXLogRecPtr;
+		}
+
 		/*
 		 * If first page of an XLOG segment file, make it a long header.
 		 */
@@ -4392,6 +4420,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 		record = XLogReadRecord(xlogreader, );
 		ReadRecPtr = xlogreader->ReadRecPtr;
 		EndRecPtr = xlogreader->EndRecPtr;
+		abortedContrecordPtr = xlogreader->abortedContrecordPtr;
 		if (record == NULL)
 		{
 			if (readFile >= 0)
@@ -7691,10 +7720,30 @@ StartupXLOG(void)
 	/*
 	 * Re-fetch the last valid or last applied record, so we can identify the
 	 * exact endpoint of what we consider the valid portion of WAL.
+	 *
+	 * When recovery ended in an incomplete record, continue writing from the
+	 * point where it went missing.  This leaves behind an initial part of
+	 * broken record, which rescues downstream which have already received
+	 * that first part.
 	 */
-	XLogBeginRead(xlogreader, LastRec);
-	record = ReadRecord(xlogreader, PANIC, false);
-	EndOfLog = EndRecPtr;
+	if (XLogRecPtrIsInvalid(abortedContrecordPtr))
+	{
+		XLogBeginRead(xlogreader, LastRec);
+		record = ReadRecord(xlogreader, PANIC, false);
+		EndOfLog = EndRecPtr;
+	}
+	else
+	{
+#ifdef WAL_DEBUG
+		if (XLOG_DEBUG)
+			ereport(LOG,
+	(errmsg_internal("recovery overwriting broken contrecord at %X/%X (EndRecPtr: %X/%X)",
+	 LSN_FORMAT_ARGS(abortedContrecordPtr),
+	 LSN_FORMAT_ARGS(EndRecPtr;
+#endif
+		EndOfLog =

Re: prevent immature WAL streaming

2021-09-04 Thread Alvaro Herrera

On 2021-Sep-03, Andres Freund wrote:

> Hi,
> 
> On 2021-09-03 20:01:50 -0400, Alvaro Herrera wrote:
> > From 6abc5026f92b99d704bff527d1306eb8588635e9 Mon Sep 17 00:00:00 2001
> > From: Alvaro Herrera 
> > Date: Tue, 31 Aug 2021 20:55:10 -0400
> > Subject: [PATCH v3 1/5] Revert "Avoid creating archive status ".ready" files
> >  too early"
> 
> > This reverts commit 515e3d84a0b58b58eb30194209d2bc47ed349f5b.
> 
> I'd prefer to see this pushed soon. I've a bunch of patches to xlog.c that
> conflict with the prior changes, and rebasing back and forth isn't that much
> fun...

Done.

> > +   /*
> > +* If we were expecting a continuation record and got 
> > an "aborted
> > +* partial" flag, that means the continuation record 
> > was lost.
> > +* Ignore the record we were reading, since we now know 
> > it's broken
> > +* and lost forever, and restart the read by assuming 
> > the address
> > +* to read is the location where we found this flag.
> > +*/
> > +   if (pageHeader->xlp_info & XLP_FIRST_IS_ABORTED_PARTIAL)
> > +   {
> > +   ResetDecoder(state);
> > +   RecPtr = targetPagePtr;
> > +   goto restart;
> > +   }
> 
> I think we need to add more validation to this path. What I was proposing
> earlier is that we add a new special type of record at the start of an
> XLP_FIRST_IS_ABORTED_PARTIAL page, which contains a) lsn of the record we're
> aborting, b) checksum of the data up to this point.

Hmm, a new record type?  Yeah, we can do that, and sounds like it would
make things simpler too -- I wasn't too happy about adding clutter to
AdvanceXLInsertBuffer either, but the alternative I had in mind was that
we'd pass the flag to the checkpointer, which seemed quite annoying API
wise.  But if we add a new record, seems we can write it directly in
StartupXLOG and avoid most nastiness.  I'm not too sure about the
checksum up to this point, though, but I'll spend some time with the
idea to see how bad it is.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
Voy a acabar con todos los humanos / con los humanos yo acabaré
voy a acabar con todos (bis) / con todos los humanos acabaré ¡acabaré! (Bender)

Re: prevent immature WAL streaming

2021-09-03 Thread Andres Freund

Hi,

On 2021-09-03 20:01:50 -0400, Alvaro Herrera wrote:
> From 6abc5026f92b99d704bff527d1306eb8588635e9 Mon Sep 17 00:00:00 2001
> From: Alvaro Herrera 
> Date: Tue, 31 Aug 2021 20:55:10 -0400
> Subject: [PATCH v3 1/5] Revert "Avoid creating archive status ".ready" files
>  too early"

> This reverts commit 515e3d84a0b58b58eb30194209d2bc47ed349f5b.

I'd prefer to see this pushed soon. I've a bunch of patches to xlog.c that
conflict with the prior changes, and rebasing back and forth isn't that much
fun...



> From f767cdddb3120f1f6c079c8eb00eaff38ea98c79 Mon Sep 17 00:00:00 2001
> From: Alvaro Herrera 
> Date: Thu, 2 Sep 2021 17:21:46 -0400
> Subject: [PATCH v3 2/5] Implement FIRST_IS_ABORTED_CONTRECORD
>
> ---
>  src/backend/access/transam/xlog.c   | 53 +++--
>  src/backend/access/transam/xlogreader.c | 39 +-
>  src/include/access/xlog_internal.h  | 14 ++-
>  src/include/access/xlogreader.h |  3 ++
>  4 files changed, 103 insertions(+), 6 deletions(-)
>
> diff --git a/src/backend/access/transam/xlog.c 
> b/src/backend/access/transam/xlog.c
> index e51a7a749d..411f1618df 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -586,6 +586,8 @@ typedef struct XLogCtlData
>   XLogRecPtr  replicationSlotMinLSN;  /* oldest LSN needed by any 
> slot */
>
>   XLogSegNo   lastRemovedSegNo;   /* latest removed/recycled XLOG 
> segment */
> + XLogRecPtr  abortedContrecordPtr; /* LSN of incomplete record at 
> end of
> +* 
> WAL */
>
>   /* Fake LSN counter, for unlogged relations. Protected by ulsn_lck. */
>   XLogRecPtr  unloggedLSN;
> @@ -848,6 +850,7 @@ static XLogSource XLogReceiptSource = XLOG_FROM_ANY;
>  /* State information for XLOG reading */
>  static XLogRecPtr ReadRecPtr;/* start of last record read */
>  static XLogRecPtr EndRecPtr; /* end+1 of last record read */
> +static XLogRecPtr abortedContrecordPtr;  /* end+1 of incomplete record */
>
>  /*
>   * Local copies of equivalent fields in the control file.  When running
> @@ -2246,6 +2249,30 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool 
> opportunistic)
>   if (!Insert->forcePageWrites)
>   NewPage->xlp_info |= XLP_BKP_REMOVABLE;
>
> + /*
> +  * If the last page ended with an aborted partial continuation 
> record,
> +  * mark the new page to indicate that the partial record can be
> +  * omitted.
> +  *
> +  * This happens only once at the end of recovery, so there's no 
> race
> +  * condition here.
> +  */
> + if (XLogCtl->abortedContrecordPtr >= NewPageBeginPtr)
> + {

Can we move this case out of AdvanceXLInsertBuffer()? As the comment says,
this only happens at the end of recovery, so putting it into
AdvanceXLInsertBuffer() doesn't really seem necessary?


> +#ifdef WAL_DEBUG
> + if (XLogCtl->abortedContrecordPtr != NewPageBeginPtr)
> + elog(PANIC, "inconsistent aborted contrecord 
> location %X/%X, expected %X/%X",
> +  
> LSN_FORMAT_ARGS(XLogCtl->abortedContrecordPtr),
> +  LSN_FORMAT_ARGS(NewPageBeginPtr));
> + ereport(LOG,
> + (errmsg_internal("setting 
> XLP_FIRST_IS_ABORTED_PARTIAL flag at %X/%X",
> +  
> LSN_FORMAT_ARGS(NewPageBeginPtr;
> +#endif
> + NewPage->xlp_info |= XLP_FIRST_IS_ABORTED_PARTIAL;
> +
> + XLogCtl->abortedContrecordPtr = InvalidXLogRecPtr;
> + }

>   /*
>* If first page of an XLOG segment file, make it a long header.
>*/
> @@ -4392,6 +4419,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
>   record = XLogReadRecord(xlogreader, );
>   ReadRecPtr = xlogreader->ReadRecPtr;
>   EndRecPtr = xlogreader->EndRecPtr;
> + abortedContrecordPtr = xlogreader->abortedContrecordPtr;
>   if (record == NULL)
>   {
>   if (readFile >= 0)
> @@ -7691,10 +7719,29 @@ StartupXLOG(void)
>   /*
>* Re-fetch the last valid or last applied record, so we can identify 
> the
>* exact endpoint of what we consider the valid portion of WAL.
> +  *
> +  * When recovery ended in an incomplete record, continue writing from 
> the
> +  * point where it went missing.  This leaves behind an initial part of
> +  * broken record, which rescues downstream which have already received
> +  * that first part.
>*/
> - XLogBeginRead(xlogreader, LastRec);
> - record = ReadRecord(xlogreader, PANIC, false);
> -

Re: prevent immature WAL streaming

2021-09-03 Thread Alvaro Herrera

I thought that the way to have debug output for this new WAL code is to
use WAL_DEBUG; that way it won't bother anyone and we can remove them
later if necessary.

Also, I realized that we should cause any error in the path that
assembles a record from contrecords is to set a flag that we can test
after the standard "err:" label; no need to create a new label.

I also wrote a lot more comments to try and explain what is going on and
why.

I'm still unsure about the two-flags reporting in xlogreader, so I put
that in a separate commit.  Opinions on that one?

The last commit is something I noticed in pg_rewind ...

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
"No hay ausente sin culpa ni presente sin disculpa" (Prov. francés)
>From 6abc5026f92b99d704bff527d1306eb8588635e9 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 31 Aug 2021 20:55:10 -0400
Subject: [PATCH v3 1/5] Revert "Avoid creating archive status ".ready" files
 too early"

This reverts commit 515e3d84a0b58b58eb30194209d2bc47ed349f5b.
---
 src/backend/access/transam/timeline.c|   2 +-
 src/backend/access/transam/xlog.c| 220 ++-
 src/backend/access/transam/xlogarchive.c |  17 +-
 src/backend/postmaster/walwriter.c   |   7 -
 src/backend/replication/walreceiver.c|   6 +-
 src/include/access/xlog.h|   1 -
 src/include/access/xlogarchive.h |   4 +-
 src/include/access/xlogdefs.h|   1 -
 8 files changed, 24 insertions(+), 234 deletions(-)

diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index acd5c2431d..8d0903c175 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -452,7 +452,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	if (XLogArchivingActive())
 	{
 		TLHistoryFileName(histfname, newTLI);
-		XLogArchiveNotify(histfname, true);
+		XLogArchiveNotify(histfname);
 	}
 }
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 24165ab03e..e51a7a749d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -724,18 +724,6 @@ typedef struct XLogCtlData
 	XLogRecPtr	lastFpwDisableRecPtr;
 
 	slock_t		info_lck;		/* locks shared variables shown above */
-
-	/*
-	 * Variables used to track segment-boundary-crossing WAL records.  See
-	 * RegisterSegmentBoundary.  Protected by segtrack_lck.
-	 */
-	XLogSegNo	lastNotifiedSeg;
-	XLogSegNo	earliestSegBoundary;
-	XLogRecPtr	earliestSegBoundaryEndPtr;
-	XLogSegNo	latestSegBoundary;
-	XLogRecPtr	latestSegBoundaryEndPtr;
-
-	slock_t		segtrack_lck;	/* locks shared variables shown above */
 } XLogCtlData;
 
 static XLogCtlData *XLogCtl = NULL;
@@ -932,7 +920,6 @@ static void RemoveXlogFile(const char *segname, XLogSegNo recycleSegNo,
 		   XLogSegNo *endlogSegNo);
 static void UpdateLastRemovedPtr(char *filename);
 static void ValidateXLOGDirectoryStructure(void);
-static void RegisterSegmentBoundary(XLogSegNo seg, XLogRecPtr pos);
 static void CleanupBackupHistory(void);
 static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
@@ -1167,56 +1154,23 @@ XLogInsertRecord(XLogRecData *rdata,
 	END_CRIT_SECTION();
 
 	/*
-	 * If we crossed page boundary, update LogwrtRqst.Write; if we crossed
-	 * segment boundary, register that and wake up walwriter.
+	 * Update shared LogwrtRqst.Write, if we crossed page boundary.
 	 */
 	if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
 	{
-		XLogSegNo	StartSeg;
-		XLogSegNo	EndSeg;
-
-		XLByteToSeg(StartPos, StartSeg, wal_segment_size);
-		XLByteToSeg(EndPos, EndSeg, wal_segment_size);
-
-		/*
-		 * Register our crossing the segment boundary if that occurred.
-		 *
-		 * Note that we did not use XLByteToPrevSeg() for determining the
-		 * ending segment.  This is so that a record that fits perfectly into
-		 * the end of the segment causes the latter to get marked ready for
-		 * archival immediately.
-		 */
-		if (StartSeg != EndSeg && XLogArchivingActive())
-			RegisterSegmentBoundary(EndSeg, EndPos);
-
-		/*
-		 * Advance LogwrtRqst.Write so that it includes new block(s).
-		 *
-		 * We do this after registering the segment boundary so that the
-		 * comparison with the flushed pointer below can use the latest value
-		 * known globally.
-		 */
 		SpinLockAcquire(>info_lck);
+		/* advance global request to include new block(s) */
 		if (XLogCtl->LogwrtRqst.Write < EndPos)
 			XLogCtl->LogwrtRqst.Write = EndPos;
 		/* update local result copy while I have the chance */
 		LogwrtResult = XLogCtl->LogwrtResult;
 		SpinLockRelease(>info_lck);
-
-		/*
-		 * There's a chance that the record was already flushed to disk and we
-		 * missed marking segments as ready for archive.  If this happens, we
-		 * nudge the WALWriter, which will take care of notifying segments as
-		 * needed.
-		 */
-		if (StartSeg != EndSeg &&

Re: prevent immature WAL streaming

2021-09-03 Thread Alvaro Herrera

On 2021-Sep-03, Andres Freund wrote:

> Hi,
> 
> On 2021-09-03 12:55:23 -0400, Alvaro Herrera wrote:
> > Oh, but of course we can't modify XLogReaderState in backbranches to add
> > the new struct member abortedContrecordPtr (or whatever we end up naming
> > that.)
> 
> Why is that? Afaict it's always allocated via XLogReaderAllocate(), so adding
> a new field at the end should be fine?

Hmm, true, that works.

> That said, I'm worried that this stuff is too complicated to get right in the
> backbranches. I suspect letting it stew in master for a while before
> backpatching would be a good move.

Sure, we can put it in master now and backpatch before the November
minors if everything goes well.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/

Re: prevent immature WAL streaming

2021-09-03 Thread Andres Freund

Hi,

On 2021-09-03 12:55:23 -0400, Alvaro Herrera wrote:
> Oh, but of course we can't modify XLogReaderState in backbranches to add
> the new struct member abortedContrecordPtr (or whatever we end up naming
> that.)

Why is that? Afaict it's always allocated via XLogReaderAllocate(), so adding
a new field at the end should be fine?

That said, I'm worried that this stuff is too complicated to get right in the
backbranches. I suspect letting it stew in master for a while before
backpatching would be a good move.

Greetings,

Andres Freund

Re: prevent immature WAL streaming

2021-09-03 Thread Alvaro Herrera

On 2021-Sep-03, Kyotaro Horiguchi wrote:

> At Thu, 2 Sep 2021 18:43:33 -0400, Alvaro Herrera  
> wrote in 

> The name sounds like the start LSN. doesn't contrecordAbort(ed)Ptr work?
> 
> > if (!(pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD))
> > {
> > report_invalid_record(state,
> >   
> > "there is no contrecord flag at %X/%X",
> >   
> > LSN_FORMAT_ARGS(RecPtr));
> > +   goto aborted_contrecord;
> 
> This loses the exclusion check between XLP_FIRST_IS_CONTRECORD and
> _IS_ABROTED_PARTIAL.  Is it okay?  (I don't object to remove the check.).

On second thought, I'm not sure that we should make xlogreader report an
invalid record here.  If we do, how is the user going to recover?
Recovery will stop there and lose whatever was written afterwards.
Maybe you could claim that if both bits are set then WAL is corrupted,
so it's okay to stop recovery.  But if WAL is really corrupted, then the
CRC check will fail.  All in all, I think I'd rather ignore the flag if
we see it set.

At most, we could have an

#ifndef FRONTEND
ereport(WARNING, "found unexpected flag xyz");
#endif

or something like that.  However, xlogreader does not currently have
anything like that, so I'm not completely sure.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/

Re: prevent immature WAL streaming

2021-09-03 Thread Alvaro Herrera

Oh, but of course we can't modify XLogReaderState in backbranches to add
the new struct member abortedContrecordPtr (or whatever we end up naming
that.)

I think I'm going to fix this, in backbranches only, by having
xlogreader.c have a global variable, which is going to be used by
ReadRecord instead of accessing the struct member.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/

Re: prevent immature WAL streaming

2021-09-03 Thread Alvaro Herrera

On 2021-Sep-03, Kyotaro Horiguchi wrote:

> At Thu, 2 Sep 2021 18:43:33 -0400, Alvaro Herrera  
> wrote in 

> 0002:
> > +   XLogRecPtr  abortedContrecordPtr; /* LSN of incomplete record at 
> > end of
> > +  * 
> > WAL */
> 
> The name sounds like the start LSN. doesn't contrecordAbort(ed)Ptr work?

I went over various iterations of the name of this, and still not
entirely happy.  I think we need to convey the ideas that

* This is the endptr+1 of the known-good part of the record, that is,
  the beginning of the next part of the record.  I think "endPtr"
  summarizes this well; we use this name elsewhere.

* At some point before recovery, this was the last WAL record that
  existed

* there is an invalid contrecord, or we were looking for a contrecord
  and found invalid data

* this record is incomplete

So maybe
1. incompleteRecEndPtr
2. finalInvalidRecEndPtr
3. brokenContrecordEndPtr

> > if (!(pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD))
> > {
> > report_invalid_record(state,
> >   
> > "there is no contrecord flag at %X/%X",
> >   
> > LSN_FORMAT_ARGS(RecPtr));
> > -   goto err;
> > +   goto aborted_contrecord;
> 
> This loses the exclusion check between XLP_FIRST_IS_CONTRECORD and
> _IS_ABROTED_PARTIAL.  Is it okay?  (I don't object to remove the check.).

Yeah, I was unsure about that.  I think it's good to have it as a
cross-check, though it should never occur.  I'll put it back.

Another related point is whether it's a good idea to have the ereport()
about the bit appearing in a not-start-of-page address being a PANIC.
If we degrade to WARNING then it'll be lost in the noise, but I'm not
sure what else can we do.  (If it's a PANIC, then you end up with an
unusable database).

> I didin't thought this as an aborted contrecord. but on second
> thought, when we see a record broken in any style, we stop recovery at
> the point. I agree to the change and all the silmiar changes.
> 
> + /* XXX should we goto 
> aborted_contrecord here? */
> 
> I think it should be aborted_contrecord.
> 
> When that happens, the loaded bytes actually looked like the first
> fragment of a continuation record to xlogreader, even if the cause
> were a broken total_len.  So if we abort the record there, the next
> time xlogreader will meet XLP_FIRST_IS_ABORTED_PARTIAL at the same
> page, and correctly finds a new record there.
> 
> On the other hand if we just errored-out there, we will step-back to
> the beginning of the broken record in the previous page or segment
> then start writing a new record there but that is exactly what we want
> to avoid now.

Hmm, yeah, makes sense.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/

Re: prevent immature WAL streaming

2021-09-03 Thread Kyotaro Horiguchi

At Thu, 2 Sep 2021 18:43:33 -0400, Alvaro Herrera  
wrote in 
> On 2021-Sep-02, Kyotaro Horiguchi wrote:
> 
> > So, this is a crude PoC of that.
> 
> I had ended up with something very similar, except I was trying to cram
> the flag via the checkpoint record instead of hacking
> AdvanceXLInsertBuffer().  I removed that stuff and merged both, here's
> the result.
> 
> > 1. This patch is written on the current master, but it doesn't
> >   interfare with the seg-boundary-memorize patch since it removes the
> >   calls to RegisterSegmentBoundary.
> 
> I rebased on top of the revert patch.

Thanks!

> > 2. Since xlogreader cannot emit a log-message immediately, we don't
> >   have a means to leave a log message to inform recovery met an
> >   aborted partial continuation record. (In this PoC, it is done by
> >   fprintf:p)
> 
> Shrug.  We can just use an #ifndef FRONTEND / elog(LOG).  (I didn't keep
> this part, sorry.)

No problem, it was mere a develop-time message for behavior
observation.

> > 3. Myebe we need to pg_waldump to show partial continuation records,
> >   but I'm not sure how to realize that.
> 
> Ah yes, we'll need to fix that.

I just believe 0001 does the right thing.

0002:
> + XLogRecPtr  abortedContrecordPtr; /* LSN of incomplete record at 
> end of
> +* 
> WAL */

The name sounds like the start LSN. doesn't contrecordAbort(ed)Ptr work?

>   if (!(pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD))
>   {
>   report_invalid_record(state,
> 
> "there is no contrecord flag at %X/%X",
> 
> LSN_FORMAT_ARGS(RecPtr));
> - goto err;
> + goto aborted_contrecord;

This loses the exclusion check between XLP_FIRST_IS_CONTRECORD and
_IS_ABROTED_PARTIAL.  Is it okay?  (I don't object to remove the check.).

I didin't thought this as an aborted contrecord. but on second
thought, when we see a record broken in any style, we stop recovery at
the point. I agree to the change and all the silmiar changes.

+   /* XXX should we goto 
aborted_contrecord here? */

I think it should be aborted_contrecord.

When that happens, the loaded bytes actually looked like the first
fragment of a continuation record to xlogreader, even if the cause
were a broken total_len.  So if we abort the record there, the next
time xlogreader will meet XLP_FIRST_IS_ABORTED_PARTIAL at the same
page, and correctly finds a new record there.

On the other hand if we just errored-out there, we will step-back to
the beginning of the broken record in the previous page or segment
then start writing a new record there but that is exactly what we want
to avoid now.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: prevent immature WAL streaming

2021-09-02 Thread Alvaro Herrera

On 2021-Sep-02, Kyotaro Horiguchi wrote:

> So, this is a crude PoC of that.

I had ended up with something very similar, except I was trying to cram
the flag via the checkpoint record instead of hacking
AdvanceXLInsertBuffer().  I removed that stuff and merged both, here's
the result.

> 1. This patch is written on the current master, but it doesn't
>   interfare with the seg-boundary-memorize patch since it removes the
>   calls to RegisterSegmentBoundary.

I rebased on top of the revert patch.

> 2. Since xlogreader cannot emit a log-message immediately, we don't
>   have a means to leave a log message to inform recovery met an
>   aborted partial continuation record. (In this PoC, it is done by
>   fprintf:p)

Shrug.  We can just use an #ifndef FRONTEND / elog(LOG).  (I didn't keep
this part, sorry.)

> 3. Myebe we need to pg_waldump to show partial continuation records,
>   but I'm not sure how to realize that.

Ah yes, we'll need to fix that.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Las navajas y los monos deben estar siempre distantes"   (Germán Poo)
>From af9dba1c1523ea8b7963e01fdd5e357b142e69f8 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 31 Aug 2021 20:55:10 -0400
Subject: [PATCH v2 1/3] Revert "Avoid creating archive status ".ready" files
 too early"

This reverts commit 515e3d84a0b58b58eb30194209d2bc47ed349f5b.
---
 src/backend/access/transam/timeline.c|   2 +-
 src/backend/access/transam/xlog.c| 220 ++-
 src/backend/access/transam/xlogarchive.c |  17 +-
 src/backend/postmaster/walwriter.c   |   7 -
 src/backend/replication/walreceiver.c|   6 +-
 src/include/access/xlog.h|   1 -
 src/include/access/xlogarchive.h |   4 +-
 src/include/access/xlogdefs.h|   1 -
 8 files changed, 24 insertions(+), 234 deletions(-)

diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index acd5c2431d..8d0903c175 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -452,7 +452,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	if (XLogArchivingActive())
 	{
 		TLHistoryFileName(histfname, newTLI);
-		XLogArchiveNotify(histfname, true);
+		XLogArchiveNotify(histfname);
 	}
 }
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 24165ab03e..e51a7a749d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -724,18 +724,6 @@ typedef struct XLogCtlData
 	XLogRecPtr	lastFpwDisableRecPtr;
 
 	slock_t		info_lck;		/* locks shared variables shown above */
-
-	/*
-	 * Variables used to track segment-boundary-crossing WAL records.  See
-	 * RegisterSegmentBoundary.  Protected by segtrack_lck.
-	 */
-	XLogSegNo	lastNotifiedSeg;
-	XLogSegNo	earliestSegBoundary;
-	XLogRecPtr	earliestSegBoundaryEndPtr;
-	XLogSegNo	latestSegBoundary;
-	XLogRecPtr	latestSegBoundaryEndPtr;
-
-	slock_t		segtrack_lck;	/* locks shared variables shown above */
 } XLogCtlData;
 
 static XLogCtlData *XLogCtl = NULL;
@@ -932,7 +920,6 @@ static void RemoveXlogFile(const char *segname, XLogSegNo recycleSegNo,
 		   XLogSegNo *endlogSegNo);
 static void UpdateLastRemovedPtr(char *filename);
 static void ValidateXLOGDirectoryStructure(void);
-static void RegisterSegmentBoundary(XLogSegNo seg, XLogRecPtr pos);
 static void CleanupBackupHistory(void);
 static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
@@ -1167,56 +1154,23 @@ XLogInsertRecord(XLogRecData *rdata,
 	END_CRIT_SECTION();
 
 	/*
-	 * If we crossed page boundary, update LogwrtRqst.Write; if we crossed
-	 * segment boundary, register that and wake up walwriter.
+	 * Update shared LogwrtRqst.Write, if we crossed page boundary.
 	 */
 	if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
 	{
-		XLogSegNo	StartSeg;
-		XLogSegNo	EndSeg;
-
-		XLByteToSeg(StartPos, StartSeg, wal_segment_size);
-		XLByteToSeg(EndPos, EndSeg, wal_segment_size);
-
-		/*
-		 * Register our crossing the segment boundary if that occurred.
-		 *
-		 * Note that we did not use XLByteToPrevSeg() for determining the
-		 * ending segment.  This is so that a record that fits perfectly into
-		 * the end of the segment causes the latter to get marked ready for
-		 * archival immediately.
-		 */
-		if (StartSeg != EndSeg && XLogArchivingActive())
-			RegisterSegmentBoundary(EndSeg, EndPos);
-
-		/*
-		 * Advance LogwrtRqst.Write so that it includes new block(s).
-		 *
-		 * We do this after registering the segment boundary so that the
-		 * comparison with the flushed pointer below can use the latest value
-		 * known globally.
-		 */
 		SpinLockAcquire(>info_lck);
+		/* advance global request to include new block(s) */
 		if (XLogCtl->LogwrtRqst.Write < EndPos)
 			XLogCtl->LogwrtRqst.Write = EndPos;
 		/* update local result copy while I have the chance */

Re: prevent immature WAL streaming

2021-09-01 Thread Kyotaro Horiguchi

At Wed, 1 Sep 2021 15:01:43 +0900, Fujii Masao  
wrote in 
> 
> 
> On 2021/09/01 12:15, Andres Freund wrote:
> > Hi,
> > On 2021-09-01 11:34:34 +0900, Fujii Masao wrote:
> >> On 2021/09/01 0:53, Andres Freund wrote:
> >>> Of course, we need to be careful to not weaken WAL validity checking
> >>> too
> >>> much. How about the following:
> >>>
> >>> If we're "aborting" a continued record, we set
> >>> XLP_FIRST_IS_ABORTED_PARTIAL on
> >>> the page at which we do so (i.e. the page after the valid end of the
> >>> WAL).
> >>
> >> When do you expect that XLP_FIRST_IS_ABORTED_PARTIAL is set? It's set
> >> when recovery finds a a partially-flushed segment-spanning record?
> >> But maybe we cannot do that (i.e., cannot overwrite the page) because
> >> the page that the flag is set in might have already been archived. No?
> > I was imagining that XLP_FIRST_IS_ABORTED_PARTIAL would be set in the
> > "tail
> > end" of a partial record. I.e. if there's a partial record starting in
> > the
> > successfully archived segment A, but the end of the record, in B, has
> > not been
> > written to disk before a crash, we'd set XLP_FIRST_IS_ABORTED_PARTIAL
> > at the
> > end of the valid data in B. Which could not have been archived yet, or
> > we'd
> > not have a partial record.  So we should never need to set the flag on
> > an
> > already archived page.
> 
> Thanks for clarifying that! Unless I misunderstand that, when recovery
> ends
> at a partially-flushed segment-spanning record, it sets
> XLP_FIRST_IS_ABORTED_PARTIAL in the next segment before starting
> writing
> new WAL data there. Therefore XLP_FIRST_IS_CONTRECORD or
> XLP_FIRST_IS_ABORTED_PARTIAL must be set in the next segment following
> a partially-flushed segment-spanning record. If none of them is set,
> the validation code in recovery should report an error.
> 
> Yes, this design looks good to me.

So, this is a crude PoC of that.

At the end of recovery:

- When XLogReadRecord misses a page where the next part of the current
  continuation record should be seen, xlogreader->ContRecAbortPtr is
  set to the beginning of the missing page.

- When StartupXLOG receives a valid ContRecAbortPtr, the value is used
  as the next WAL insertion location then sets the same to
  XLogCtl->contAbortedRecPtr.

- When XLogCtl->contAbortedRecPtr is set, AdvanceXLInsertBuffer()
  (called under XLogInsertRecord()) sets XLP_FIRST_IS_ABORTED_PARTIAL
  flag to the page.

While recovery:
- When XLogReadRecord meets a XLP_FIRST_IS_ABORT_PARTIAL page, it
  rereads a record from there.

In this PoC,

1. This patch is written on the current master, but it doesn't
  interfare with the seg-boundary-memorize patch since it removes the
  calls to RegisterSegmentBoundary.

2. Since xlogreader cannot emit a log-message immediately, we don't
  have a means to leave a log message to inform recovery met an
  aborted partial continuation record. (In this PoC, it is done by
  fprintf:p)

3. Myebe we need to pg_waldump to show partial continuation records,
  but I'm not sure how to realize that.

4. By the way, is this (method) applicable in this stage?


The attached first is the PoC including debug-crash aid.  The second
is the repro script.  It failes to reproduce the situation once per
several trials.

The following log messages are shown by a run of the script.

> ...
> TRAP: FailedAssertion("c++ < 1", File: "xlog.c", Line: 2675, PID: 254644)
> ...
> LOG:  database system is shut down
> ...
> 
> LOG:  redo starts at 0/228
> LOG:  redo done at 0/6A8 system usage:...
> LOG:   Recovery finished: ContRecAbort: 0/700 (EndRecPtr: 0/6E8)

The record from 6E8 is missing the trailing part after 700.

> LOG:   EndOfLog=0/700
> LOG:   set XLP_FIRST_IS_ABORT_PARTIAL@0/700

So, WAL insertion starts from 700 and the first page is set the flag.

> LOG:  database system is ready to accept connections
> ...
> LOG:  database system is shut down
> ...
> #
> ...
> LOG:  redo starts at 0/228
> LOG:  consistent recovery state reached at 0/2000100
> ...
> LOG:  restored log file "00010007" from archive
>  aborted partial continuation record found at 0/6E8, continue from 
> 0/700

The record from 6E8 is immature so skip it and continue reading
from 700.

> LOG:  last completed transaction was at log time 2021-09-01 20:40:21.775295+09
> LOG:   Recovery finished: ContRecAbort: 0/0 (EndRecPtr: 0/800)
> LOG:  restored log file "00010007" from archive
> LOG:  selected new timeline ID: 2
> LOG:  archive recovery complete
> LOG:   EndOfLog=0/800

Recovery ends.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/xlog.c 
b/src/backend/access/transam/xlog.c
index 24165ab03e..b0f18e4e5e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -111,6 +111,7 @@ int CommitSiblings = 5;

Re: prevent immature WAL streaming

2021-09-01 Thread Bossart, Nathan

On 9/1/21, 10:06 AM, "Andres Freund"  wrote:
> On 2021-09-01 15:01:43 +0900, Fujii Masao wrote:
>> Thanks for clarifying that! Unless I misunderstand that, when recovery ends
>> at a partially-flushed segment-spanning record, it sets
>> XLP_FIRST_IS_ABORTED_PARTIAL in the next segment before starting writing
>> new WAL data there. Therefore XLP_FIRST_IS_CONTRECORD or
>> XLP_FIRST_IS_ABORTED_PARTIAL must be set in the next segment following
>> a partially-flushed segment-spanning record. If none of them is set,
>> the validation code in recovery should report an error.
>
> Right. With the small addition that I think we shouldn't just do this for
> segment spanning records, but for all records spanning pages.

This approach seems promising.  I like that it avoids adding extra
work in the hot path for writing WAL.  I'm assuming it won't be back-
patchable, though.

Nathan

Re: prevent immature WAL streaming

2021-09-01 Thread Andres Freund

Hi,

On 2021-09-01 15:01:43 +0900, Fujii Masao wrote:
> Thanks for clarifying that! Unless I misunderstand that, when recovery ends
> at a partially-flushed segment-spanning record, it sets
> XLP_FIRST_IS_ABORTED_PARTIAL in the next segment before starting writing
> new WAL data there. Therefore XLP_FIRST_IS_CONTRECORD or
> XLP_FIRST_IS_ABORTED_PARTIAL must be set in the next segment following
> a partially-flushed segment-spanning record. If none of them is set,
> the validation code in recovery should report an error.

Right. With the small addition that I think we shouldn't just do this for
segment spanning records, but for all records spanning pages.

Greetings,

Andres Freund

Re: prevent immature WAL streaming

2021-09-01 Thread Fujii Masao





On 2021/09/01 12:15, Andres Freund wrote:

Hi,

On 2021-09-01 11:34:34 +0900, Fujii Masao wrote:

On 2021/09/01 0:53, Andres Freund wrote:

Of course, we need to be careful to not weaken WAL validity checking too
much. How about the following:

If we're "aborting" a continued record, we set XLP_FIRST_IS_ABORTED_PARTIAL on
the page at which we do so (i.e. the page after the valid end of the WAL).


When do you expect that XLP_FIRST_IS_ABORTED_PARTIAL is set? It's set
when recovery finds a a partially-flushed segment-spanning record?
But maybe we cannot do that (i.e., cannot overwrite the page) because
the page that the flag is set in might have already been archived. No?


I was imagining that XLP_FIRST_IS_ABORTED_PARTIAL would be set in the "tail
end" of a partial record. I.e. if there's a partial record starting in the
successfully archived segment A, but the end of the record, in B, has not been
written to disk before a crash, we'd set XLP_FIRST_IS_ABORTED_PARTIAL at the
end of the valid data in B. Which could not have been archived yet, or we'd
not have a partial record.  So we should never need to set the flag on an
already archived page.


Thanks for clarifying that! Unless I misunderstand that, when recovery ends
at a partially-flushed segment-spanning record, it sets
XLP_FIRST_IS_ABORTED_PARTIAL in the next segment before starting writing
new WAL data there. Therefore XLP_FIRST_IS_CONTRECORD or
XLP_FIRST_IS_ABORTED_PARTIAL must be set in the next segment following
a partially-flushed segment-spanning record. If none of them is set,
the validation code in recovery should report an error.

Yes, this design looks good to me.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: prevent immature WAL streaming

2021-08-31 Thread Kyotaro Horiguchi

At Tue, 31 Aug 2021 20:15:24 -0700, Andres Freund  wrote in 
> Hi,
> 
> On 2021-09-01 11:34:34 +0900, Fujii Masao wrote:
> > On 2021/09/01 0:53, Andres Freund wrote:
> > > Of course, we need to be careful to not weaken WAL validity checking too
> > > much. How about the following:
> > > 
> > > If we're "aborting" a continued record, we set 
> > > XLP_FIRST_IS_ABORTED_PARTIAL on
> > > the page at which we do so (i.e. the page after the valid end of the WAL).
> > 
> > When do you expect that XLP_FIRST_IS_ABORTED_PARTIAL is set? It's set
> > when recovery finds a a partially-flushed segment-spanning record?
> > But maybe we cannot do that (i.e., cannot overwrite the page) because
> > the page that the flag is set in might have already been archived. No?
> 
> I was imagining that XLP_FIRST_IS_ABORTED_PARTIAL would be set in the "tail
> end" of a partial record. I.e. if there's a partial record starting in the
> successfully archived segment A, but the end of the record, in B, has not been
> written to disk before a crash, we'd set XLP_FIRST_IS_ABORTED_PARTIAL at the
> end of the valid data in B. Which could not have been archived yet, or we'd
> not have a partial record.  So we should never need to set the flag on an
> already archived page.

I agree that that makes sense.

Is that that, crash recovery remembers if the last record was an
immature record that looks like continue to the next segment, and if
so, set the flag when inserting the first record, which would be
CHECKPOINT_SHUTDOWN? (and reader deals with it)

I'll try to show how it looks like.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: prevent immature WAL streaming

2021-08-31 Thread Andres Freund

Hi,

On 2021-09-01 11:34:34 +0900, Fujii Masao wrote:
> On 2021/09/01 0:53, Andres Freund wrote:
> > Of course, we need to be careful to not weaken WAL validity checking too
> > much. How about the following:
> > 
> > If we're "aborting" a continued record, we set XLP_FIRST_IS_ABORTED_PARTIAL 
> > on
> > the page at which we do so (i.e. the page after the valid end of the WAL).
> 
> When do you expect that XLP_FIRST_IS_ABORTED_PARTIAL is set? It's set
> when recovery finds a a partially-flushed segment-spanning record?
> But maybe we cannot do that (i.e., cannot overwrite the page) because
> the page that the flag is set in might have already been archived. No?

I was imagining that XLP_FIRST_IS_ABORTED_PARTIAL would be set in the "tail
end" of a partial record. I.e. if there's a partial record starting in the
successfully archived segment A, but the end of the record, in B, has not been
written to disk before a crash, we'd set XLP_FIRST_IS_ABORTED_PARTIAL at the
end of the valid data in B. Which could not have been archived yet, or we'd
not have a partial record.  So we should never need to set the flag on an
already archived page.

Greetings,

Andres Freund

Re: prevent immature WAL streaming

2021-08-31 Thread Fujii Masao





On 2021/09/01 0:53, Andres Freund wrote:

I was thinking that on a normal WAL write we'd do nothing. Instead we would
have dedicated code at the end of recovery that, if the WAL ends in a partial
record, changes the page following the "valid" portion of the WAL to indicate
that an incomplete record is to be skipped.


Agreed!




Of course, we need to be careful to not weaken WAL validity checking too
much. How about the following:

If we're "aborting" a continued record, we set XLP_FIRST_IS_ABORTED_PARTIAL on
the page at which we do so (i.e. the page after the valid end of the WAL).


When do you expect that XLP_FIRST_IS_ABORTED_PARTIAL is set? It's set
when recovery finds a a partially-flushed segment-spanning record?
But maybe we cannot do that (i.e., cannot overwrite the page) because
the page that the flag is set in might have already been archived. No?



I think we can just read the WAL and see if it ends with a partial
record. It'd add a bit of complication to the error checking in xlogreader,
because we'd likely want to treat verification from page headers a bit
different from verification due to record data. But that seems doable.


Yes.



Does this make sense?


Yes, I think!


Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: prevent immature WAL streaming

2021-08-31 Thread Andres Freund

Hi,

On 2021-08-31 09:56:30 -0400, Alvaro Herrera wrote:
> On 2021-Aug-30, Andres Freund wrote:
> > I think a better approach might be to handle this on the WAL layout
> > level. What if we never overwrite partial records but instead just
> > skipped over them during decoding?
>
> Maybe this is a workable approach, let's work it out fully.
>
> Let me see if I understand what you mean:
> * We would remove the logic to inhibit archiving and streaming-
>   replicating the tail end of a split WAL record; that logic deals with
>   bytes only, so doesn't have to be aware of record boundaries.
> * On WAL replay, we ignore records that are split across a segment
>   boundary and whose checksum does not match.
> * On WAL write ... ?

I was thinking that on a normal WAL write we'd do nothing. Instead we would
have dedicated code at the end of recovery that, if the WAL ends in a partial
record, changes the page following the "valid" portion of the WAL to indicate
that an incomplete record is to be skipped.

Of course, we need to be careful to not weaken WAL validity checking too
much. How about the following:

If we're "aborting" a continued record, we set XLP_FIRST_IS_ABORTED_PARTIAL on
the page at which we do so (i.e. the page after the valid end of the WAL).

On a page with XLP_FIRST_IS_ABORTED_PARTIAL we expect a special type of record
to start just after the page header. That record contains sufficient
information for us to verify the validity of the partial record (since its
checksum and length aren't valid, and may not even be all readable if the
record header itself was split). I think it would make sense to include the
LSN of the aborted record, and a checksum of the partial data.

> How do we detect after recovery that a record that was being written,
> and potentially was sent to the archive, needs to be "skipped"?

I think we can just read the WAL and see if it ends with a partial
record. It'd add a bit of complication to the error checking in xlogreader,
because we'd likely want to treat verification from page headers a bit
different from verification due to record data. But that seems doable.

Does this make sense?

Greetings,

Andres Freund

1 2 >

1 - 100 of 118 matches

Mail list logo