Re: Corruption during WAL replay

2022-03-29 Thread Robert Haas
On Tue, Mar 29, 2022 at 12:34 PM Stephen Frost  wrote:
> Anyhow, this whole thread has struck me as a good reason to polish those
> patches off and add on top of them an extended checksum ability, first,
> independent of TDE, and remove the dependency of those patches from the
> TDE effort and instead allow it to just leverage that ability.  I still
> suspect we'll have some folks who will want TDE w/o a per-page nonce and
> that could be an option but we'd be able to support TDE w/ integrity
> pretty easily, which would be fantastic.

Yes, I like that idea. Once we get beyond feature freeze, perhaps we
can try to coordinate to avoid duplication of effort -- or absence of
effort.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Corruption during WAL replay

2022-03-29 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Fri, Mar 25, 2022 at 10:34 AM Tom Lane  wrote:
> > I dunno.  Compatibility and speed concerns aside, that seems like an awful
> > lot of bits to be expending on every page compared to the value.
> 
> I dunno either, but over on the TDE thread people seemed quite willing
> to expend like 16-32 *bytes* for page verifiers and nonces and things.

Absolutely.

> For compatibility and speed reasons, I doubt we could ever get by with
> doing that in every cluster, but I do have some hope of introducing
> something like that someday at least as an optional feature. It's not
> like a 16-bit checksum was state-of-the-art even when we introduced
> it. We just did it because we had 2 bytes that we could repurpose
> relatively painlessly, and not any larger number. And that's still the
> case today, so at least in the short term we will have to choose some
> other solution to this problem.

I agree that this would be great as an optional feature.  Those patches
to allow the system to be built with reserved space for $whatever would
allow us to have a larger checksum for those who want it and perhaps a
nonce with TDE for those who wish that in the future.  I mentioned
before that I thought it might be a good way to introduce page-level
epochs for 64-bit XIDs too, though it never seemed to get much traction.

Anyhow, this whole thread has struck me as a good reason to polish those
patches off and add on top of them an extended checksum ability, first,
independent of TDE, and remove the dependency of those patches from the
TDE effort and instead allow it to just leverage that ability.  I still
suspect we'll have some folks who will want TDE w/o a per-page nonce and
that could be an option but we'd be able to support TDE w/ integrity
pretty easily, which would be fantastic.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Corruption during WAL replay

2022-03-27 Thread Kyotaro Horiguchi
At Thu, 24 Mar 2022 15:33:29 -0400, Robert Haas  wrote 
in 
> On Thu, Mar 17, 2022 at 9:21 PM Kyotaro Horiguchi
>  wrote:
> > All versions pass check world.
> 
> Thanks, committed.

(I was overwhelmed by the flood of discussion that followed..)

Anyway, thanks for picking this up and committing it!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Corruption during WAL replay

2022-03-26 Thread Michael Paquier
On Fri, Mar 25, 2022 at 10:34:49AM -0400, Tom Lane wrote:
> Robert Haas  writes:
>> On Fri, Mar 25, 2022 at 10:02 AM Tom Lane  wrote:
>>> Adding another 16 bits won't get you to that, sadly.  Yeah, it *might*
>>> extend the MTTF to more than the project's likely lifespan, but that
>>> doesn't mean we couldn't get unlucky next week.
> 
>> I suspect that the number of bits Andres wants to add is no less than 48.
> 
> I dunno.  Compatibility and speed concerns aside, that seems like an awful
> lot of bits to be expending on every page compared to the value.

Err.  And there are not that many bits that could be recycled for this
purpose in the current page layout, are there?
--
Michael


signature.asc
Description: PGP signature


Re: Corruption during WAL replay

2022-03-25 Thread Tom Lane
Dagfinn Ilmari Mannsåker  writes:
> LGTM, but it would be good to include $! in the die messages.

Roger, will do.

regards, tom lane




Re: Corruption during WAL replay

2022-03-25 Thread Dagfinn Ilmari Mannsåker
Tom Lane  writes:

> Robert Haas  writes:
>> ... It's not
>> like a 16-bit checksum was state-of-the-art even when we introduced
>> it. We just did it because we had 2 bytes that we could repurpose
>> relatively painlessly, and not any larger number. And that's still the
>> case today, so at least in the short term we will have to choose some
>> other solution to this problem.
>
> Indeed.  I propose the attached, which also fixes the unsafe use
> of seek() alongside syswrite(), directly contrary to what "man perlfunc"
> says to do.

LGTM, but it would be good to include $! in the die messages.
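
I.e., something along these lines (just a sketch of the suggestion, with
the OS error appended to each message):

sysread($file, $pageheader, 24) or die "sysread failed: $!";
sysseek($file, 0, 0)            or die "sysseek failed: $!";
syswrite($file, $pageheader)    or die "syswrite failed: $!";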

- ilmari

>   regards, tom lane
>
> diff --git a/src/bin/pg_checksums/t/002_actions.pl b/src/bin/pg_checksums/t/002_actions.pl
> index 62c608eaf6..8c70453a45 100644
> --- a/src/bin/pg_checksums/t/002_actions.pl
> +++ b/src/bin/pg_checksums/t/002_actions.pl
> @@ -24,6 +24,7 @@ sub check_relation_corruption
>   my $tablespace = shift;
>   my $pgdata = $node->data_dir;
>  
> + # Create table and discover its filesystem location.
>   $node->safe_psql(
>   'postgres',
>   "CREATE TABLE $table AS SELECT a FROM generate_series(1,1) 
> AS a;
> @@ -37,9 +38,6 @@ sub check_relation_corruption
>   my $relfilenode_corrupted = $node->safe_psql('postgres',
>   "SELECT relfilenode FROM pg_class WHERE relname = '$table';");
>  
> - # Set page header and block size
> - my $pageheader_size = 24;
> - my $block_size = $node->safe_psql('postgres', 'SHOW block_size;');
>   $node->stop;
>  
>   # Checksums are correct for single relfilenode as the table is not
> @@ -55,8 +53,12 @@ sub check_relation_corruption
>  
>   # Time to create some corruption
>   open my $file, '+<', "$pgdata/$file_corrupted";
> - seek($file, $pageheader_size, SEEK_SET);
> - syswrite($file, "\0\0\0\0\0\0\0\0\0");
> + my $pageheader;
> + sysread($file, $pageheader, 24) or die "sysread failed";
> + # This inverts the pd_checksum field (only); see struct PageHeaderData
> + $pageheader ^= "\0\0\0\0\0\0\0\0\xff\xff";
> + sysseek($file, 0, 0) or die "sysseek failed";
> + syswrite($file, $pageheader) or die "syswrite failed";
>   close $file;
>  
>   # Checksum checks on single relfilenode fail




Re: Corruption during WAL replay

2022-03-25 Thread Tom Lane
Andres Freund  writes:
> The same code also exists in src/bin/pg_basebackup/t/010_pg_basebackup.pl,
> which presumably has the same collision risks.

Oooh, I missed that.

> Perhaps we should put a
> function into Cluster.pm and use it from both?

+1, I'll make it so.

regards, tom lane




Re: Corruption during WAL replay

2022-03-25 Thread Andres Freund
Hi,

On 2022-03-25 11:50:48 -0400, Tom Lane wrote:
> Robert Haas  writes:
> > ... It's not
> > like a 16-bit checksum was state-of-the-art even when we introduced
> > it. We just did it because we had 2 bytes that we could repurpose
> > relatively painlessly, and not any larger number. And that's still the
> > case today, so at least in the short term we will have to choose some
> > other solution to this problem.
> 
> Indeed.  I propose the attached, which also fixes the unsafe use
> of seek() alongside syswrite(), directly contrary to what "man perlfunc"
> says to do.

That looks reasonable. Although I wonder if we lose something by not testing
the influence of the rest of the block - but I don't really see anything.

The same code also exists in src/bin/pg_basebackup/t/010_pg_basebackup.pl,
which presumably has the same collision risks. Perhaps we should put a
function into Cluster.pm and use it from both?
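
A rough sketch of what such a shared helper could look like (the name
corrupt_page_checksum and its exact signature are only assumptions here,
for illustration):

# Hypothetical PostgreSQL::Test::Cluster method: flip all bits of the
# pd_checksum field of the page at $offset in data file $file under this
# node's data directory, leaving the rest of the page untouched.
sub corrupt_page_checksum
{
	my ($self, $file, $offset) = @_;
	my $path = $self->data_dir . "/$file";

	open my $fh, '+<', $path or die "open($path) failed: $!";
	binmode $fh;
	sysseek($fh, $offset, 0)         or die "sysseek failed: $!";
	sysread($fh, my $pageheader, 24) or die "sysread failed: $!";
	# pd_checksum is the two bytes at offset 8 of PageHeaderData;
	# inverting it guarantees a mismatch whatever the page contains.
	$pageheader ^= "\0\0\0\0\0\0\0\0\xff\xff";
	sysseek($fh, $offset, 0)   or die "sysseek failed: $!";
	syswrite($fh, $pageheader) or die "syswrite failed: $!";
	close $fh or die "close($path) failed: $!";
	return;
}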

Greetings,

Andres Freund




Re: Corruption during WAL replay

2022-03-25 Thread Tom Lane
Robert Haas  writes:
> ... It's not
> like a 16-bit checksum was state-of-the-art even when we introduced
> it. We just did it because we had 2 bytes that we could repurpose
> relatively painlessly, and not any larger number. And that's still the
> case today, so at least in the short term we will have to choose some
> other solution to this problem.

Indeed.  I propose the attached, which also fixes the unsafe use
of seek() alongside syswrite(), directly contrary to what "man perlfunc"
says to do.

regards, tom lane

diff --git a/src/bin/pg_checksums/t/002_actions.pl b/src/bin/pg_checksums/t/002_actions.pl
index 62c608eaf6..8c70453a45 100644
--- a/src/bin/pg_checksums/t/002_actions.pl
+++ b/src/bin/pg_checksums/t/002_actions.pl
@@ -24,6 +24,7 @@ sub check_relation_corruption
 	my $tablespace = shift;
 	my $pgdata = $node->data_dir;
 
+	# Create table and discover its filesystem location.
 	$node->safe_psql(
 		'postgres',
 		"CREATE TABLE $table AS SELECT a FROM generate_series(1,1) AS a;
@@ -37,9 +38,6 @@ sub check_relation_corruption
 	my $relfilenode_corrupted = $node->safe_psql('postgres',
 		"SELECT relfilenode FROM pg_class WHERE relname = '$table';");
 
-	# Set page header and block size
-	my $pageheader_size = 24;
-	my $block_size = $node->safe_psql('postgres', 'SHOW block_size;');
 	$node->stop;
 
 	# Checksums are correct for single relfilenode as the table is not
@@ -55,8 +53,12 @@ sub check_relation_corruption
 
 	# Time to create some corruption
 	open my $file, '+<', "$pgdata/$file_corrupted";
-	seek($file, $pageheader_size, SEEK_SET);
-	syswrite($file, "\0\0\0\0\0\0\0\0\0");
+	my $pageheader;
+	sysread($file, $pageheader, 24) or die "sysread failed";
+	# This inverts the pd_checksum field (only); see struct PageHeaderData
+	$pageheader ^= "\0\0\0\0\0\0\0\0\xff\xff";
+	sysseek($file, 0, 0) or die "sysseek failed";
+	syswrite($file, $pageheader) or die "syswrite failed";
 	close $file;
 
 	# Checksum checks on single relfilenode fail


Re: Corruption during WAL replay

2022-03-25 Thread Robert Haas
On Fri, Mar 25, 2022 at 10:34 AM Tom Lane  wrote:
> I dunno.  Compatibility and speed concerns aside, that seems like an awful
> lot of bits to be expending on every page compared to the value.

I dunno either, but over on the TDE thread people seemed quite willing
to expend like 16-32 *bytes* for page verifiers and nonces and things.
For compatibility and speed reasons, I doubt we could ever get by with
doing that in every cluster, but I do have some hope of introducing
something like that someday at least as an optional feature. It's not
like a 16-bit checksum was state-of-the-art even when we introduced
it. We just did it because we had 2 bytes that we could repurpose
relatively painlessly, and not any larger number. And that's still the
case today, so at least in the short term we will have to choose some
other solution to this problem.

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: Corruption during WAL replay

2022-03-25 Thread Tom Lane
Robert Haas  writes:
> On Fri, Mar 25, 2022 at 10:02 AM Tom Lane  wrote:
>> Adding another 16 bits won't get you to that, sadly.  Yeah, it *might*
>> extend the MTTF to more than the project's likely lifespan, but that
>> doesn't mean we couldn't get unlucky next week.

> I suspect that the number of bits Andres wants to add is no less than 48.

I dunno.  Compatibility and speed concerns aside, that seems like an awful
lot of bits to be expending on every page compared to the value.

regards, tom lane




Re: Corruption during WAL replay

2022-03-25 Thread Robert Haas
On Fri, Mar 25, 2022 at 10:02 AM Tom Lane  wrote:
> Robert Haas  writes:
> > On Fri, Mar 25, 2022 at 9:49 AM Tom Lane  wrote:
> >> That'll just reduce the probability of failure, not eliminate it.
>
> > I mean, if the expected time to the first failure on even 1 machine
> > exceeds the time until the heat death of the universe by 10 orders of
> > magnitude, it's probably good enough.
>
> Adding another 16 bits won't get you to that, sadly.  Yeah, it *might*
> extend the MTTF to more than the project's likely lifespan, but that
> doesn't mean we couldn't get unlucky next week.

I suspect that the number of bits Andres wants to add is no less than 48.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Corruption during WAL replay

2022-03-25 Thread Robert Haas
On Fri, Mar 25, 2022 at 2:07 AM Andres Freund  wrote:
> We really ought to find a way to get to wider checksums :/

Eh, let's just use longer names for the buildfarm animals and call it good. :-)

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Corruption during WAL replay

2022-03-25 Thread Tom Lane
Robert Haas  writes:
> On Fri, Mar 25, 2022 at 9:49 AM Tom Lane  wrote:
>> That'll just reduce the probability of failure, not eliminate it.

> I mean, if the expected time to the first failure on even 1 machine
> exceeds the time until the heat death of the universe by 10 orders of
> magnitude, it's probably good enough.

Adding another 16 bits won't get you to that, sadly.  Yeah, it *might*
extend the MTTF to more than the project's likely lifespan, but that
doesn't mean we couldn't get unlucky next week.

regards, tom lane




Re: Corruption during WAL replay

2022-03-25 Thread Robert Haas
On Fri, Mar 25, 2022 at 9:49 AM Tom Lane  wrote:
> That'll just reduce the probability of failure, not eliminate it.

I mean, if the expected time to the first failure on even 1 machine
exceeds the time until the heat death of the universe by 10 orders of
magnitude, it's probably good enough.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Corruption during WAL replay

2022-03-25 Thread Tom Lane
Andres Freund  writes:
> On 2022-03-25 01:38:45 -0400, Tom Lane wrote:
>> AFAICS, this strategy of whacking a predetermined chunk of the page with
>> a predetermined value is going to fail 1-out-of-64K times.

> Yea. I suspect that the way the modifications and checksumming are done makes
> the chance actually higher than 1/64k. But even if it actually is 1/64k, it's not
> great to wait for (#animals * #catalog-changes) to approach a decent
> percentage of 1/64k.

Exactly.

> I was curious whether there have been similar issues in the past. Querying
> the buildfarm logs suggests not, at least not in the pg_checksums test.

That test has only been there since 2018 (b34e84f16).  We've probably
accumulated a couple hundred initial-catalog-contents changes since
then, so maybe this failure arrived right on schedule :-(.

> We really ought to find a way to get to wider checksums :/

That'll just reduce the probability of failure, not eliminate it.

regards, tom lane




Re: Corruption during WAL replay

2022-03-25 Thread Andres Freund
Hi,

On 2022-03-25 01:38:45 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > Not sure what to do here... I guess we can just change the value we 
> > overwrite
> > the page with and hope to not hit this again? But that feels deeply deeply
> > unsatisfying.
> 
> AFAICS, this strategy of whacking a predetermined chunk of the page with
> a predetermined value is going to fail 1-out-of-64K times.

Yea. I suspect that the way the modifications and checksumming are done makes
the chance actually higher than 1/64k. But even if it actually is 1/64k, it's not
great to wait for (#animals * #catalog-changes) to approach a decent
percentage of 1/64k.


I was curious whether there have been similar issues in the past. Querying
the buildfarm logs suggests not, at least not in the pg_checksums test.


> We have to change the test so that it's guaranteed to produce an invalid
> checksum.  Inverting just the checksum field, without doing anything else,
> would do that ... but that feels pretty unsatisfying too.

We really ought to find a way to get to wider checksums :/

Greetings,

Andres Freund




Re: Corruption during WAL replay

2022-03-24 Thread Tom Lane
Andres Freund  writes:
> Ah, and that's finally also the explanation why I couldn't reproduce the
> failure in a different directory, with an otherwise identically configured
> PG: The length of the path to the tablespace influences the size of the
> XLOG_TBLSPC_CREATE record.

Ohhh ... yeah, that could explain a lot of cross-animal variation.

> Not sure what to do here... I guess we can just change the value we overwrite
> the page with and hope to not hit this again? But that feels deeply deeply
> unsatisfying.

AFAICS, this strategy of whacking a predetermined chunk of the page with
a predetermined value is going to fail 1-out-of-64K times.  We have to
change the test so that it's guaranteed to produce an invalid checksum.
Inverting just the checksum field, without doing anything else, would
do that ... but that feels pretty unsatisfying too.

regards, tom lane




Re: Corruption during WAL replay

2022-03-24 Thread Andres Freund
Hi,

On 2022-03-25 01:23:00 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > I do see that the LSN that ends up on the page is the same across a few runs
> > of the test on serinus. Which presumably differs between different
> > animals. Surprised that it's this predictable - but I guess the run is short
> > enough that there's no variation due to autovacuum, checkpoints etc.
> 
> Uh-huh.  I'm not surprised that it's repeatable on a given animal.
> What remains to be explained:
> 
> 1. Why'd it start failing now?  I'm guessing that ce95c5437 *was* the
> culprit after all, by slightly changing the amount of catalog data
> written during initdb, and thus moving the initial LSN.

Yep, verified that (see mail I just sent).


> 2. Why just these two animals?  If initial LSN is the critical thing,
> then the results of "locale -a" would affect it, so platform
> dependence is hardly surprising ... but I'd have thought that all
> the animals on that host would use the same initial set of
> collations.

I think it's the animal's name that makes the difference, due to the
tablespace path length thing. And while I was confused for a second by

petalura
pogona
serinus
dragonet

failing, despite different name lengths, it still makes sense: We MAXALIGN the
start of records. Which explains why flaviventris didn't fail the same way.


> As for a fix, would damaging more of the page help?  I guess
> it'd just move around the one-in-64K chance of failure.

As I wrote in the other email, I think spreading the changes out wider might
help. But it's still not great. However:

> Maybe we have to intentionally corrupt (e.g. invert) the
> checksum field specifically.

seems like it'd do the trick? Even a single bit change of the checksum ought
to do, as long as we don't set it to 0.

Greetings,

Andres Freund




Re: Corruption during WAL replay

2022-03-24 Thread Andres Freund
Hi,

On 2022-03-24 21:54:38 -0700, Andres Freund wrote:
> I do see that the LSN that ends up on the page is the same across a few runs
> of the test on serinus. Which presumably differs between different
> animals. Surprised that it's this predictable - but I guess the run is short
> enough that there's no variation due to autovacuum, checkpoints etc.

This actually explains how the issue could start to be visible with
ce95c543763. It changes the amount of WAL initdb generates and therefore
influences what LSN the page ends up with.  I've verified that the failing
test is reproducible with ce95c543763, but not its parent 7dac61402e3. While
of course ce95c543763 isn't "actually responsible".

Ah, and that's finally also the explanation why I couldn't reproduce the
failure in a different directory, with an otherwise identically configured
PG: The length of the path to the tablespace influences the size of the
XLOG_TBLSPC_CREATE record.


Not sure what to do here... I guess we can just change the value we overwrite
the page with and hope to not hit this again? But that feels deeply deeply
unsatisfying.

Perhaps it would be enough to write into multiple parts of the page? I am very
much not a cryptography expert, but given the way pg_checksum_block() works, it
looks to me like "multiple" changes within a 16-byte chunk have a smaller
influence on the overall result than the same "amount" of changes to separate
16-byte chunks.
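
For intuition, here's a throwaway Perl toy with roughly the same overall
shape -- fold the page into a 32-bit sum, mix in the block number, reduce
to 16 bits -- but without pg_checksum_block()'s parallel per-chunk sums or
its exact constants; treat it only as an illustration of the 1-in-64K
failure mode, not as the real algorithm:

use strict;
use warnings;

# Toy model only -- NOT pg_checksum_page(); seed and mixing are stand-ins,
# and the exact integer arithmetic assumes a 64-bit perl.
sub toy_page_checksum
{
	my ($page, $blkno) = @_;
	my $sum = 0x811C9DC5;	# arbitrary nonzero seed
	for my $word (unpack('V*', $page))	# 32-bit little-endian words
	{
		$sum = ((($sum ^ $word) * 16777619) ^ ($sum >> 17)) & 0xFFFFFFFF;
	}
	$sum ^= $blkno;	# only the block number is mixed in
	return ($sum % 65535) + 1;	# 16 bits, never zero
}

my $page = "\0" x 8192;
printf "block 0: %u, block 1: %u\n",
	toy_page_checksum($page, 0), toy_page_checksum($page, 1);

With only ~64K possible values, a fixed overwrite of a fixed part of the
page leaves the stored checksum accidentally valid roughly once per 65536
distinct page images, which seems to be about what the buildfarm hit.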


I might have to find a store still selling strong beverages at this hour.

- Andres




Re: Corruption during WAL replay

2022-03-24 Thread Tom Lane
Andres Freund  writes:
> I do see that the LSN that ends up on the page is the same across a few runs
> of the test on serinus. Which presumably differs between different
> animals. Surprised that it's this predictable - but I guess the run is short
> enough that there's no variation due to autovacuum, checkpoints etc.

Uh-huh.  I'm not surprised that it's repeatable on a given animal.
What remains to be explained:

1. Why'd it start failing now?  I'm guessing that ce95c5437 *was* the
culprit after all, by slightly changing the amount of catalog data
written during initdb, and thus moving the initial LSN.

2. Why just these two animals?  If initial LSN is the critical thing,
then the results of "locale -a" would affect it, so platform
dependence is hardly surprising ... but I'd have thought that all
the animals on that host would use the same initial set of
collations.  OTOH, I see petalura and pogona just fell over too.
Do you have some of those animals --with-icu and others not?

> 16bit checksums for the win.

Yay :-(

As for a fix, would damaging more of the page help?  I guess
it'd just move around the one-in-64K chance of failure.
Maybe we have to intentionally corrupt (e.g. invert) the
checksum field specifically.

regards, tom lane




Re: Corruption during WAL replay

2022-03-24 Thread Andres Freund
Hi,

On 2022-03-25 00:08:20 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > The only thing I can really conclude here is that we apparently end up with
> > the same checksum for exactly the modifications we are doing? Just on those
> > two damn instances? Reliably?
> 
> IIRC, the table's OID or relfilenode enters into the checksum.
> Could it be that assigning a specific OID to the table allows
> this to happen, and these two animals are somehow assigning
> that OID while others are using some slightly different OID?

It's just the block number that goes into it.

I do see that the LSN that ends up on the page is the same across a few runs
of the test on serinus. Which presumably differs between different
animals. Surprised that it's this predictable - but I guess the run is short
enough that there's no variation due to autovacuum, checkpoints etc.

If I add a 'SELECT txid_current()' before the CREATE TABLE in
check_relation_corruption(), the test doesn't fail anymore, because there's an
additional WAL record.

16bit checksums for the win.

Greetings,

Andres Freund




Re: Corruption during WAL replay

2022-03-24 Thread Tom Lane
Andres Freund  writes:
> The only thing I can really conclude here is that we apparently end up with
> the same checksum for exactly the modifications we are doing? Just on those
> two damn instances? Reliably?

IIRC, the table's OID or relfilenode enters into the checksum.
Could it be that assigning a specific OID to the table allows
this to happen, and these two animals are somehow assigning
that OID while others are using some slightly different OID?

regards, tom lane




Re: Corruption during WAL replay

2022-03-24 Thread Andres Freund
Hi,

On 2022-03-24 19:43:02 -0700, Andres Freund wrote:
> Just to be sure I'm going to clean out serinus' ccache dir and rerun. I'll
> leave dragonet's alone for now.

Turns out they had the same dir. But it didn't help.

I haven't yet figured out why, but I now *am* able to reproduce the problem in
the buildfarm built tree.  Wonder if there's a path length issue or such
somewhere?

Either way, I can now manipulate the tests and still repro. I made the test
abort after the first failure.

hexedit shows that the file is modified, as we'd expect:
0000   00 00 00 00  C0 01 5B 01  16 7D 00 00  A0 03 C0 03  00 20 04 20  00 00 00 00  00 00 00 00  00 00 00 00  ..[..}... . 
0020   00 9F 38 00  80 9F 38 00  60 9F 38 00  40 9F 38 00  20 9F 38 00  00 9F 38 00  E0 9E 38 00  C0 9E 38 00  ..8...8.`.8.@.8. .8...8...8...8.

And we are checking the right file:

bf@andres-postgres-edb-buildfarm-v1:~/build/buildfarm-serinus/HEAD/pgsql.build$ tmp_install/home/bf/build/buildfarm-serinus/HEAD/inst/bin/pg_checksums --check -D /home/bf/build/buildfarm-serinus/HEAD/pgsql.build/src/bin/pg_checksums/tmp_check/t_002_actions_node_checksum_data/pgdata --filenode 16391 -v
pg_checksums: checksums verified in file "/home/bf/build/buildfarm-serinus/HEAD/pgsql.build/src/bin/pg_checksums/tmp_check/t_002_actions_node_checksum_data/pgdata/pg_tblspc/16387/PG_15_202203241/5/16391"
Checksum operation completed
Files scanned:   1
Blocks scanned:  45
Bad checksums:  0
Data checksum version: 1

If I twiddle further bits, I see that page failing checksum verification, as
expected.

I made the script copy the file before twiddling it around:
0000   00 00 00 00  C0 01 5B 01  16 7D 00 00  A0 03 C0 03  00 20 04 20  00 00 00 00  E0 9F 38 00  C0 9F 38 00  ..[..}... . ..8...8.
0020   A0 9F 38 00  80 9F 38 00  60 9F 38 00  40 9F 38 00  20 9F 38 00  00 9F 38 00  E0 9E 38 00  C0 9E 38 00  ..8...8.`.8.@.8. .8...8...8...8.

So it's indeed modified.


The only thing I can really conclude here is that we apparently end up with
the same checksum for exactly the modifications we are doing? Just on those
two damn instances? Reliably?


Gotta make some food. Suggestions what exactly to look at welcome.


Greetings,

Andres Freund

PS: I should really rename the hostname of that machine one of these days...




Re: Corruption during WAL replay

2022-03-24 Thread Thomas Munro
On Fri, Mar 25, 2022 at 3:35 PM Andres Freund  wrote:
> So I'm not sure how much can be gleaned from raw "failure counts" without
> taking the number of runs into account as well?

Ah, right, it does indeed hold the record for most runs in 3 months,
and taking runs into account its "stats" failure rate is clustered
with mandrill and seawasp.  Anyway, clearly not relevant because
dragonet doesn't even show up in the list.

animal | runs | stats_test_fail_fraction
---+--+--
 mandrill  |  158 |   0.0126582278481013
 seawasp   |   85 |   0.0117647058823529
 serinus   | 1299 |   0.0107775211701309
 sungazer  |  174 |  0.00574712643678161
 flaviventris  | 1292 |  0.00464396284829721
 bonito|  313 |  0.00319488817891374
 crake |  743 |  0.00134589502018843




Re: Corruption during WAL replay

2022-03-24 Thread Andres Freund
Hi,

On 2022-03-24 19:20:10 -0700, Andres Freund wrote:
> I forced a run while writing the other email, with keep_error_whatnot, and I
> just saw it failing... Looking whether there's anything interesting to glean.

Unfortunately the test drops the table and it doesn't report the filepath of
the failure. So I haven't learned much from the data dir so far.


I still don't see a failure when running the tests in a separate source
tree. Can't explain that. Going to try to get closer to the buildfarm script
run - it'd be a whole lot easier to be able to edit the source of the test and
reproduce...

Just to be sure I'm going to clean out serinus' ccache dir and rerun. I'll
leave dragonet's alone for now.

Greetings,

Andres Freund




Re: Corruption during WAL replay

2022-03-24 Thread Andres Freund
Hi,

On 2022-03-25 15:23:24 +1300, Thomas Munro wrote:
> One random thing I've noticed about serinus is that it seems to drop
> UDP packets more than others, but dragonet apparently doesn't:

Serinus is built with optimization. Which I guess could lead to other backends
reporting stats more quickly? And of course could lead to running more often
(due to finishing before the next cron invocation). I think I've also
configured my animals to run more often than many other owners do.

So I'm not sure how much can be gleaned from raw "failure counts" without
taking the number of runs into account as well?

- Andres




Re: Corruption during WAL replay

2022-03-24 Thread Thomas Munro
On Fri, Mar 25, 2022 at 3:14 PM Andres Freund  wrote:
> On 2022-03-24 21:22:38 -0400, Tom Lane wrote:
> > serinus is 0-for-3 since then, and dragonet 0-for-4, so we can be pretty
> > confident that the failure is repeatable for them.
>
> That's weird. They run on the same host, but otherwise they have very little
> in common. There's plenty other animals running on the same machine that
> didn't report errors.

One random thing I've noticed about serinus is that it seems to drop
UDP packets more than others, but dragonet apparently doesn't:

tmunro=> select animal, count(*) from run where result = 'FAILURE' and
'stats' = any(fail_tests) and snapshot > now() - interval '3 month'
group by 1 order by 2 desc;
animal| count
--+---
 serinus  |14
 flaviventris | 6
 mandrill | 2
 bonito   | 1
 seawasp  | 1
 crake| 1
 sungazer | 1
(7 rows)

Example: 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2022-03-24%2001:00:14




Re: Corruption during WAL replay

2022-03-24 Thread Andres Freund
Hi,

On 2022-03-24 21:59:08 -0400, Tom Lane wrote:
> Another thing that seems quite baffling, but is becoming clearer by
> the hour, is that only serinus and dragonet are seeing this failure.
> How is that?  They're not very similarly configured --- one is gcc,
> one clang, and one uses jit and one doesn't.  They do share the same
> perl version, 5.34.0; but so do twenty-three other animals, many of
> which have reported in cleanly.  I'm at a loss to explain that.
> Andres, can you think of anything that's peculiar to those two
> animals?

No, I'm quite baffled myself. As I noted in an email I just sent, before
reading this one, I can't explain it, and at least in simple attempts, can't
reproduce it either. And there are animals much closer to each other than
those two...

I forced a run while writing the other email, with keep_error_whatnot, and I
just saw it failing... Looking whether there's anything interesting to glean.

Greetings,

Andres Freund




Re: Corruption during WAL replay

2022-03-24 Thread Andres Freund
Hi,

On 2022-03-24 21:22:38 -0400, Tom Lane wrote:
> serinus is 0-for-3 since then, and dragonet 0-for-4, so we can be pretty
> confident that the failure is repeatable for them.

That's weird. They run on the same host, but otherwise they have very little
in common. There's plenty other animals running on the same machine that
didn't report errors.

I copied serinus' configuration, ran the tests repeatedly, without reproducing
the failure so far. Odd.

Combined with the replslot failure I'd be prepared to think the machine has
issues, except that the replslot thing triggered on other machines too.


I looked through logs on the machine without finding anything indicating
something odd.

I turned on keep_error_builds for serinus. Hopefully that'll leave us with
on-disk files to inspect.

Greetings,

Andres Freund




Re: Corruption during WAL replay

2022-03-24 Thread Tom Lane
I wrote:
> ... So that leaves 7dac61402e, which did this to
> the test script that's failing:
 
>  use strict;
>  use warnings;
> -use Config;
>  use PostgreSQL::Test::Cluster;
>  use PostgreSQL::Test::Utils;

> Discuss.

Another thing that seems quite baffling, but is becoming clearer by
the hour, is that only serinus and dragonet are seeing this failure.
How is that?  They're not very similarly configured --- one is gcc,
one clang, and one uses jit and one doesn't.  They do share the same
perl version, 5.34.0; but so do twenty-three other animals, many of
which have reported in cleanly.  I'm at a loss to explain that.
Andres, can you think of anything that's peculiar to those two
animals?

regards, tom lane




Re: Corruption during WAL replay

2022-03-24 Thread Andres Freund
Hi,

On 2022-03-24 20:39:27 -0400, Robert Haas wrote:
> But that leaves me even more confused. How can a change to only the server
> code cause a client utility to fail to detect corruption that is being
> created by Perl while the server is stopped?

I guess it could somehow cause the first page to be all zeroes, in which case
overwriting it with more zeroes wouldn't cause a problem that pg_checksums can
see?  But I have a somewhat more realistic idea:

I'm suspicious of pg_checksums --filenode. If I understand correctly
--filenode scans the data directory, including all tablespaces, for a file
matching that filenode. If we somehow end up with a leftover file in the
pre-ALTER TABLE SET TABLESPACE location, it'd not notice that there *also* is a
file in a different place?

Perhaps the --filenode mode should print out the file location...


Randomly noticed: The test fetches the block size without doing anything with
it afaics.

Andres




Re: Corruption during WAL replay

2022-03-24 Thread Tom Lane
Robert Haas  writes:
> I hate to say "no" because the evidence suggests that the answer might
> be "yes" -- but it definitely isn't intending to change anything about
> the shutdown sequence. It just introduces a mechanism to backends to
> force the checkpointer to delay writing the checkpoint record.

Wait a minute, I think we may be barking up the wrong tree.

The three commits that serinus saw as new in its first failure were

ce95c54376 Thu Mar 24 20:33:13 2022 UTC  Fix pg_statio_all_tables view for 
multiple TOAST indexes. 
7dac61402e Thu Mar 24 19:51:40 2022 UTC  Remove unused module imports from TAP 
tests 
412ad7a556 Thu Mar 24 18:52:28 2022 UTC  Fix possible recovery trouble if 
TRUNCATE overlaps a checkpoint.

I failed to look closely at dragonet, but I now see that its
first failure saw

ce95c54376 Thu Mar 24 20:33:13 2022 UTC  Fix pg_statio_all_tables view for 
multiple TOAST indexes. 
7dac61402e Thu Mar 24 19:51:40 2022 UTC  Remove unused module imports from TAP 
tests

serinus is 0-for-3 since then, and dragonet 0-for-4, so we can be pretty
confident that the failure is repeatable for them.  That means that the
culprit must be ce95c54376 or 7dac61402e, not anything nearby such as
412ad7a556.

It's *really* hard to see how the pg_statio_all_tables change could
have affected this.  So that leaves 7dac61402e, which did this to
the test script that's failing:
 
 use strict;
 use warnings;
-use Config;
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;

Discuss.

regards, tom lane




Re: Corruption during WAL replay

2022-03-24 Thread Robert Haas
On Thu, Mar 24, 2022 at 8:45 PM Tom Lane  wrote:
> Hmm, I'd supposed that the failing test cases were new as of 412ad7a55.
> Now I see they're not, which indeed puts quite a different spin on
> things.  Your thought about maybe the server isn't shut down yet is
> interesting --- did 412ad7a55 touch anything about the shutdown
> sequence?

I hate to say "no" because the evidence suggests that the answer might
be "yes" -- but it definitely isn't intending to change anything about
the shutdown sequence. It just introduces a mechanism to backends to
force the checkpointer to delay writing the checkpoint record.

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: Corruption during WAL replay

2022-03-24 Thread Tom Lane
Robert Haas  writes:
> And ... right after hitting send, I see that the recovery check
> failures are under separate troubleshooting and thus probably
> unrelated.

Yeah, we've been chasing those for months.

> But that leaves me even more confused. How can a change to
> only the server code cause a client utility to fail to detect
> corruption that is being created by Perl while the server is stopped?

Hmm, I'd supposed that the failing test cases were new as of 412ad7a55.
Now I see they're not, which indeed puts quite a different spin on
things.  Your thought about maybe the server isn't shut down yet is
interesting --- did 412ad7a55 touch anything about the shutdown
sequence?

regards, tom lane




Re: Corruption during WAL replay

2022-03-24 Thread Robert Haas
On Thu, Mar 24, 2022 at 8:37 PM Robert Haas  wrote:
> Any ideas?

And ... right after hitting send, I see that the recovery check
failures are under separate troubleshooting and thus probably
unrelated. But that leaves me even more confused. How can a change to
only the server code cause a client utility to fail to detect
corruption that is being created by Perl while the server is stopped?

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: Corruption during WAL replay

2022-03-24 Thread Robert Haas
On Thu, Mar 24, 2022 at 6:04 PM Tom Lane  wrote:
> Robert Haas  writes:
> > Thanks, committed.
>
> Some of the buildfarm is seeing failures in the pg_checksums test.

Hmm. So the tests seem to be failing because 002_actions.pl stops the
database cluster, runs pg_checksums (which passes), writes some zero
bytes over the line pointer array of the first block of pg_class, and
then runs pg_checksums again. In the failing buildfarm runs,
pg_checksums fails to detect the corruption: the second run succeeds,
while pg_checksums expects it to fail. That's pretty curious, because
if the database cluster is stopped, and things are OK at that point,
then how could a server bug of any kind cause a Perl script to be
unable to corrupt a file on disk?

A possible clue is that I also see a few machines failing in
recoveryCheck. And the code that is failing there looks like this:

# We've seen occasional cases where multiple walsender pids are active. An
# immediate shutdown may hide evidence of a locking bug. So if multiple
# walsenders are observed, shut down in fast mode, and collect some more
# information.
if (not like($senderpid, qr/^[0-9]+$/, "have walsender pid $senderpid"))
{
my ($stdout, $stderr);
$node_primary3->psql('postgres',
 "\\a\\t\nSELECT * FROM pg_stat_activity",
 stdout => \$stdout, stderr => \$stderr);
diag $stdout, $stderr;
$node_primary3->stop('fast');
$node_standby3->stop('fast');
die "could not determine walsender pid, can't continue";
}

And the failure looks like this:

#   Failed test 'have walsender pid 1047504
# 1047472'
#   at t/019_replslot_limit.pl line 343.

That sure looks like there are multiple walsender PIDs active, and the
pg_stat_activity output confirms it. 1047504 is running
START_REPLICATION SLOT "rep3" 0/70 TIMELINE 1 and 1047472 is
running START_REPLICATION SLOT "pg_basebackup_1047472" 0/60
TIMELINE 1.

Both of these failures could possibly be explained by some failure of
things to shut down properly, but it's not the same things. In the
first case, the database server would have had to still be running
after we run $node->stop, and it would have had to overwrite the bad
contents of pg_class with some good contents. In the second case, the
cluster's supposed to still be running, but the backends that were
creating those replication slots should have exited sooner.

I've been running the pg_checksums test in a loop here for a bit now
in the hopes of being able to reproduce the failure, but it doesn't
seem to want to fail here. And I've also looked over the commit and I
can't quite see how it would cause a process, or the cluster, to fail
to shutdown, unless perhaps it's the checkpointer that gets stuck, but
that doesn't really seem to match the symptoms.

Any ideas?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Corruption during WAL replay

2022-03-24 Thread Tom Lane
Robert Haas  writes:
> Thanks, committed.

Some of the buildfarm is seeing failures in the pg_checksums test.

regards, tom lane




Re: Corruption during WAL replay

2022-03-24 Thread Robert Haas
On Thu, Mar 17, 2022 at 9:21 PM Kyotaro Horiguchi
 wrote:
> Finally, no two branches from 10 to 14 accept the same patch.
>
> As a cross-version check, I compared all combinations of the patches
> for two adjacent versions and confirmed that no hunks are lost.
>
> All versions pass check world.

Thanks, committed.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Corruption during WAL replay

2022-03-17 Thread Kyotaro Horiguchi
At Wed, 16 Mar 2022 10:14:56 -0400, Robert Haas  wrote 
in 
> Hmm. I think the last two instances of "buffers" in this comment
> should actually say "blocks".

Ok. I replaced them with "blocks" and it looks nicer. Thanks!

> > I'll try that; if you are already working on it, please inform me. (It
> > is more than likely too late already..)
> 
> If you want to take a crack at that, I'd be delighted.

Finally, no two branches from 10 to 14 accept the same patch.

As a cross-version check, I compared all combinations of the patches
for two adjacent versions and confirmed that no hunks are lost.

All versions pass check world.


The differences between each pair of adjacent versions are as follows.

master->14:

 A hunk fails due to the change in how to access rel->rd_smgr.

14->13:

  Several hunks fail due to simple context differences.

13->12:

 Many hunks fail due to the migration of delayChkpt from PGPROC to
 PGXACT and the context difference due to the change of FSM truncation
 logic in RelationTruncate.

12->11:

 Several hunks fail due to the removal of the volatile qualifier from
 pointers to PGPROC/PGXACT.

11->10:

 A hunk fails due to a context difference caused by the additional
 member tempNamespaceId of PGPROC.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From c88858d3e5681005ba0396b7e7ebcde4322b3308 Mon Sep 17 00:00:00 2001
From: Robert Haas 
Date: Tue, 15 Mar 2022 12:29:14 -0400
Subject: [PATCH] Fix possible recovery trouble if TRUNCATE overlaps a
 checkpoint.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

If TRUNCATE causes some buffers to be invalidated and thus the
checkpoint does not flush them, TRUNCATE must also ensure that the
corresponding files are truncated on disk. Otherwise, a replay
from the checkpoint might find that the buffers exist but have
the wrong contents, which may cause replay to fail.

Report by Teja Mupparti. Patch by Kyotaro Horiguchi, per a design
suggestion from Heikki Linnakangas, with some changes to the
comments by me. Review of this and a prior patch that approached
the issue differently by Heikki Linnakangas, Andres Freund, Álvaro
Herrera, Masahiko Sawada, and Tom Lane.

Back-patch to all supported versions.

Discussion: http://postgr.es/m/byapr06mb6373bf50b469ca393c614257ab...@byapr06mb6373.namprd06.prod.outlook.com
---
 src/backend/access/transam/multixact.c  |  6 ++--
 src/backend/access/transam/twophase.c   | 12 
 src/backend/access/transam/xact.c   |  5 ++--
 src/backend/access/transam/xlog.c   | 16 +--
 src/backend/access/transam/xloginsert.c |  2 +-
 src/backend/catalog/storage.c   | 29 ++-
 src/backend/storage/buffer/bufmgr.c |  6 ++--
 src/backend/storage/ipc/procarray.c | 26 -
 src/backend/storage/lmgr/proc.c |  4 +--
 src/include/storage/proc.h  | 37 -
 src/include/storage/procarray.h |  5 ++--
 11 files changed, 120 insertions(+), 28 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 6a70d49738..9f65c600d0 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -3088,8 +3088,8 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	 * crash/basebackup, even though the state of the data directory would
 	 * require it.
 	 */
-	Assert(!MyProc->delayChkpt);
-	MyProc->delayChkpt = true;
+	Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+	MyProc->delayChkpt |= DELAY_CHKPT_START;
 
 	/* WAL log truncation */
 	WriteMTruncateXlogRec(newOldestMultiDB,
@@ -3115,7 +3115,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	/* Then offsets */
 	PerformOffsetsTruncation(oldestMulti, newOldestMulti);
 
-	MyProc->delayChkpt = false;
+	MyProc->delayChkpt &= ~DELAY_CHKPT_START;
 
 	END_CRIT_SECTION();
 	LWLockRelease(MultiXactTruncationLock);
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 874c8ed125..4dc8ccc12b 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -475,7 +475,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
 	}
 	proc->xid = xid;
 	Assert(proc->xmin == InvalidTransactionId);
-	proc->delayChkpt = false;
+	proc->delayChkpt = 0;
 	proc->statusFlags = 0;
 	proc->pid = 0;
 	proc->databaseId = databaseid;
@@ -1164,7 +1164,8 @@ EndPrepare(GlobalTransaction gxact)
 
 	START_CRIT_SECTION();
 
-	MyProc->delayChkpt = true;
+	Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+	MyProc->delayChkpt |= DELAY_CHKPT_START;
 
 	XLogBeginInsert();
 	for (record = records.head; record != NULL; record = record->next)
@@ -1207,7 +1208,7 @@ EndPrepare(GlobalTransaction gxact)
 	 * checkpoint starting after this will certainly see the gxact as a
 	 * candidate for fsyncing.
 	 */
-	MyProc->delayChkpt = false;
+	

Re: Corruption during WAL replay

2022-03-16 Thread Robert Haas
On Wed, Mar 16, 2022 at 1:14 AM Kyotaro Horiguchi
 wrote:
> storage.c:
> +* Make sure that a concurrent checkpoint can't complete while truncation
> +* is in progress.
> +*
> +* The truncation operation might drop buffers that the checkpoint
> +* otherwise would have flushed. If it does, then it's essential that
> +* the files actually get truncated on disk before the checkpoint record
> +* is written. Otherwise, if replay begins from that checkpoint, the
> +* to-be-truncated buffers might still exist on disk but have older
> +* contents than expected, which can cause replay to fail. It's OK for
> +* the buffers to not exist on disk at all, but not for them to have the
> +* wrong contents.
>
> FWIW, this seems slightly confusing between a buffer and its
> contents.  I can read it correctly, so I don't mind if it is natural
> enough.

Hmm. I think the last two instances of "buffers" in this comment
should actually say "blocks".

> I'll try that; if you are already working on it, please inform me. (It
> is more than likely too late already..)

If you want to take a crack at that, I'd be delighted.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Corruption during WAL replay

2022-03-15 Thread Kyotaro Horiguchi
At Tue, 15 Mar 2022 12:44:49 -0400, Robert Haas  wrote 
in 
> On Wed, Jan 26, 2022 at 3:25 AM Kyotaro Horiguchi
>  wrote:
> > The attached is the fixed version and it surely works with the repro.
> 
> Hi,
> 
> I spent the morning working on this patch and came up with the
> attached version. I wrote substantial comments in RelationTruncate(),
> where I tried to make it more clear exactly what the bug is here, and
> also in storage/proc.h, where I tried to clarify both the use of the
> DELAY_CHKPT_* flags in general terms. If nobody is too sad about this
> version, I plan to commit it.

Thanks for taking this and for the time.  The additional comments
seem to describe the flags more clearly.

storage.c:
+* Make sure that a concurrent checkpoint can't complete while truncation
+* is in progress.
+*
+* The truncation operation might drop buffers that the checkpoint
+* otherwise would have flushed. If it does, then it's essential that
+* the files actually get truncated on disk before the checkpoint record
+* is written. Otherwise, if replay begins from that checkpoint, the
+* to-be-truncated buffers might still exist on disk but have older
+* contents than expected, which can cause replay to fail. It's OK for
+* the buffers to not exist on disk at all, but not for them to have the
+* wrong contents.

FWIW, this seems slightly confusing between a buffer and its
contents.  I can read it correctly, so I don't mind if it is natural
enough.

Otherwise all the added/revised comments look fine. Thanks for the
work.

> I think it should be back-patched, too, but that looks like a bit of a
> pain. I think every back-branch will require different adjustments.

I'll try that; if you are already working on it, please inform me. (It
is more than likely too late already..)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Corruption during WAL replay

2022-03-15 Thread Robert Haas
On Wed, Jan 26, 2022 at 3:25 AM Kyotaro Horiguchi
 wrote:
> The attached is the fixed version and it surely works with the repro.

Hi,

I spent the morning working on this patch and came up with the
attached version. I wrote substantial comments in RelationTruncate(),
where I tried to make it more clear exactly what the bug is here, and
also in storage/proc.h, where I tried to clarify both the use of the
DELAY_CHKPT_* flags in general terms. If nobody is too sad about this
version, I plan to commit it.

I think it should be back-patched, too, but that looks like a bit of a
pain. I think every back-branch will require different adjustments.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v5-0001-Fix-possible-recovery-trouble-if-TRUNCATE-overlap.patch
Description: Binary data


Re: Corruption during WAL replay

2022-01-26 Thread Kyotaro Horiguchi
At Mon, 24 Jan 2022 23:33:20 +0300, Daniel Shelepanov  
wrote in 
> Hi. This is my first attempt to review a patch so feel free to tell me
> if I missed something.

Welcome!

> As of today's state of REL_14_STABLE
> (ef9706bbc8ce917a366e4640df8c603c9605817a), the problem is
> reproducible using the script provided by Daniel Wood in this
> (1335373813.287510.1573611814...@connect.xfinity.com) message. Also,
> the latest patch seems not to be applicable and requires some minor
> tweaks.

Thanks for the info.  The reason for my failure to reproduce was that
checksums were enabled.  After disabling both fpw and checksums (and
wal_log_hints) I was able to reproduce the issue.  And what I found is:

v3 patch:
 >  vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_COMPLETE);
!>  if (0 && nvxids > 0)
 >  {

Ugggh!  It looks like a debugging tweak but it prevents everything
from working.

The attached is the fixed version and it surely works with the repro.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From e4155f4e777f31c019cbbd00f41c7b5eb2320ef7 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 26 Jan 2022 15:07:58 +0900
Subject: [PATCH v4] Delay checkpoint completion until truncation completes

When a relation is being truncated, the buffers for the relation are
dropped without being written to the underlying relation file, so after
that moment the content of the to-be-truncated file is not guaranteed
to be consistent. This is fine in the normal cases where the file is
eventually removed. However, if a checkpoint happens after the buffer
dropping but the file truncation then doesn't happen due to a server
crash or filesystem failure, the files left behind are inconsistent
with the WAL.

This commit delays checkpoint completion until all ongoing file
truncations have finished, to prevent such inconsistency.
---
 src/backend/access/transam/multixact.c  |  6 +++---
 src/backend/access/transam/twophase.c   | 12 +++-
 src/backend/access/transam/xact.c   |  5 +++--
 src/backend/access/transam/xlog.c   | 16 +--
 src/backend/access/transam/xloginsert.c |  2 +-
 src/backend/catalog/storage.c   | 13 +
 src/backend/storage/buffer/bufmgr.c |  6 --
 src/backend/storage/ipc/procarray.c | 26 ++---
 src/backend/storage/lmgr/proc.c |  4 ++--
 src/include/storage/proc.h  |  7 ++-
 src/include/storage/procarray.h |  7 ---
 11 files changed, 76 insertions(+), 28 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 806f2e43ba..70593238ab 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -3075,8 +3075,8 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	 * crash/basebackup, even though the state of the data directory would
 	 * require it.
 	 */
-	Assert(!MyProc->delayChkpt);
-	MyProc->delayChkpt = true;
+	Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+	MyProc->delayChkpt |= DELAY_CHKPT_START;
 
 	/* WAL log truncation */
 	WriteMTruncateXlogRec(newOldestMultiDB,
@@ -3102,7 +3102,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	/* Then offsets */
 	PerformOffsetsTruncation(oldestMulti, newOldestMulti);
 
-	MyProc->delayChkpt = false;
+	MyProc->delayChkpt &= ~DELAY_CHKPT_START;
 
 	END_CRIT_SECTION();
 	LWLockRelease(MultiXactTruncationLock);
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 271a3146db..2b1a25e723 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -474,7 +474,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
 	}
 	proc->xid = xid;
 	Assert(proc->xmin == InvalidTransactionId);
-	proc->delayChkpt = false;
+	proc->delayChkpt = 0;
 	proc->statusFlags = 0;
 	proc->pid = 0;
 	proc->databaseId = databaseid;
@@ -1165,7 +1165,8 @@ EndPrepare(GlobalTransaction gxact)
 
 	START_CRIT_SECTION();
 
-	MyProc->delayChkpt = true;
+	Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+	MyProc->delayChkpt |= DELAY_CHKPT_START;
 
 	XLogBeginInsert();
 	for (record = records.head; record != NULL; record = record->next)
@@ -1208,7 +1209,7 @@ EndPrepare(GlobalTransaction gxact)
 	 * checkpoint starting after this will certainly see the gxact as a
 	 * candidate for fsyncing.
 	 */
-	MyProc->delayChkpt = false;
+	MyProc->delayChkpt &= ~DELAY_CHKPT_START;
 
 	/*
 	 * Remember that we have this GlobalTransaction entry locked for us.  If
@@ -2267,7 +2268,8 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
-	MyProc->delayChkpt = true;
+	Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+	MyProc->delayChkpt |= DELAY_CHKPT_START;
 
 	/*
 	 * Emit the XLOG commit record. Note that we mark 2PC commits as
@@ -2315,7 +2317,7 @@ 

Re: Corruption during WAL replay

2022-01-24 Thread Daniel Shelepanov



On 27.09.2021 11:30, Kyotaro Horiguchi wrote:

Thank you for the comments! (Sorry for the late response.)

At Tue, 10 Aug 2021 14:14:05 -0400, Robert Haas  wrote in

On Thu, Mar 4, 2021 at 10:01 PM Kyotaro Horiguchi
 wrote:

The patch assumed that CHKPT_START/COMPLETE barrier are exclusively
used each other, but MarkBufferDirtyHint which delays checkpoint start
is called in RelationTruncate while delaying checkpoint completion.
That is not a strange nor harmful behavior.  I changed delayChkpt to a
bitmap integer from an enum so that both barrier are separately
triggered.

I'm not sure this is the way to go here, though.  This fixes the issue
of a crash during RelationTruncate, but the issue of smgrtruncate
failure during RelationTruncate still remains (unless we treat that
failure as PANIC?).

I like this patch. As I understand it, we're currently cheating by
allowing checkpoints to complete without necessarily flushing all of
the pages that were dirty at the time we fixed the redo pointer out to
disk. We think this is OK because we know that those pages are going
to get truncated away, but it's not really OK because when the system
starts up, it has to replay WAL starting from the checkpoint's redo
pointer, but the state of the page is not the same as it was at the
time when the redo pointer was the end of WAL, so redo fails. In the
case described in
http://postgr.es/m/byapr06mb63739b2692dc6dbb3c5f186cab...@byapr06mb6373.namprd06.prod.outlook.com
modifications are made to the page before the redo pointer is fixed
and those changes never make it to disk, but the truncation also never
makes it to the disk either. With this patch, that can't happen,
because no checkpoint can intervene between when we (1) decide we're
not going to bother writing those dirty pages and (2) actually
truncate them away. So either the pages will get written as part of
the checkpoint, or else they'll be gone before the checkpoint
completes. In the latter case, I suppose redo that would have modified
those pages will just be skipped, thus dodging the problem.

I think your understanding is right.


In RelationTruncate, I suggest that we ought to clear the
delay-checkpoint flag before rather than after calling
FreeSpaceMapVacuumRange. Since the free space map is not fully
WAL-logged, anything we're doing there should be non-critical. Also, I

Agreed and fixed.


think it might be better if MarkBufferDirtyHint stays closer to the
existing coding and just uses a Boolean and an if-test to decide
whether to clear the bit, instead of inventing a new mechanism. I
don't really see anything wrong with the new mechanism, but I think
it's better to keep the patch minimal.

Yeah, that was a a kind of silly. Fixed.


As you say, this doesn't fix the problem that truncation might fail.
But as Andres and Sawada-san said, the solution to that is to get rid
of the comments saying that it's OK for truncation to fail and make it
a PANIC. However, I don't think that change needs to be part of this
patch. Even if we do that, we still need to do this. And even if we do
this, we still need to do that.

Ok. In addition to the above, I rewrote the comment in RelationTruncate.

+* Delay the concurrent checkpoint's completion until this truncation
+* successfully completes, so that we don't establish a redo-point between
+* buffer deletion and file-truncate. Otherwise we can leave inconsistent
+* file content against the WAL records after the REDO position and future
+* recovery fails.

However, a problem for me for now is that I cannot reproduce the
problem.

To avoid further confusion, the attached is named as *.patch.

regards.

Hi. This is my first attempt to review a patch so feel free to tell me 
if I missed something.


As of today's state of REL_14_STABLE 
(ef9706bbc8ce917a366e4640df8c603c9605817a), the problem is reproducible 
using the script provided by Daniel Wood in this 
(1335373813.287510.1573611814...@connect.xfinity.com) message. Also, the 
latest patch seems not to be applicable and requires some minor tweaks.



Regards,

Daniel Shelepanov





Re: Corruption during WAL replay

2021-09-27 Thread Kyotaro Horiguchi
Thank you for the comments! (Sorry for the late response.)

At Tue, 10 Aug 2021 14:14:05 -0400, Robert Haas  wrote 
in 
> On Thu, Mar 4, 2021 at 10:01 PM Kyotaro Horiguchi
>  wrote:
> > The patch assumed that the CHKPT_START/COMPLETE barriers are mutually
> > exclusive, but MarkBufferDirtyHint, which delays checkpoint start, is
> > called in RelationTruncate while delaying checkpoint completion. That is
> > neither strange nor harmful behavior.  I changed delayChkpt from an enum
> > to a bitmap integer so that both barriers can be raised separately.
> >
> > I'm not sure this is the way to go here, though.  This fixes the issue
> > of a crash during RelationTruncate, but the issue of smgrtruncate
> > failure during RelationTruncate still remains (unless we treat that
> > failure as PANIC?).
> 
> I like this patch. As I understand it, we're currently cheating by
> allowing checkpoints to complete without necessarily flushing all of
> the pages that were dirty at the time we fixed the redo pointer out to
> disk. We think this is OK because we know that those pages are going
> to get truncated away, but it's not really OK because when the system
> starts up, it has to replay WAL starting from the checkpoint's redo
> pointer, but the state of the page is not the same as it was at the
> time when the redo pointer was the end of WAL, so redo fails. In the
> case described in
> http://postgr.es/m/byapr06mb63739b2692dc6dbb3c5f186cab...@byapr06mb6373.namprd06.prod.outlook.com
> modifications are made to the page before the redo pointer is fixed
> and those changes never make it to disk, but the truncation also never
> makes it to the disk either. With this patch, that can't happen,
> because no checkpoint can intervene between when we (1) decide we're
> not going to bother writing those dirty pages and (2) actually
> truncate them away. So either the pages will get written as part of
> the checkpoint, or else they'll be gone before the checkpoint
> completes. In the latter case, I suppose redo that would have modified
> those pages will just be skipped, thus dodging the problem.

I think your understanding is right.

> In RelationTruncate, I suggest that we ought to clear the
> delay-checkpoint flag before rather than after calling
> FreeSpaceMapVacuumRange. Since the free space map is not fully
> WAL-logged, anything we're doing there should be non-critical. Also, I

Agreed and fixed.

> think it might be better if MarkBufferDirtyHint stays closer to the
> existing coding and just uses a Boolean and an if-test to decide
> whether to clear the bit, instead of inventing a new mechanism. I
> don't really see anything wrong with the new mechanism, but I think
> it's better to keep the patch minimal.

Yeah, that was kind of silly. Fixed.

> As you say, this doesn't fix the problem that truncation might fail.
> But as Andres and Sawada-san said, the solution to that is to get rid
> of the comments saying that it's OK for truncation to fail and make it
> a PANIC. However, I don't think that change needs to be part of this
> patch. Even if we do that, we still need to do this. And even if we do
> this, we still need to do that.

Ok. In addition to the above, I rewrote the comment in RelationTruncate.

+* Delay the concurrent checkpoint's completion until this truncation
+* successfully completes, so that we don't establish a redo-point between
+* buffer deletion and file-truncate. Otherwise we can leave inconsistent
+* file content against the WAL records after the REDO position and future
+* recovery fails.

However, a problem for me for now is that I cannot reproduce the
problem.

To avoid further confusion, the attached is named as *.patch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index e6c70ed0bc..17357179e3 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -3075,8 +3075,8 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	 * crash/basebackup, even though the state of the data directory would
 	 * require it.
 	 */
-	Assert(!MyProc->delayChkpt);
-	MyProc->delayChkpt = true;
+	Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+	MyProc->delayChkpt |= DELAY_CHKPT_START;
 
 	/* WAL log truncation */
 	WriteMTruncateXlogRec(newOldestMultiDB,
@@ -3102,7 +3102,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	/* Then offsets */
 	PerformOffsetsTruncation(oldestMulti, newOldestMulti);
 
-	MyProc->delayChkpt = false;
+	MyProc->delayChkpt &= ~DELAY_CHKPT_START;
 
 	END_CRIT_SECTION();
 	LWLockRelease(MultiXactTruncationLock);
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 2156de187c..b7dc84d6e3 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -463,7 +463,7 @@ 

Re: Corruption during WAL replay

2021-09-27 Thread Kyotaro Horiguchi
Thanks for looking at this, Robert and Tom.

At Fri, 24 Sep 2021 16:22:28 -0400, Tom Lane  wrote in 
> Robert Haas  writes:
> > On Fri, Sep 24, 2021 at 3:42 PM Tom Lane  wrote:
> >> I think the basic idea is about right, but I'm not happy with the
> >> three-way delayChkpt business; that seems too cute by three-quarters.
> 
> > Nobody, but the version of the patch that I was looking at uses a
> > separate bit for each one:
> 
> > +/* symbols for PGPROC.delayChkpt */
> > +#define DELAY_CHKPT_START (1<<0)
> > +#define DELAY_CHKPT_COMPLETE (1<<1)
> 
> Hm, that's not in the patch version that the CF app claims to be
> latest [1].  It does this:
> 
> +/* type for PGPROC.delayChkpt */
> +typedef enum DelayChkptType
> +{
> + DELAY_CHKPT_NONE = 0,
> + DELAY_CHKPT_START,
> + DELAY_CHKPT_COMPLETE
> +} DelayChkptType;
> 
> which seems like a distinct disimprovement over what you're quoting.
 
Yeah, that is because the latest patch is not attached as *.patch/diff
but as *.txt.  I didn't name it *.patch in order to avoid adding a noise
patch to that thread, although it was too late. Instead, that seems to
have led to another problem.

Tom's concern is right.  Actually, both events can happen simultaneously,
and the latest *.patch.txt handles that case as Robert said.

One advantage of having the two flags as one bitmap integer is that it
slightly simplifies the logic in GetVirtualXIDsDelayingChkpt and
HaveVirtualXIDsDelayingChkpt. On the other hand, it makes setting and
resetting the flags slightly more complex.

GetVirtualXIDsDelayingChkpt:
+   if ((proc->delayChkpt & type) != 0)

vs

+   if (delayStart)
+   delayflag = proc->delayChkptStart;
+   else
+   delayflag = proc->delayChkptEnd;
+   if (delayflag != 0)
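
As a concrete illustration of the first (bitmap) form, here is a small
self-contained sketch (not code from the patch; the ProcSlot type and the
function name are invented for this example) of a scan where the caller
simply passes whichever barrier it cares about:

#include <stdbool.h>
#include <stddef.h>

#define DELAY_CHKPT_START     (1 << 0)
#define DELAY_CHKPT_COMPLETE  (1 << 1)

/* Stand-in for the PGPROC fields that matter here. */
typedef struct ProcSlot
{
	int		delayChkpt;	/* bitmap of DELAY_CHKPT_* flags */
} ProcSlot;

/*
 * Sketch of a HaveVirtualXIDsDelayingChkpt-style test: "type" may be
 * DELAY_CHKPT_START, DELAY_CHKPT_COMPLETE, or both OR'ed together, so the
 * loop never needs to branch on which barrier the caller means.
 */
static bool
AnyBackendDelayingChkpt(const ProcSlot *procs, size_t nprocs, int type)
{
	for (size_t i = 0; i < nprocs; i++)
	{
		if ((procs[i].delayChkpt & type) != 0)
			return true;
	}
	return false;
}

int
main(void)
{
	ProcSlot	procs[2] = {{DELAY_CHKPT_COMPLETE}, {0}};

	return AnyBackendDelayingChkpt(procs, 2, DELAY_CHKPT_START) ? 1 : 0;
}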

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Corruption during WAL replay

2021-09-24 Thread Tom Lane
Robert Haas  writes:
> On Fri, Sep 24, 2021 at 3:42 PM Tom Lane  wrote:
>> I think the basic idea is about right, but I'm not happy with the
>> three-way delayChkpt business; that seems too cute by three-quarters.

> Nobody, but the version of the patch that I was looking at uses a
> separate bit for each one:

> +/* symbols for PGPROC.delayChkpt */
> +#define DELAY_CHKPT_START (1<<0)
> +#define DELAY_CHKPT_COMPLETE (1<<1)

Hm, that's not in the patch version that the CF app claims to be
latest [1].  It does this:

+/* type for PGPROC.delayChkpt */
+typedef enum DelayChkptType
+{
+   DELAY_CHKPT_NONE = 0,
+   DELAY_CHKPT_START,
+   DELAY_CHKPT_COMPLETE
+} DelayChkptType;

which seems like a distinct disimprovement over what you're quoting.

regards, tom lane

[1] 
https://www.postgresql.org/message-id/20210106.173327.1444585955309078930.horikyota@gmail.com




Re: Corruption during WAL replay

2021-09-24 Thread Robert Haas
On Fri, Sep 24, 2021 at 3:42 PM Tom Lane  wrote:
> Robert Haas  writes:
> > I like this patch.
>
> I think the basic idea is about right, but I'm not happy with the
> three-way delayChkpt business; that seems too cute by three-quarters.
> I think two independent boolean flags, one saying "I'm preventing
> checkpoint start" and one saying "I'm preventing checkpoint completion",
> would be much less confusing and also more future-proof.  Who's to say
> that we won't ever need both states to be set in the same process?

Nobody, but the version of the patch that I was looking at uses a
separate bit for each one:

+/* symbols for PGPROC.delayChkpt */
+#define DELAY_CHKPT_START (1<<0)
+#define DELAY_CHKPT_COMPLETE (1<<1)

One could instead use separate Booleans, but there doesn't seem to be
anything three-way about this?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Corruption during WAL replay

2021-09-24 Thread Tom Lane
Robert Haas  writes:
> I like this patch.

I think the basic idea is about right, but I'm not happy with the
three-way delayChkpt business; that seems too cute by three-quarters.
I think two independent boolean flags, one saying "I'm preventing
checkpoint start" and one saying "I'm preventing checkpoint completion",
would be much less confusing and also more future-proof.  Who's to say
that we won't ever need both states to be set in the same process?

I also dislike the fact that the patch has made procarray.h depend
on proc.h ... maybe I'm wrong, but I thought that there was a reason
for keeping those independent, if indeed this hasn't actually resulted
in a circular-includes situation.  If we avoid inventing that enum type
then there's no need for that.  If we do need an enum, maybe it could
be put in some already-common prerequisite header.

regards, tom lane




Re: Corruption during WAL replay

2021-08-10 Thread Robert Haas
On Thu, Mar 4, 2021 at 10:01 PM Kyotaro Horiguchi
 wrote:
> The patch assumed that the CHKPT_START/COMPLETE barriers are mutually
> exclusive, but MarkBufferDirtyHint, which delays checkpoint start, is
> called in RelationTruncate while delaying checkpoint completion. That is
> neither strange nor harmful behavior.  I changed delayChkpt from an enum
> to a bitmap integer so that both barriers can be raised separately.
>
> I'm not sure this is the way to go here, though.  This fixes the issue
> of a crash during RelationTruncate, but the issue of smgrtruncate
> failure during RelationTruncate still remains (unless we treat that
> failure as PANIC?).

I like this patch. As I understand it, we're currently cheating by
allowing checkpoints to complete without necessarily flushing all of
the pages that were dirty at the time we fixed the redo pointer out to
disk. We think this is OK because we know that those pages are going
to get truncated away, but it's not really OK because when the system
starts up, it has to replay WAL starting from the checkpoint's redo
pointer, but the state of the page is not the same as it was at the
time when the redo pointer was the end of WAL, so redo fails. In the
case described in
http://postgr.es/m/byapr06mb63739b2692dc6dbb3c5f186cab...@byapr06mb6373.namprd06.prod.outlook.com
modifications are made to the page before the redo pointer is fixed
and those changes never make it to disk, but the truncation also never
makes it to the disk either. With this patch, that can't happen,
because no checkpoint can intervene between when we (1) decide we're
not going to bother writing those dirty pages and (2) actually
truncate them away. So either the pages will get written as part of
the checkpoint, or else they'll be gone before the checkpoint
completes. In the latter case, I suppose redo that would have modified
those pages will just be skipped, thus dodging the problem.
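
As a toy, self-contained sketch of that ordering (this is not the real
RelationTruncate(); MyProc, the buffer manager and the smgr calls are
replaced by stand-ins, and only the DELAY_CHKPT_COMPLETE handling follows
the approach described in the thread):

#include <stdio.h>

#define DELAY_CHKPT_START     (1 << 0)
#define DELAY_CHKPT_COMPLETE  (1 << 1)

/* Toy stand-in for MyProc->delayChkpt. */
static int my_delayChkpt = 0;

static void wal_log_truncation(void)  { puts("emit smgr-truncate WAL record"); }
static void drop_dirty_buffers(void)  { puts("discard to-be-truncated buffers"); }
static void truncate_file(void)       { puts("shorten the relation file on disk"); }

/*
 * The completion barrier is raised before any dirty buffer is discarded and
 * released only after the file has been shortened, so no checkpoint can
 * establish a redo point in between.
 */
static void
relation_truncate_sketch(void)
{
	my_delayChkpt |= DELAY_CHKPT_COMPLETE;	/* block checkpoint completion */

	wal_log_truncation();
	drop_dirty_buffers();
	truncate_file();

	my_delayChkpt &= ~DELAY_CHKPT_COMPLETE;	/* checkpoints may finish again */
}

int
main(void)
{
	relation_truncate_sketch();
	return 0;
}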

In RelationTruncate, I suggest that we ought to clear the
delay-checkpoint flag before rather than after calling
FreeSpaceMapVacuumRange. Since the free space map is not fully
WAL-logged, anything we're doing there should be non-critical. Also, I
think it might be better if MarkBufferDirtyHint stays closer to the
existing coding and just uses a Boolean and an if-test to decide
whether to clear the bit, instead of inventing a new mechanism. I
don't really see anything wrong with the new mechanism, but I think
it's better to keep the patch minimal.

As you say, this doesn't fix the problem that truncation might fail.
But as Andres and Sawada-san said, the solution to that is to get rid
of the comments saying that it's OK for truncation to fail and make it
a PANIC. However, I don't think that change needs to be part of this
patch. Even if we do that, we still need to do this. And even if we do
this, we still need to do that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Corruption during WAL replay

2021-03-04 Thread Kyotaro Horiguchi
At Thu, 4 Mar 2021 22:37:23 +0500, Ibrar Ahmed  wrote in 
> The regression is failing for this patch, do you mind look at that and send
> the updated patch?
> 
> https://api.cirrus-ci.com/v1/task/6313174510075904/logs/test.log
> 
> ...
> t/006_logical_decoding.pl  ok
> t/007_sync_rep.pl  ok
> Bailout called.  Further testing stopped:  system pg_ctl failed
> FAILED--Further testing stopped: system pg_ctl failed
> make[2]: *** [Makefile:19: check] Error 255
> make[1]: *** [Makefile:49: check-recovery-recurse] Error 2
> make: *** [GNUmakefile:71: check-world-src/test-recurse] Error 2
> ...

(I regret that I sent this as .patch file..)

Thanks for pointing that out!

The patch assumed that the CHKPT_START/COMPLETE barriers are mutually
exclusive, but MarkBufferDirtyHint, which delays checkpoint start, is
called in RelationTruncate while delaying checkpoint completion. That is
neither strange nor harmful behavior.  I changed delayChkpt from an enum
to a bitmap integer so that both barriers can be raised separately.
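
As a toy illustration of why the bitmap form copes with that case (both
barriers raised by the same backend, then released independently), consider
the following self-contained sketch; FakeProc is a stand-in for PGPROC and
only the flag values follow the patch:

#include <assert.h>

#define DELAY_CHKPT_START     (1 << 0)
#define DELAY_CHKPT_COMPLETE  (1 << 1)

typedef struct FakeProc
{
	int		delayChkpt;	/* bitmap of DELAY_CHKPT_* flags */
} FakeProc;

int
main(void)
{
	FakeProc	proc = {0};

	/* RelationTruncate raises the completion barrier ... */
	proc.delayChkpt |= DELAY_CHKPT_COMPLETE;

	/* ... and MarkBufferDirtyHint, called inside it, raises the start one. */
	proc.delayChkpt |= DELAY_CHKPT_START;

	/* Both barriers coexist in one backend ... */
	assert(proc.delayChkpt == (DELAY_CHKPT_START | DELAY_CHKPT_COMPLETE));

	/* ... and each can be dropped without disturbing the other. */
	proc.delayChkpt &= ~DELAY_CHKPT_START;
	assert(proc.delayChkpt == DELAY_CHKPT_COMPLETE);

	return 0;
}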

I'm not sure this is the way to go here, though.  This fixes the issue
of a crash during RelationTruncate, but the issue of smgrtruncate
failure during RelationTruncate still remains (unless we treat that
failure as PANIC?).

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 1f9f1a1fa1..c1b0b48362 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -3072,8 +3072,8 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 * crash/basebackup, even though the state of the data directory would
 * require it.
 */
-   Assert(!MyProc->delayChkpt);
-   MyProc->delayChkpt = true;
+   Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+   MyProc->delayChkpt |= DELAY_CHKPT_START;
 
/* WAL log truncation */
WriteMTruncateXlogRec(newOldestMultiDB,
@@ -3099,7 +3099,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
/* Then offsets */
PerformOffsetsTruncation(oldestMulti, newOldestMulti);
 
-   MyProc->delayChkpt = false;
+   MyProc->delayChkpt &= ~DELAY_CHKPT_START;
 
END_CRIT_SECTION();
LWLockRelease(MultiXactTruncationLock);
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 80d2d20d6c..85c720491b 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -463,7 +463,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
proc->lxid = (LocalTransactionId) xid;
proc->xid = xid;
Assert(proc->xmin == InvalidTransactionId);
-   proc->delayChkpt = false;
+   proc->delayChkpt = 0;
proc->statusFlags = 0;
proc->pid = 0;
proc->backendId = InvalidBackendId;
@@ -1109,7 +1109,8 @@ EndPrepare(GlobalTransaction gxact)
 
START_CRIT_SECTION();
 
-   MyProc->delayChkpt = true;
+   Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+   MyProc->delayChkpt |= DELAY_CHKPT_START;
 
XLogBeginInsert();
for (record = records.head; record != NULL; record = record->next)
@@ -1152,7 +1153,7 @@ EndPrepare(GlobalTransaction gxact)
 * checkpoint starting after this will certainly see the gxact as a
 * candidate for fsyncing.
 */
-   MyProc->delayChkpt = false;
+   MyProc->delayChkpt &= ~DELAY_CHKPT_START;
 
/*
 * Remember that we have this GlobalTransaction entry locked for us.  If
@@ -2198,7 +2199,8 @@ RecordTransactionCommitPrepared(TransactionId xid,
START_CRIT_SECTION();
 
/* See notes in RecordTransactionCommit */
-   MyProc->delayChkpt = true;
+   Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+   MyProc->delayChkpt |= DELAY_CHKPT_START;
 
/*
 * Emit the XLOG commit record. Note that we mark 2PC commits as
@@ -2246,7 +2248,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
TransactionIdCommitTree(xid, nchildren, children);
 
/* Checkpoint can proceed now */
-   MyProc->delayChkpt = false;
+   MyProc->delayChkpt &= ~DELAY_CHKPT_START;
 
END_CRIT_SECTION();
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 4e6a3df6b8..f033e8940a 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1334,8 +1334,9 @@ RecordTransactionCommit(void)
 * This makes checkpoint's determination of which xacts are delayChkpt
 * a bit fuzzy, but it doesn't matter.
 */
+   Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
START_CRIT_SECTION();
-   MyProc->delayChkpt = true;
+   MyProc->delayChkpt |= DELAY_CHKPT_START;
 
 

Re: Corruption during WAL replay

2021-03-04 Thread Ibrar Ahmed
On Wed, Jan 6, 2021 at 1:33 PM Kyotaro Horiguchi 
wrote:

> At Mon, 17 Aug 2020 11:22:15 -0700, Andres Freund 
> wrote in
> > Hi,
> >
> > On 2020-08-17 14:05:37 +0300, Heikki Linnakangas wrote:
> > > On 14/04/2020 22:04, Teja Mupparti wrote:
> > > > Thanks Kyotaro and Masahiko for the feedback. I think there is a
> > > > consensus on the critical-section around truncate,
> > >
> > > +1
> >
> > I'm inclined to think that we should do that independent of the far more
> > complicated fix for other related issues.
> ...
> > > Perhaps a better approach would be to prevent the checkpoint from
> > > completing, until all in-progress truncations have completed. We have a
> > > mechanism to wait out in-progress commits at the beginning of a
> checkpoint,
> > > right after the redo point has been established. See comments around
> the
> > > GetVirtualXIDsDelayingChkpt() function call in CreateCheckPoint(). We
> could
> > > have a similar mechanism to wait out the truncations before
> *completing* a
> > > checkpoint.
> >
> > What I outlined earlier *is* essentially a way to do so, by preventing
> > checkpointing from finishing the buffer scan while a dangerous state
> > exists.
>
> Seems reasonable. The attached does that. It actually works for the
> initial case.
>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
>

The regression is failing for this patch, do you mind look at that and send
the updated patch?

https://api.cirrus-ci.com/v1/task/6313174510075904/logs/test.log

...
t/006_logical_decoding.pl  ok
t/007_sync_rep.pl  ok
Bailout called.  Further testing stopped:  system pg_ctl failed
FAILED--Further testing stopped: system pg_ctl failed
make[2]: *** [Makefile:19: check] Error 255
make[1]: *** [Makefile:49: check-recovery-recurse] Error 2
make: *** [GNUmakefile:71: check-world-src/test-recurse] Error 2
...

-- 
Ibrar Ahmed


Re: Corruption during WAL replay

2021-01-06 Thread Kyotaro Horiguchi
At Mon, 17 Aug 2020 11:22:15 -0700, Andres Freund  wrote in 
> Hi,
> 
> On 2020-08-17 14:05:37 +0300, Heikki Linnakangas wrote:
> > On 14/04/2020 22:04, Teja Mupparti wrote:
> > > Thanks Kyotaro and Masahiko for the feedback. I think there is a
> > > consensus on the critical-section around truncate,
> > 
> > +1
> 
> I'm inclined to think that we should do that independent of the far more
> complicated fix for other related issues.
...
> > Perhaps a better approach would be to prevent the checkpoint from
> > completing, until all in-progress truncations have completed. We have a
> > mechanism to wait out in-progress commits at the beginning of a checkpoint,
> > right after the redo point has been established. See comments around the
> > GetVirtualXIDsDelayingChkpt() function call in CreateCheckPoint(). We could
> > have a similar mechanism to wait out the truncations before *completing* a
> > checkpoint.
> 
> What I outlined earlier *is* essentially a way to do so, by preventing
> checkpointing from finishing the buffer scan while a dangerous state
> exists.

Seems reasonable. The attached does that. It actually works for the
initial case.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 1233448481..e1d3068f14 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -3062,8 +3062,8 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	 * crash/basebackup, even though the state of the data directory would
 	 * require it.
 	 */
-	Assert(!MyProc->delayChkpt);
-	MyProc->delayChkpt = true;
+	Assert(MyProc->delayChkpt == DELAY_CHKPT_NONE);
+	MyProc->delayChkpt = DELAY_CHKPT_START;
 
 	/* WAL log truncation */
 	WriteMTruncateXlogRec(newOldestMultiDB,
@@ -3089,7 +3089,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	/* Then offsets */
 	PerformOffsetsTruncation(oldestMulti, newOldestMulti);
 
-	MyProc->delayChkpt = false;
+	MyProc->delayChkpt = DELAY_CHKPT_NONE;
 
 	END_CRIT_SECTION();
 	LWLockRelease(MultiXactTruncationLock);
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index fc18b77832..1b74a229d6 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -463,7 +463,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
 	proc->lxid = (LocalTransactionId) xid;
 	proc->xid = xid;
 	Assert(proc->xmin == InvalidTransactionId);
-	proc->delayChkpt = false;
+	proc->delayChkpt = DELAY_CHKPT_NONE;
 	proc->statusFlags = 0;
 	proc->pid = 0;
 	proc->backendId = InvalidBackendId;
@@ -1108,7 +1108,8 @@ EndPrepare(GlobalTransaction gxact)
 
 	START_CRIT_SECTION();
 
-	MyProc->delayChkpt = true;
+	Assert(MyProc->delayChkpt == DELAY_CHKPT_NONE);
+	MyProc->delayChkpt = DELAY_CHKPT_START;
 
 	XLogBeginInsert();
 	for (record = records.head; record != NULL; record = record->next)
@@ -1151,7 +1152,7 @@ EndPrepare(GlobalTransaction gxact)
 	 * checkpoint starting after this will certainly see the gxact as a
 	 * candidate for fsyncing.
 	 */
-	MyProc->delayChkpt = false;
+	MyProc->delayChkpt = DELAY_CHKPT_NONE;
 
 	/*
 	 * Remember that we have this GlobalTransaction entry locked for us.  If
@@ -2199,7 +2200,8 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
-	MyProc->delayChkpt = true;
+	Assert(MyProc->delayChkpt == DELAY_CHKPT_NONE);
+	MyProc->delayChkpt = DELAY_CHKPT_START;
 
 	/*
 	 * Emit the XLOG commit record. Note that we mark 2PC commits as
@@ -2247,7 +2249,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	TransactionIdCommitTree(xid, nchildren, children);
 
 	/* Checkpoint can proceed now */
-	MyProc->delayChkpt = false;
+	MyProc->delayChkpt = DELAY_CHKPT_NONE;
 
 	END_CRIT_SECTION();
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index a2068e3fd4..000206e506 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1334,8 +1334,9 @@ RecordTransactionCommit(void)
 		 * This makes checkpoint's determination of which xacts are delayChkpt
 		 * a bit fuzzy, but it doesn't matter.
 		 */
+		Assert(MyProc->delayChkpt == DELAY_CHKPT_NONE);
 		START_CRIT_SECTION();
-		MyProc->delayChkpt = true;
+		MyProc->delayChkpt = DELAY_CHKPT_START;
 
 		SetCurrentTransactionStopTimestamp();
 
@@ -1436,7 +1437,7 @@ RecordTransactionCommit(void)
 	 */
 	if (markXidCommitted)
 	{
-		MyProc->delayChkpt = false;
+		MyProc->delayChkpt = DELAY_CHKPT_NONE;
 		END_CRIT_SECTION();
 	}
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ede93ad7fd..5b0a653408 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9024,18 +9024,30 @@ CreateCheckPoint(int flags)
 	 * and we will correctly flush the update below. 

Re: Corruption during WAL replay

2020-12-01 Thread Anastasia Lubennikova

On 06.11.2020 14:40, Masahiko Sawada wrote:


So I agree to
proceed with the patch that adds a critical section independent of
fixing other related things discussed in this thread. If Teja seems
not to work on this I’ll write the patch.

Regards,


--
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/



Status update for a commitfest entry.

The commitfest is closed now. As this entry is a bug fix, I am moving it 
to the next CF.

Are you planning to continue working on it?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: Corruption during WAL replay

2020-11-06 Thread Masahiko Sawada
On Tue, Aug 18, 2020 at 3:22 AM Andres Freund  wrote:
>
> Hi,
>
> On 2020-08-17 14:05:37 +0300, Heikki Linnakangas wrote:
> > On 14/04/2020 22:04, Teja Mupparti wrote:
> > > Thanks Kyotaro and Masahiko for the feedback. I think there is a
> > > consensus on the critical-section around truncate,
> >
> > +1
>
> I'm inclined to think that we should do that independent of the far more
> complicated fix for other related issues.

+1

If we had a critical section in RelationTruncate(), crash recovery
would continue failing until the situation of the underlying file is
recovered if a PANIC happens. The current comment in
RelationTruncate() says it’s worse than the disease. But considering
physical replication, as Andres mentioned, a failure to truncate the
file after logging WAL is no longer a harmless failure. Also, the
critical section would be necessary even if we reversed the order of
truncation and dropping buffers and resolved the issue. So I agree to
proceed with the patch that adds a critical section independent of
fixing other related things discussed in this thread. If Teja seems
not to work on this I’ll write the patch.

Regards,


--
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/




Re: Corruption during WAL replay

2020-10-30 Thread Anastasia Lubennikova
Status update for a commitfest entry.

I see quite a few unanswered questions in the thread since the last patch 
version was sent. So, I move it to "Waiting on Author".

The new status of this patch is: Waiting on Author


Re: Corruption during WAL replay

2020-08-17 Thread Andres Freund
Hi,

On 2020-08-17 14:05:37 +0300, Heikki Linnakangas wrote:
> On 14/04/2020 22:04, Teja Mupparti wrote:
> > Thanks Kyotaro and Masahiko for the feedback. I think there is a
> > consensus on the critical-section around truncate,
> 
> +1

I'm inclined to think that we should do that independent of the far more
complicated fix for other related issues.


> > but I just want to emphasize the need for reversing the order of the
> > dropping the buffers and the truncation.
> > 
> >   Repro details (when full page write = off)
> > 
> >           1) Page on disk has empty LP 1, Insert into page LP 1
> >           2) checkpoint START (Recovery REDO eventually starts here)
> >           3) Delete all rows on the page (page is empty now)
> >           4) Autovacuum kicks in and truncates the pages
> >                   DropRelFileNodeBuffers - Dirty page NOT written, LP 1
> > on disk still empty
> >           5) Checkpoint completes
> >           6) Crash
> >           7) smgrtruncate - Not reached (this is where we do the
> > physical truncate)
> > 
> >   Now the crash-recovery starts
> > 
> >           Delete-log-replay (above step-3) reads page with empty LP 1
> > and the delete fails with PANIC (old page on disk with no insert)
> > 
> > Doing recovery, truncate is even not reached, a WAL replay of the
> > truncation will happen in the future but the recovery fails (repeatedly)
> > even before reaching that point.
> 
> Hmm. I think simply reversing the order of DropRelFileNodeBuffers() and
> truncating the file would open a different issue:
> 
>   1) Page on disk has empty LP 1, Insert into page LP 1
>   2) checkpoint START (Recovery REDO eventually starts here)
>   3) Delete all rows on the page (page is empty now)
>   4) Autovacuum kicks in and starts truncating
>   5) smgrtruncate() truncates the file
>   6) checkpoint writes out buffers for pages that were just truncated away,
> expanding the file again.
> 
> Your patch had a mechanism to mark the buffers as io-in-progress before
> truncating the file to fix that, but I'm wary of that approach. Firstly, it
> requires scanning the buffers that are dropped twice, which can take a long
> time.

I was thinking that we'd keep track of all the buffers marked as "in
progress" that way, avoiding the second scan.

It's also worth keeping in mind that this code is really only relevant
for partial truncations, which don't happen at the same frequency as
transactional truncations.


> I remember that people have already complained that
> DropRelFileNodeBuffers() is slow, when it has to scan all the buffers
> once.

But that's when dropping many relations, normally. E.g. at the end of a
regression test.


> More importantly, abusing the BM_IO_INPROGRESS flag for this seems
> bad. For starters, because you're not holding buffer's I/O lock, I
> believe the checkpointer would busy-wait on the buffers until the
> truncation has completed. See StartBufferIO() and AbortBufferIO().

I think we should apply Robert's patch that makes io locks into
condition variables. Then we can fairly easily have many many buffers io
locked.  Obviously there's some issues with doing so in the back
branches :(

I'm working on an AIO branch, and that also requires to be able to mark
multiple buffers as in-progress, FWIW.


> Perhaps a better approach would be to prevent the checkpoint from
> completing, until all in-progress truncations have completed. We have a
> mechanism to wait out in-progress commits at the beginning of a checkpoint,
> right after the redo point has been established. See comments around the
> GetVirtualXIDsDelayingChkpt() function call in CreateCheckPoint(). We could
> have a similar mechanism to wait out the truncations before *completing* a
> checkpoint.

What I outlined earlier *is* essentially a way to do so, by preventing
checkpointing from finishing the buffer scan while a dangerous state
exists.

Greetings,

Andres Freund




Re: Corruption during WAL replay

2020-08-17 Thread Heikki Linnakangas

On 14/04/2020 22:04, Teja Mupparti wrote:
Thanks Kyotaro and Masahiko for the feedback. I think there is a 
consensus on the critical-section around truncate,


+1


but I just want to emphasize the need for reversing the order of the
dropping the buffers and the truncation.

  Repro details (when full page write = off)

          1) Page on disk has empty LP 1, Insert into page LP 1
          2) checkpoint START (Recovery REDO eventually starts here)
          3) Delete all rows on the page (page is empty now)
          4) Autovacuum kicks in and truncates the pages
                  DropRelFileNodeBuffers - Dirty page NOT written, LP 1 
on disk still empty

          5) Checkpoint completes
          6) Crash
          7) smgrtruncate - Not reached (this is where we do the 
physical truncate)


  Now the crash-recovery starts

          Delete-log-replay (above step-3) reads page with empty LP 1 
and the delete fails with PANIC (old page on disk with no insert)


Doing recovery, truncate is even not reached, a WAL replay of the 
truncation will happen in the future but the recovery fails (repeatedly) 
even before reaching that point.


Hmm. I think simply reversing the order of DropRelFileNodeBuffers() and 
truncating the file would open a different issue:


  1) Page on disk has empty LP 1, Insert into page LP 1
  2) checkpoint START (Recovery REDO eventually starts here)
  3) Delete all rows on the page (page is empty now)
  4) Autovacuum kicks in and starts truncating
  5) smgrtruncate() truncates the file
  6) checkpoint writes out buffers for pages that were just truncated 
away, expanding the file again.


Your patch had a mechanism to mark the buffers as io-in-progress before 
truncating the file to fix that, but I'm wary of that approach. Firstly, 
it requires scanning the buffers that are dropped twice, which can take 
a long time. I remember that people have already complained that 
DropRelFileNodeBuffers() is slow, when it has to scan all the buffers 
once. More importantly, abusing the BM_IO_INPROGRESS flag for this seems 
bad. For starters, because you're not holding buffer's I/O lock, I 
believe the checkpointer would busy-wait on the buffers until the 
truncation has completed. See StartBufferIO() and AbortBufferIO().


Perhaps a better approach would be to prevent the checkpoint from 
completing, until all in-progress truncations have completed. We have a 
mechanism to wait out in-progress commits at the beginning of a 
checkpoint, right after the redo point has been established. See 
comments around the GetVirtualXIDsDelayingChkpt() function call in 
CreateCheckPoint(). We could have a similar mechanism to wait out the 
truncations before *completing* a checkpoint.
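
In shape, such a wait might look roughly like the self-contained sketch
below (the names and the polling loop are invented for illustration; a real
implementation would presumably reuse the GetVirtualXIDsDelayingChkpt()
machinery and sleep on a latch rather than poll):

#include <stdbool.h>
#include <unistd.h>

#define DELAY_CHKPT_COMPLETE  (1 << 1)

/* Toy stand-ins; in PostgreSQL these would be PGPROC entries in shared memory. */
#define FAKE_NBACKENDS 4
static int fake_delayChkpt[FAKE_NBACKENDS];

static bool
someone_delays_completion(void)
{
	for (int i = 0; i < FAKE_NBACKENDS; i++)
	{
		if (fake_delayChkpt[i] & DELAY_CHKPT_COMPLETE)
			return true;
	}
	return false;
}

/*
 * Before writing the checkpoint record, wait until no backend is still
 * inside a truncation that must not be split by a redo point.
 */
static void
wait_for_truncations_before_completing_checkpoint(void)
{
	while (someone_delays_completion())
		usleep(10 * 1000);	/* 10 ms; placeholder for a proper wait */
}

int
main(void)
{
	wait_for_truncations_before_completing_checkpoint();
	return 0;
}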


- Heikki




Re: Corruption during WAL replay

2020-06-12 Thread Masahiko Sawada
On Wed, 15 Apr 2020 at 04:04, Teja Mupparti  wrote:
>
> Thanks Kyotaro and Masahiko for the feedback. I think there is a consensus on 
> the critical-section around truncate, but I just want to emphasize the need 
> for reversing the order of the dropping the buffers and the truncation.
>
>  Repro details (when full page write = off)
>
>  1) Page on disk has empty LP 1, Insert into page LP 1
>  2) checkpoint START (Recovery REDO eventually starts here)
>  3) Delete all rows on the page (page is empty now)
>  4) Autovacuum kicks in and truncates the pages
>  DropRelFileNodeBuffers - Dirty page NOT written, LP 1 on 
> disk still empty
>  5) Checkpoint completes
>  6) Crash
>  7) smgrtruncate - Not reached (this is where we do the physical 
> truncate)
>
>  Now the crash-recovery starts
>
>  Delete-log-replay (above step-3) reads page with empty LP 1 and the 
> delete fails with PANIC (old page on disk with no insert)
>

I agree that when replaying the deletion of (3) the page LP 1 is
empty, but does that replay really fail with PANIC? I guess that we
record that page into invalid_page_tab but don't raise a PANIC in this
case.

Regards,

-- 
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Corruption during WAL replay

2020-04-14 Thread Teja Mupparti
Thanks Kyotaro and Masahiko for the feedback. I think there is a consensus on 
the critical-section around truncate, but I just want to emphasize the need for 
reversing the order of the dropping the buffers and the truncation.

 Repro details (when full page write = off)

 1) Page on disk has empty LP 1, Insert into page LP 1
 2) checkpoint START (Recovery REDO eventually starts here)
 3) Delete all rows on the page (page is empty now)
 4) Autovacuum kicks in and truncates the pages
 DropRelFileNodeBuffers - Dirty page NOT written, LP 1 on disk 
still empty
 5) Checkpoint completes
 6) Crash
 7) smgrtruncate - Not reached (this is where we do the physical 
truncate)

 Now the crash-recovery starts

 Delete-log-replay (above step-3) reads page with empty LP 1 and the 
delete fails with PANIC (old page on disk with no insert)

Doing recovery, truncate is even not reached, a WAL replay of the truncation 
will happen in the future but the recovery fails (repeatedly) even before 
reaching that point.

Best regards,
Teja


From: Kyotaro Horiguchi 
Sent: Monday, April 13, 2020 7:35 PM
To: masahiko.saw...@2ndquadrant.com 
Cc: and...@anarazel.de ; tejesw...@hotmail.com 
; pgsql-hack...@postgresql.org 
; hexexp...@comcast.net 
Subject: Re: Corruption during WAL replay

At Mon, 13 Apr 2020 18:53:26 +0900, Masahiko Sawada 
 wrote in
> On Mon, 13 Apr 2020 at 17:40, Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2020-04-13 15:24:55 +0900, Masahiko Sawada wrote:
> > > On Sat, 11 Apr 2020 at 09:00, Teja Mupparti  wrote:
> > > /*
> > >  * We WAL-log the truncation before actually truncating, which means
> > >  * trouble if the truncation fails. If we then crash, the WAL replay
> > >  * likely isn't going to succeed in the truncation either, and cause a
> > >  * PANIC. It's tempting to put a critical section here, but that cure
> > >  * would be worse than the disease. It would turn a usually harmless
> > >  * failure to truncate, that might spell trouble at WAL replay, into a
> > >  * certain PANIC.
> > >  */
> >
> > Yea, but that reasoning is just plain *wrong*. It's *never* ok to WAL
> > log something and then not perform the action. This leads to to primary
> > / replica getting out of sync, crash recovery potentially not completing
> > (because of records referencing the should-be-truncated pages), ...

It was introduced in 2008 by 3396000684, for 8.4.  So it can be regarded
as an oversight made when log-shipping was introduced.

The reason other operations like INSERTs (that extends the underlying
file) are "safe" after an extension failure is the following
operations are performed in shared buffers as if the new page exists,
then tries to extend the file again.  So if we continue working after
truncation failure, we need to disguise on shared buffers as if the
truncated pages are gone.  But we don't have a room for another flag
in buffer header.  For example, BM_DIRTY && !BM_VALID might be able to
be used as the state that the page should have been truncated but not
succeeded yet, but I'm not sure.

Anyway, I think the prognosis of a truncation failure is far more hopeless
than that of an extension failure in most cases, and I doubt that it's good
to introduce such a complex feature only to overcome such a hopeless
situation.

In short, I think we should PANIC in that case.

> > > As a second idea, I wonder if we can defer truncation until commit
> > > time like smgrDoPendingDeletes mechanism. The sequence would be:
> >
> > This is mostly an issue during [auto]vacuum partially truncating the end
> > of the file. We intentionally release the AEL regularly to allow other
> > accesses to continue.
> >
> > For transactional truncations we don't go down this path (as we create a
> > new relfilenode).
> >
> >
> > > At RelationTruncate(),
> > > 1. WAL logging.
> > > 2. Remember buffers to be dropped.
> >
> > You definitely cannot do that, as explained above.
>
> Ah yes, you're right.
>
> So it seems to me currently what we can do for this issue would be to
> enclose the truncation operation in a critical section. IIUC it's not
> enough just to reverse the order of dropping buffers and physical file
> truncation because it cannot solve the problem of inconsistency on the
> standby. And as Horiguchi-san mentioned, there is no need to reverse
> that order if we envelop the truncation operation by a critical
> section because we can recover page changes during crash recovery. The
> strategy of writing out all dirty buffers before dropping buffers,
> proposed as (a) in [1], also seems not enough.

Agreed.  Since it's not acceptable

Re: Corruption during WAL replay

2020-04-13 Thread Kyotaro Horiguchi
At Mon, 13 Apr 2020 18:53:26 +0900, Masahiko Sawada 
 wrote in 
> On Mon, 13 Apr 2020 at 17:40, Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2020-04-13 15:24:55 +0900, Masahiko Sawada wrote:
> > > On Sat, 11 Apr 2020 at 09:00, Teja Mupparti  wrote:
> > > /*
> > >  * We WAL-log the truncation before actually truncating, which means
> > >  * trouble if the truncation fails. If we then crash, the WAL replay
> > >  * likely isn't going to succeed in the truncation either, and cause a
> > >  * PANIC. It's tempting to put a critical section here, but that cure
> > >  * would be worse than the disease. It would turn a usually harmless
> > >  * failure to truncate, that might spell trouble at WAL replay, into a
> > >  * certain PANIC.
> > >  */
> >
> > Yea, but that reasoning is just plain *wrong*. It's *never* ok to WAL
> > log something and then not perform the action. This leads to to primary
> > / replica getting out of sync, crash recovery potentially not completing
> > (because of records referencing the should-be-truncated pages), ...

It was introduced in 2008 by 3396000684, for 8.4.  So it can be regarded
as an oversight made when log-shipping was introduced.

The reason other operations like INSERTs (that extends the underlying
file) are "safe" after an extension failure is the following
operations are performed in shared buffers as if the new page exists,
then tries to extend the file again.  So if we continue working after
truncation failure, we need to disguise on shared buffers as if the
truncated pages are gone.  But we don't have a room for another flag
in buffer header.  For example, BM_DIRTY && !BM_VALID might be able to
be used as the state that the page should have been truncated but not
succeeded yet, but I'm not sure.

Anyway, I think the prognosis of a truncation failure is far more hopeless
than that of an extension failure in most cases, and I doubt that it's good
to introduce such a complex feature only to overcome such a hopeless
situation.

In short, I think we should PANIC in that case.

> > > As a second idea, I wonder if we can defer truncation until commit
> > > time like smgrDoPendingDeletes mechanism. The sequence would be:
> >
> > This is mostly an issue during [auto]vacuum partially truncating the end
> > of the file. We intentionally release the AEL regularly to allow other
> > accesses to continue.
> >
> > For transactional truncations we don't go down this path (as we create a
> > new relfilenode).
> >
> >
> > > At RelationTruncate(),
> > > 1. WAL logging.
> > > 2. Remember buffers to be dropped.
> >
> > You definitely cannot do that, as explained above.
> 
> Ah yes, you're right.
> 
> So it seems to me currently what we can do for this issue would be to
> enclose the truncation operation in a critical section. IIUC it's not
> enough just to reverse the order of dropping buffers and physical file
> truncation because it cannot solve the problem of inconsistency on the
> standby. And as Horiguchi-san mentioned, there is no need to reverse
> that order if we envelop the truncation operation by a critical
> section because we can recover page changes during crash recovery. The
> strategy of writing out all dirty buffers before dropping buffers,
> proposed as (a) in [1], also seems not enough.

Agreed.  Since neither WAL-logging->not-performing nor
performing->WAL-logging is acceptable, there's no other way than to work
as if the truncation succeeded (and try again) even if it actually
failed. But that would be too complex.

Just making it a critical section seems the right thing here.
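
In shape, that proposal is roughly the following (a toy, self-contained
sketch; the step comments and the must_succeed() helper are invented, and
the real macros are START_CRIT_SECTION()/END_CRIT_SECTION(), which promote
any ERROR raised inside to a PANIC):

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

static bool crit_section = false;

static void
must_succeed(bool ok, const char *step)
{
	if (ok)
		return;
	if (crit_section)
	{
		fprintf(stderr, "PANIC: %s failed; forcing crash recovery\n", step);
		abort();
	}
	fprintf(stderr, "ERROR: %s failed; transaction rolls back\n", step);
}

/*
 * WAL insert, buffer drop and file truncate all sit inside one critical
 * section, so a truncation failure can no longer leave the WAL claiming a
 * truncation that never happened on disk.
 */
static void
truncate_relation_sketch(void)
{
	crit_section = true;

	must_succeed(true, "WAL-log truncation");
	must_succeed(true, "drop to-be-truncated buffers");
	must_succeed(true, "physically truncate file");	/* failure => PANIC */

	crit_section = false;
}

int
main(void)
{
	truncate_relation_sketch();
	return 0;
}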


> [1] 
> https://www.postgresql.org/message-id/20191207001232.klidxnm756wqxvwx%40alap3.anarazel.de
> Doing sync before truncation

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Corruption during WAL replay

2020-04-13 Thread Masahiko Sawada
On Mon, 13 Apr 2020 at 17:40, Andres Freund  wrote:
>
> Hi,
>
> On 2020-04-13 15:24:55 +0900, Masahiko Sawada wrote:
> > On Sat, 11 Apr 2020 at 09:00, Teja Mupparti  wrote:
> > >
> > > Thanks Andres and Kyotaro for the quick review.  I have fixed the typos 
> > > and also included the critical section (emulated it with try-catch block 
> > > since palloc()s are causing issues in the truncate code). This time I 
> > > used git format-patch.
> > >
> >
> > I briefly looked at the latest patch but I'm not sure it's the right
> > thing here to use PG_TRY/PG_CATCH to report the PANIC error. For
> > example, with the following code you changed, we will always end up
> > with emitting a PANIC "failed to truncate the relation" regardless of
> > the actual cause of the error.
> >
> > +   PG_CATCH();
> > +   {
> > +   ereport(PANIC, (errcode(ERRCODE_INTERNAL_ERROR),
> > +   errmsg("failed to truncate the relation")));
> > +   }
> > +   PG_END_TRY();
> >
> > And the comments of RelationTruncate() mentions:
>
> I think that's just a workaround for mdtruncate not being usable in
> critical sections.
>
>
> > /*
> >  * We WAL-log the truncation before actually truncating, which means
> >  * trouble if the truncation fails. If we then crash, the WAL replay
> >  * likely isn't going to succeed in the truncation either, and cause a
> >  * PANIC. It's tempting to put a critical section here, but that cure
> >  * would be worse than the disease. It would turn a usually harmless
> >  * failure to truncate, that might spell trouble at WAL replay, into a
> >  * certain PANIC.
> >  */
>
> Yea, but that reasoning is just plain *wrong*. It's *never* ok to WAL
> log something and then not perform the action. This leads to to primary
> / replica getting out of sync, crash recovery potentially not completing
> (because of records referencing the should-be-truncated pages), ...
>
>
> > As a second idea, I wonder if we can defer truncation until commit
> > time like smgrDoPendingDeletes mechanism. The sequence would be:
>
> This is mostly an issue during [auto]vacuum partially truncating the end
> of the file. We intentionally release the AEL regularly to allow other
> accesses to continue.
>
> For transactional truncations we don't go down this path (as we create a
> new relfilenode).
>
>
> > At RelationTruncate(),
> > 1. WAL logging.
> > 2. Remember buffers to be dropped.
>
> You definitely cannot do that, as explained above.

Ah yes, you're right.

So it seems to me currently what we can do for this issue would be to
enclose the truncation operation in a critical section. IIUC it's not
enough just to reverse the order of dropping buffers and physical file
truncation because it cannot solve the problem of inconsistency on the
standby. And as Horiguchi-san mentioned, there is no need to reverse
that order if we envelop the truncation operation by a critical
section because we can recover page changes during crash recovery. The
strategy of writing out all dirty buffers before dropping buffers,
proposed as (a) in [1], also seems not enough.

Regards,

[1] https://www.postgresql.org/message-id/20191207001232.klidxnm756wqxvwx%40alap3.anarazel.de
Doing sync before truncation


--
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Corruption during WAL replay

2020-04-13 Thread Andres Freund
Hi,

On 2020-04-13 15:24:55 +0900, Masahiko Sawada wrote:
> On Sat, 11 Apr 2020 at 09:00, Teja Mupparti  wrote:
> >
> > Thanks Andres and Kyotaro for the quick review.  I have fixed the typos and 
> > also included the critical section (emulated it with try-catch block since 
> > palloc()s are causing issues in the truncate code). This time I used git 
> > format-patch.
> >
> 
> I briefly looked at the latest patch but I'm not sure it's the right
> thing here to use PG_TRY/PG_CATCH to report the PANIC error. For
> example, with the following code you changed, we will always end up
> with emitting a PANIC "failed to truncate the relation" regardless of
> the actual cause of the error.
> 
> +   PG_CATCH();
> +   {
> +   ereport(PANIC, (errcode(ERRCODE_INTERNAL_ERROR),
> +   errmsg("failed to truncate the relation")));
> +   }
> +   PG_END_TRY();
> 
> And the comments of RelationTruncate() mentions:

I think that's just a workaround for mdtruncate not being usable in
critical sections.


> /*
>  * We WAL-log the truncation before actually truncating, which means
>  * trouble if the truncation fails. If we then crash, the WAL replay
>  * likely isn't going to succeed in the truncation either, and cause a
>  * PANIC. It's tempting to put a critical section here, but that cure
>  * would be worse than the disease. It would turn a usually harmless
>  * failure to truncate, that might spell trouble at WAL replay, into a
>  * certain PANIC.
>  */

Yea, but that reasoning is just plain *wrong*. It's *never* ok to WAL
log something and then not perform the action. This leads to to primary
/ replica getting out of sync, crash recovery potentially not completing
(because of records referencing the should-be-truncated pages), ...


> As a second idea, I wonder if we can defer truncation until commit
> time like smgrDoPendingDeletes mechanism. The sequence would be:

This is mostly an issue during [auto]vacuum partially truncating the end
of the file. We intentionally release the AEL regularly to allow other
accesses to continue.

For transactional truncations we don't go down this path (as we create a
new relfilenode).


> At RelationTruncate(),
> 1. WAL logging.
> 2. Remember buffers to be dropped.

You definitely cannot do that, as explained above.


> At CommitTransaction(),
> 3. Revisit the remembered buffers to check if the buffer still has
> table data that needs to be truncated.
> 4-a, If it has, we mark it as IO_IN_PROGRESS.
> 4-b, If it already has different table data, ignore it.
> 5, Truncate physical files.
> 6, Mark the buffer we marked at #4-a as invalid.
> 
> If an error occurs between #3 and #6 or in abort case, we revert all
> IO_IN_PROGRESS flags on the buffers.

What would this help with? If we still need the more complicated
truncation sequence?

Greetings,

Andres Freund




Re: Corruption during WAL replay

2020-04-13 Thread Masahiko Sawada
On Sat, 11 Apr 2020 at 09:00, Teja Mupparti  wrote:
>
> Thanks Andres and Kyotaro for the quick review.  I have fixed the typos and 
> also included the critical section (emulated it with try-catch block since 
> palloc()s are causing issues in the truncate code). This time I used git 
> format-patch.
>

I briefly looked at the latest patch but I'm not sure it's the right
thing here to use PG_TRY/PG_CATCH to report the PANIC error. For
example, with the following code you changed, we will always end up
with emitting a PANIC "failed to truncate the relation" regardless of
the actual cause of the error.

+   PG_CATCH();
+   {
+   ereport(PANIC, (errcode(ERRCODE_INTERNAL_ERROR),
+   errmsg("failed to truncate the relation")));
+   }
+   PG_END_TRY();

And the comments of RelationTruncate() mentions:

/*
 * We WAL-log the truncation before actually truncating, which means
 * trouble if the truncation fails. If we then crash, the WAL replay
 * likely isn't going to succeed in the truncation either, and cause a
 * PANIC. It's tempting to put a critical section here, but that cure
 * would be worse than the disease. It would turn a usually harmless
 * failure to truncate, that might spell trouble at WAL replay, into a
 * certain PANIC.
 */

As a second idea, I wonder if we can defer truncation until commit
time like smgrDoPendingDeletes mechanism. The sequence would be:

At RelationTruncate(),
1. WAL logging.
2. Remember buffers to be dropped.

At CommitTransaction(),
3. Revisit the remembered buffers to check if the buffer still has
table data that needs to be truncated.
4-a, If it has, we mark it as IO_IN_PROGRESS.
4-b, If it already has different table data, ignore it.
5, Truncate physical files.
6, Mark the buffer we marked at #4-a as invalid.

If an error occurs between #3 and #6 or in abort case, we revert all
IO_IN_PROGRESS flags on the buffers.

In the above idea, remembering all buffers having to-be-truncated
table at RelationTruncate(), we reduce the time for checking buffers
at the commit time. Since we acquire AccessExclusiveLock the number of
buffers having to-be-truncated table's data never increases. A
downside would be that since we can truncate multiple relations we
need to remember all buffers of each truncated relations, which is up
to (sizeof(int) * NBuffers) in total.

Regards,

--
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Corruption during WAL replay

2020-04-10 Thread Alvaro Herrera
On 2020-Mar-30, Andres Freund wrote:

> If we are really concerned with truncation failing - I don't know why we
> would be, we accept that we have to be able to modify files etc to stay
> up - we can add a pre-check ensuring that permissions are set up
> appropriately to allow us to truncate.

I remember I saw a case where the datadir was NFS or some other network
filesystem thingy, and it lost connection just before autovacuum
truncation, or something like that -- so there was no permission
failure, but the truncate failed and yet PG soldiered on.  I think the
connection was re-established soon thereafter and things went back to
"normal", with nobody realizing that a truncate had been lost.
Corruption was discovered a long time afterwards IIRC (weeks or months,
I don't remember).

I didn't review Teja's patch carefully, but the idea of panicking on
failure (causing WAL replay) seems better than the current behavior.
I'd rather put the server to wait until storage is really back.

-- 
Álvaro Herrera
https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Corruption during WAL replay

2020-04-10 Thread Teja Mupparti
Thanks Andres and Kyotaro for the quick review.  I have fixed the typos and 
also included the critical section (emulated it with try-catch block since 
palloc()s are causing issues in the truncate code). This time I used git 
format-patch.

Regards
Teja




From: Andres Freund 
Sent: Monday, March 30, 2020 4:31 PM
To: Kyotaro Horiguchi 
Cc: tejesw...@hotmail.com ; pgsql-hack...@postgresql.org 
; hexexp...@comcast.net 
Subject: Re: Corruption during WAL replay

Hi,

On 2020-03-24 18:18:12 +0900, Kyotaro Horiguchi wrote:
> At Mon, 23 Mar 2020 20:56:59 +, Teja Mupparti  
> wrote in
> > The original bug reporting-email and the relevant discussion is here
> ...
> > The crux of the fix is, in the current code, engine drops the buffer and 
> > then truncates the file, but a crash before the truncate and after the 
> > buffer-drop is causing the corruption. Patch reverses the order i.e. 
> > truncate the file and drop the buffer later.
>
> BufferAlloc doesn't wait for the BM_IO_IN_PROGRESS for a valid buffer.

I don't think that's true. For any of this to be relevant the buffer has
to be dirty. In which case BufferAlloc() has to call
FlushBuffer(). Which in turn does a WaitIO() if BM_IO_IN_PROGRESS is
set.

What path are you thinking of? Or alternatively, what am I missing?


> I'm not sure it's acceptable to remember all to-be-deleted buffers
> while truncation.

I don't see a real problem with it. Nor really a good alternative. Note
that for autovacuum truncations we'll only truncate a limited number of
buffers at once, and for most relation truncations we don't enter this
path (since we create a new relfilenode instead).


>
> +  /*START_CRIT_SECTION();*/

> Is this a point of argument?  It is not needed if we choose the
> strategy (c) in [1], since the protocol is aiming to allow server to
> continue running after truncation failure.
>
> [1]: 
> https://www.postgresql.org/message-id/20191207001232.klidxnm756wqxvwx%40alap3.anarazel.de

I think it's entirely broken to continue running after a truncation
failure. We obviously have to first WAL log the truncation (since
otherwise we can crash just after doing the truncation). But we cannot
just continue running after WAL logging, but not performing the
associated action: The most obvious reason is that otherwise a replica
will execute the trunction, but the primary will not.

The whole justification for that behaviour "It would turn a usually
harmless failure to truncate, that might spell trouble at WAL replay,
into a certain PANIC." was always dubious (since on-disk and in-memory
state now can diverge), but it's clearly wrong once replication had
entered the picture. There's just no alternative to a critical section
here.


If we are really concerned with truncation failing - I don't know why we
would be, we accept that we have to be able to modify files etc to stay
up - we can add a pre-check ensuring that permissions are set up
appropriately to allow us to truncate.

Greetings,

Andres Freund


0001-Wal-replay-corruption.patch
Description: 0001-Wal-replay-corruption.patch


Re: Corruption during WAL replay

2020-03-31 Thread Kyotaro Horiguchi
At Mon, 30 Mar 2020 16:31:59 -0700, Andres Freund  wrote in 
> Hi,
> 
> On 2020-03-24 18:18:12 +0900, Kyotaro Horiguchi wrote:
> > At Mon, 23 Mar 2020 20:56:59 +, Teja Mupparti  
> > wrote in 
> > > The original bug reporting-email and the relevant discussion is here
> > ...
> > > The crux of the fix is, in the current code, engine drops the buffer and 
> > > then truncates the file, but a crash before the truncate and after the 
> > > buffer-drop is causing the corruption. Patch reverses the order i.e. 
> > > truncate the file and drop the buffer later.
> > 
> > BufferAlloc doesn't wait for the BM_IO_IN_PROGRESS for a valid buffer.
> 
> I don't think that's true. For any of this to be relevant the buffer has
> to be dirty. In which case BufferAlloc() has to call
> FlushBuffer(). Which in turn does a WaitIO() if BM_IO_IN_PROGRESS is
> set.
> 
> What path are you thinking of? Or alternatively, what am I missing?

# I may be wrong here, though I think the odds are low..

"doesn't" was overstated. Is there a case where the buffer has already been
flushed by a checkpoint? (If that is the case, would dropping the clean
buffers at the point where we mark the truncation work?)

> > I'm not sure it's acceptable to remember all to-be-deleted buffers
> > while truncation.
> 
> I don't see a real problem with it. Nor really a good alternative. Note
> that for autovacuum truncations we'll only truncate a limited number of
> buffers at once, and for most relation truncations we don't enter this
> path (since we create a new relfilenode instead).

Thank you for the opinion. I agree to that.

> > +/*START_CRIT_SECTION();*/
> 
> > Is this a point of argument?  It is not needed if we choose the
> > strategy (c) in [1], since the protocol is aiming to allow server to
> > continue running after truncation failure.
> > 
> > [1]: 
> > https://www.postgresql.org/message-id/20191207001232.klidxnm756wqxvwx%40alap3.anarazel.de
> 
> I think it's entirely broken to continue running after a truncation
> failure. We obviously have to first WAL log the truncation (since
> otherwise we can crash just after doing the truncation). But we cannot
> just continue running after WAL logging, but not performing the
> associated action: The most obvious reason is that otherwise a replica
> will execute the truncation, but the primary will not.

Hmm. If we allow PANIC on truncation failure, why do we need to go through
the complicated steps?  Wouldn't it be enough to enclose the sequence
(WAL insert - drop buffers - truncate) in a critical section? I thought
this project aimed to fix the database breakage on truncation failure by
allowing rollback when the truncation fails?
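
To be concrete, the shape I have in mind is something like the following
self-contained toy. Every name here is a stand-in (this is not the real
xlog/bufmgr/smgr API); the only point is the ordering, and that any failure
after the WAL insert escalates to PANIC:

#include <stdio.h>
#include <stdlib.h>

static int in_critical_section = 0;

static void start_crit_section(void) { in_critical_section = 1; }
static void end_crit_section(void)   { in_critical_section = 0; }

/* An error raised inside the critical section becomes a PANIC. */
static void
toy_error(const char *msg)
{
    if (in_critical_section)
    {
        fprintf(stderr, "PANIC: %s\n", msg);
        abort();
    }
    fprintf(stderr, "ERROR: %s\n", msg);
}

/* Stand-ins for the real steps; they only print what they would do. */
static void wal_log_truncate(long nblocks) { printf("WAL: truncate to %ld blocks\n", nblocks); }
static void drop_buffers(long nblocks)     { printf("drop buffers at block >= %ld\n", nblocks); }
static int  truncate_file(long nblocks)    { (void) nblocks; return 0; /* pretend success */ }

static void
toy_truncate_rel(long nblocks)
{
    start_crit_section();
    wal_log_truncate(nblocks);          /* 1. WAL insert             */
    drop_buffers(nblocks);              /* 2. drop affected buffers  */
    if (truncate_file(nblocks) != 0)    /* 3. shrink the file(s)     */
        toy_error("could not truncate file");
    end_crit_section();
}

int
main(void)
{
    toy_truncate_rel(128);
    return 0;
}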

> The whole justification for that behaviour "It would turn a usually
> harmless failure to truncate, that might spell trouble at WAL replay,
> into a certain PANIC." was always dubious (since on-disk and in-memory
> state now can diverge), but it's clearly wrong once replication had
> entered the picture. There's just no alternative to a critical section
> here.

Yeah, I like that direction.

> If we are really concerned with truncation failing - I don't know why we
> would be, we accept that we have to be able to modify files etc to stay
> up - we can add a pre-check ensuring that permissions are set up
> appropriately to allow us to truncate.

I think the question above is the core part of the problem.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Corruption during WAL replay

2020-03-30 Thread Andres Freund
Hi,

On 2020-03-24 18:18:12 +0900, Kyotaro Horiguchi wrote:
> At Mon, 23 Mar 2020 20:56:59 +, Teja Mupparti  
> wrote in 
> > The original bug reporting-email and the relevant discussion is here
> ...
> > The crux of the fix is, in the current code, engine drops the buffer and 
> > then truncates the file, but a crash before the truncate and after the 
> > buffer-drop is causing the corruption. Patch reverses the order i.e. 
> > truncate the file and drop the buffer later.
> 
> BufferAlloc doesn't wait for the BM_IO_IN_PROGRESS for a valid buffer.

I don't think that's true. For any of this to be relevant the buffer has
to be dirty. In which case BufferAlloc() has to call
FlushBuffer(). Which in turn does a WaitIO() if BM_IO_IN_PROGRESS is
set.

What path are you thinking of? Or alternatively, what am I missing?
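
To spell out the path I mean, here is a toy, self-contained model (this is
not the actual bufmgr.c code; only the BM_IO_IN_PROGRESS flag name is
borrowed, and the types and locking are simplified for illustration):

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define BM_DIRTY           (1 << 0)
#define BM_IO_IN_PROGRESS  (1 << 1)

typedef struct ToyBuffer
{
    uint32_t        flags;
    pthread_mutex_t lock;
    pthread_cond_t  io_done;    /* signalled when I/O finishes */
} ToyBuffer;

static ToyBuffer shared_buf = {
    BM_DIRTY, PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER
};

/* Counterpart of WaitIO(): sleep until BM_IO_IN_PROGRESS clears. */
static void
toy_wait_io(ToyBuffer *buf)
{
    pthread_mutex_lock(&buf->lock);
    while (buf->flags & BM_IO_IN_PROGRESS)
        pthread_cond_wait(&buf->io_done, &buf->lock);
    pthread_mutex_unlock(&buf->lock);
}

/* Counterpart of FlushBuffer() as reached from buffer eviction. */
static void
toy_flush_buffer(ToyBuffer *buf)
{
    toy_wait_io(buf);               /* wait out any in-progress I/O first */
    pthread_mutex_lock(&buf->lock);
    if (buf->flags & BM_DIRTY)
        buf->flags &= ~BM_DIRTY;    /* "write" the page, clear the dirty bit */
    pthread_mutex_unlock(&buf->lock);
}

int
main(void)
{
    /* no I/O is in progress in this toy, so the flush proceeds at once */
    toy_flush_buffer(&shared_buf);
    printf("flags after flush: 0x%x\n", (unsigned) shared_buf.flags);
    return 0;
}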


> I'm not sure it's acceptable to remember all to-be-deleted buffers
> while truncation.

I don't see a real problem with it. Nor really a good alternative. Note
that for autovacuum truncations we'll only truncate a limited number of
buffers at once, and for most relation truncations we don't enter this
path (since we create a new relfilenode instead).


> 
> +  /*START_CRIT_SECTION();*/

> Is this a point of argument?  It is not needed if we choose the
> strategy (c) in [1], since the protocol is aiming to allow server to
> continue running after truncation failure.
> 
> [1]: 
> https://www.postgresql.org/message-id/20191207001232.klidxnm756wqxvwx%40alap3.anarazel.de

I think it's entirely broken to continue running after a truncation
failure. We obviously have to first WAL log the truncation (since
otherwise we can crash just after doing the truncation). But we cannot
just continue running after WAL logging, but not performing the
associated action: The most obvious reason is that otherwise a replica
will execute the truncation, but the primary will not.

The whole justification for that behaviour "It would turn a usually
harmless failure to truncate, that might spell trouble at WAL replay,
into a certain PANIC." was always dubious (since on-disk and in-memory
state now can diverge), but it's clearly wrong once replication had
entered the picture. There's just no alternative to a critical section
here.


If we are really concerned with truncation failing - I don't know why we
would be, we accept that we have to be able to modify files etc to stay
up - we can add a pre-check ensuring that permissions are set up
appropriately to allow us to truncate.
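
Such a pre-check could be as simple as the sketch below (mine, not anything
in the tree): try to open the segment read-write before entering the
critical section, so a permission problem surfaces as an ordinary ERROR
rather than a PANIC once we are committed.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Returns 0 if the file can be opened read-write, -1 (errno set) if not. */
static int
precheck_truncatable(const char *path)
{
    int fd = open(path, O_RDWR);

    if (fd < 0)
        return -1;
    close(fd);
    return 0;
}

int
main(void)
{
    const char *path = "base/12345/67890";  /* hypothetical segment file */

    if (precheck_truncatable(path) != 0)
        perror(path);
    else
        printf("%s can be opened read-write\n", path);
    return 0;
}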

Greetings,

Andres Freund




Re: Corruption during WAL replay

2020-03-24 Thread Kyotaro Horiguchi
Thanks for working on this.

At Mon, 23 Mar 2020 20:56:59 +, Teja Mupparti  wrote 
in 
> This is my *first* attempt to submit a Postgres patch, please let me know if 
> I missed any process or format of the patch 

Welcome! The format looks fine to me. It would be better if it had a
commit message that explains what the patch does. (in the format that
git format-patch emits.)

> The original bug reporting-email and the relevant discussion is here
...
> The crux of the fix is, in the current code, engine drops the buffer and then 
> truncates the file, but a crash before the truncate and after the buffer-drop 
> is causing the corruption. Patch reverses the order i.e. truncate the file 
> and drop the buffer later.

BufferAlloc doesn't wait for the BM_IO_IN_PROGRESS for a valid buffer.

I'm not sure it's acceptable to remember all to-be-deleted buffers
while truncation.

+/*START_CRIT_SECTION();*/

Is this a point of argument?  It is not needed if we choose the
strategy (c) in [1], since the protocol is aiming to allow server to
continue running after truncation failure.

[1]: 
https://www.postgresql.org/message-id/20191207001232.klidxnm756wqxvwx%40alap3.anarazel.de

However, note that md truncates a "file" in a non-atomic way. mdtruncate
truncates multiple segment files, from the last segment toward the
beginning. If mdtruncate successfully truncates the first several
segments it touches and then fails, retaining all the buffers triggers
an assertion failure in mdwrite during buffer flush.
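
As a rough, self-contained illustration of that non-atomicity (the constant
and the names are mine, not md.c's): the relation is a series of fixed-size
segment files, and shrinking it means walking from the last segment
backwards, so a failure part way through leaves a mix of truncated and
untouched segments.

#include <stdio.h>

#define TOY_BLOCKS_PER_SEG 131072L      /* e.g. 1 GB of 8 kB blocks */

static void
show_truncate_plan(long old_nblocks, long new_nblocks)
{
    long last_seg = (old_nblocks - 1) / TOY_BLOCKS_PER_SEG;
    long keep_seg = new_nblocks / TOY_BLOCKS_PER_SEG;
    long seg;

    /* work backwards: remove whole trailing segments first */
    for (seg = last_seg; seg > keep_seg; seg--)
        printf("segment %ld: unlink (or truncate to 0 blocks)\n", seg);
    /* then shorten the segment that keeps the new last block */
    printf("segment %ld: truncate to %ld blocks\n",
           keep_seg, new_nblocks - keep_seg * TOY_BLOCKS_PER_SEG);
}

int
main(void)
{
    /* shrink a relation from ~3.5 segments down to ~1.5 segments */
    show_truncate_plan(460000, 200000);
    return 0;
}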


Some typos found:

+* a backround task might flush them to the disk right after we

s/backround/background/

+ * saved list of buffers that were marked as BM_IO_IN_PRGRESS just
s/BM_IO_IN_PRGRESS/BM_IO_IN_PROGRESS/

+ * as BM_IO_IN_PROGRES. Though the buffers are marked for IO, they
s/BM_IO_IN_PROGRES/BM_IO_IN_PROGRESS/

+ * being dicarded).
s/dicarded/discarded/

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Corruption during WAL replay

2020-03-23 Thread Teja Mupparti
This is my *first* attempt to submit a Postgres patch, so please let me know
if I missed anything in the process or the format of the patch (I used
https://wiki.postgresql.org/wiki/Working_with_Git as a reference).



The original bug reporting-email and the relevant discussion is here



https://www.postgresql.org/message-id/20191207001232.klidxnm756wqxvwx%40alap3.anarazel.de

https://www.postgresql.org/message-id/822113470.250068.1573246011818%40connect.xfinity.com

https://www.postgresql.org/message-id/20191206230640.2dvdjpcgn46q3ks2%40alap3.anarazel.de

https://www.postgresql.org/message-id/1880.1281020...@sss.pgh.pa.us



The crux of the fix is this: in the current code, the engine drops the buffers
and then truncates the file, but a crash after the buffer-drop and before the
truncate causes the corruption. The patch reverses the order, i.e. it truncates
the file first and drops the buffers later.
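
To show the two orderings side by side, here is a tiny self-contained toy
(the function names are mine, just placeholders for the real buffer-drop
and file-truncate steps):

#include <stdio.h>

static void drop_buffers(void)  { printf("drop in-memory buffers\n"); }
static void truncate_file(void) { printf("truncate file on disk\n"); }

static void
current_order(void)
{
    drop_buffers();
    /* crash window: buffers already dropped, file not yet truncated */
    truncate_file();
}

static void
patched_order(void)
{
    truncate_file();
    /* with the patch, buffers are dropped only after the truncate */
    drop_buffers();
}

int
main(void)
{
    puts("-- current --");
    current_order();
    puts("-- patched --");
    patched_order();
    return 0;
}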



Warm regards,

Teja



bug-fix-patch
Description: bug-fix-patch