Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-13 Thread k...@rice.edu
On Sun, May 12, 2013 at 03:46:00PM -0500, Jim Nasby wrote:
 On 5/10/13 1:06 PM, Jeff Janes wrote:
 Of course the paranoid DBA could turn off restart_after_crash and do a 
 manual investigation on every crash, but in that case the database would 
 refuse to restart even in the case where it perfectly clear that all the 
 following WAL belongs to the recycled file and not the current file.
 
 Perhaps we should also allow for zeroing out WAL files before reuse (or just 
 disable reuse). I know there's a performance hit there, but the reuse idea 
 happened before we had bgWriter. Theoretically the overhead creating a new 
 file would always fall to bgWriter and therefore not be a big deal.
 -- 
 Jim C. Nasby, Data Architect   j...@nasby.net
 512.569.9461 (cell) http://jim.nasby.net
 

Unless something has changed dramatically, creating new files is a LOT more
overhead than reusing existing files. My two cents.

Regards,
Ken




Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-13 Thread k...@rice.edu
On Sun, May 12, 2013 at 07:41:26PM -0500, Jon Nelson wrote:
 On Sun, May 12, 2013 at 3:46 PM, Jim Nasby j...@nasby.net wrote:
  On 5/10/13 1:06 PM, Jeff Janes wrote:
 
  Of course the paranoid DBA could turn off restart_after_crash and do a
  manual investigation on every crash, but in that case the database would
  refuse to restart even in the case where it perfectly clear that all the
  following WAL belongs to the recycled file and not the current file.
 
 
  Perhaps we should also allow for zeroing out WAL files before reuse (or just
  disable reuse). I know there's a performance hit there, but the reuse idea
  happened before we had bgWriter. Theoretically the overhead creating a new
  file would always fall to bgWriter and therefore not be a big deal.
 
 For filesystems like btrfs, re-using a WAL file is suboptimal to
 simply creating a new one and removing the old one when it's no longer
 required. Using fallocate (or posix_fallocate) (I have a patch for
 that!) to create a new one is - by my tests - 28 times faster than the
 currently-used method.
 
 
 --
 Jon
 

What about for less cutting (bleeding) edge filesystems?

Ken




Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-13 Thread Jon Nelson
On Mon, May 13, 2013 at 7:49 AM, k...@rice.edu k...@rice.edu wrote:
 On Sun, May 12, 2013 at 07:41:26PM -0500, Jon Nelson wrote:
 On Sun, May 12, 2013 at 3:46 PM, Jim Nasby j...@nasby.net wrote:
  On 5/10/13 1:06 PM, Jeff Janes wrote:
 
  Of course the paranoid DBA could turn off restart_after_crash and do a
  manual investigation on every crash, but in that case the database would
  refuse to restart even in the case where it perfectly clear that all the
  following WAL belongs to the recycled file and not the current file.
 
 
  Perhaps we should also allow for zeroing out WAL files before reuse (or 
  just
  disable reuse). I know there's a performance hit there, but the reuse idea
  happened before we had bgWriter. Theoretically the overhead creating a new
  file would always fall to bgWriter and therefore not be a big deal.

 For filesystems like btrfs, re-using a WAL file is suboptimal to
 simply creating a new one and removing the old one when it's no longer
 required. Using fallocate (or posix_fallocate) (I have a patch for
 that!) to create a new one is - by my tests - 28 times faster than the
 currently-used method.


 What about for less cutting (bleeding) edge filesystems?

The patch would be applicable to any filesystem that implements the
fallocate/posix_fallocate calls in an efficient fashion. xfs and ext4
would both work, if I recall correctly. I'm certain there are others,
and the technique would probably work on other operating systems like
the *BSDs, etc. Additionally, it's improbable that there would be a
performance hit for other filesystems versus simply writing zeroes,
since that's the approach that is taken anyway as a fallback. Another
win is a reduction in fragmentation. I would love to be able to turn off
WAL recycling to perform more useful testing.
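
For what it's worth, a minimal sketch of that approach (illustrative only, not
the actual patch; the HAVE_POSIX_FALLOCATE configure guard and the helper name
are assumptions):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define WAL_SEGMENT_SIZE (16 * 1024 * 1024)

/* Create a new 16MB WAL segment, preferring posix_fallocate() when available. */
static int
create_wal_segment(const char *path)
{
    int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);

    if (fd < 0)
        return -1;

#ifdef HAVE_POSIX_FALLOCATE
    /* One cheap allocation call on filesystems such as ext4, xfs, btrfs. */
    if (posix_fallocate(fd, 0, WAL_SEGMENT_SIZE) != 0)
    {
        close(fd);
        return -1;
    }
#else
    /* Fallback: fill the file with zeroes, as the server does today. */
    char        zeroes[8192];
    off_t       written;

    memset(zeroes, 0, sizeof(zeroes));
    for (written = 0; written < WAL_SEGMENT_SIZE; written += sizeof(zeroes))
    {
        if (write(fd, zeroes, sizeof(zeroes)) != (ssize_t) sizeof(zeroes))
        {
            close(fd);
            return -1;
        }
    }
#endif

    /* Persist the allocation so later WAL flushes can rely on the file size. */
    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    return fd;
}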

--
Jon




Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-13 Thread Andres Freund
On 2013-05-12 19:41:26 -0500, Jon Nelson wrote:
 On Sun, May 12, 2013 at 3:46 PM, Jim Nasby j...@nasby.net wrote:
  On 5/10/13 1:06 PM, Jeff Janes wrote:
 
  Of course the paranoid DBA could turn off restart_after_crash and do a
  manual investigation on every crash, but in that case the database would
  refuse to restart even in the case where it perfectly clear that all the
  following WAL belongs to the recycled file and not the current file.
 
 
  Perhaps we should also allow for zeroing out WAL files before reuse (or just
  disable reuse). I know there's a performance hit there, but the reuse idea
  happened before we had bgWriter. Theoretically the overhead creating a new
  file would always fall to bgWriter and therefore not be a big deal.
 
 For filesystems like btrfs, re-using a WAL file is suboptimal to
 simply creating a new one and removing the old one when it's no longer
 required. Using fallocate (or posix_fallocate) (I have a patch for
 that!) to create a new one is - by my tests - 28 times faster than the
 currently-used method.

I don't think the comparison between just fallocate()ing and what we
currently do is fair. fallocate() doesn't guarantee that the file is the
same size after a crash, so you would still need an fsync() or we
couldn't use fdatasync() anymore. And I'd guess the benefits aren't all
that big anymore in that case?
That said, using posix_fallocate seems like a good idea in lots of
places inside pg, it's just not all that easy to do in some of the
places.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-13 Thread Jon Nelson
On Mon, May 13, 2013 at 8:32 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-05-12 19:41:26 -0500, Jon Nelson wrote:
 On Sun, May 12, 2013 at 3:46 PM, Jim Nasby j...@nasby.net wrote:
  On 5/10/13 1:06 PM, Jeff Janes wrote:
 
  Of course the paranoid DBA could turn off restart_after_crash and do a
  manual investigation on every crash, but in that case the database would
  refuse to restart even in the case where it perfectly clear that all the
  following WAL belongs to the recycled file and not the current file.
 
 
  Perhaps we should also allow for zeroing out WAL files before reuse (or 
  just
  disable reuse). I know there's a performance hit there, but the reuse idea
  happened before we had bgWriter. Theoretically the overhead creating a new
  file would always fall to bgWriter and therefore not be a big deal.

 For filesystems like btrfs, re-using a WAL file is suboptimal to
 simply creating a new one and removing the old one when it's no longer
 required. Using fallocate (or posix_fallocate) (I have a patch for
 that!) to create a new one is - by my tests - 28 times faster than the
 currently-used method.

 I don't think the comparison between just fallocate()ing and what we
 currently do is fair. fallocate() doesn't guarantee that the file is the
 same size after a crash, so you would still need an fsync() or we
 couldn't use fdatasync() anymore. And I'd guess the benefits aren't all
 that big anymore in that case?

fallocate (16MB) + fsync is still almost certainly faster than
write+write+write... + fsync.
The test I performed at the time did exactly that: posix_fallocate + pg_fsync.

 That said, using posix_fallocate seems like a good idea in lots of
 places inside pg, its just not all that easy to do in some of the
 places.

I should not derail this thread any further. Perhaps, if interested
parties would like to discuss the use of fallocate/posix_fallocate, a
new thread might be more appropriate?


--
Jon




Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-13 Thread Andres Freund
On 2013-05-13 08:45:41 -0500, Jon Nelson wrote:
 On Mon, May 13, 2013 at 8:32 AM, Andres Freund and...@2ndquadrant.com wrote:
  On 2013-05-12 19:41:26 -0500, Jon Nelson wrote:
  On Sun, May 12, 2013 at 3:46 PM, Jim Nasby j...@nasby.net wrote:
   On 5/10/13 1:06 PM, Jeff Janes wrote:
  
   Of course the paranoid DBA could turn off restart_after_crash and do a
   manual investigation on every crash, but in that case the database would
   refuse to restart even in the case where it perfectly clear that all the
   following WAL belongs to the recycled file and not the current file.
  
  
   Perhaps we should also allow for zeroing out WAL files before reuse (or 
   just
   disable reuse). I know there's a performance hit there, but the reuse 
   idea
   happened before we had bgWriter. Theoretically the overhead creating a 
   new
   file would always fall to bgWriter and therefore not be a big deal.
 
  For filesystems like btrfs, re-using a WAL file is suboptimal to
  simply creating a new one and removing the old one when it's no longer
  required. Using fallocate (or posix_fallocate) (I have a patch for
  that!) to create a new one is - by my tests - 28 times faster than the
  currently-used method.
 
  I don't think the comparison between just fallocate()ing and what we
  currently do is fair. fallocate() doesn't guarantee that the file is the
  same size after a crash, so you would still need an fsync() or we
  couldn't use fdatasync() anymore. And I'd guess the benefits aren't all
  that big anymore in that case?
 
 fallocate (16MB) + fsync is still almost certainly faster than
 write+write+write... + fsync.
 The test I performed at the time did exactly that .. posix_fallocate + 
 pg_fsync.
Sure, the initial file creation will be faster. But are the actual
individual wal writes (small, frequently fdatasync()ed) still faster?
That's the critical path currently.
Whether it is pretty much depends on how the filesystem manages
allocated but not initialized blocks...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-13 Thread Greg Stark
On Mon, May 13, 2013 at 2:49 PM, Andres Freund and...@2ndquadrant.com wrote:
 Sure, the initial file creation will be faster. But are the actual
 individual wal writes (small, frequently fdatasync()ed) still faster?
 That's the critical path currently.
 Whether it is pretty much depends on how the filesystem manages
 allocated but not initialized blocks...

In ext4, AIUI, it doesn't actually pick target blocks. It just adjusts
the accounting so it knows that many blocks will be needed for this
file and guarantees they'll be available. If you read from them it
knows to provide 0s. So in theory the performance in the critical path
would be worse but I think by an insignificant amount.

The reason Postgres pre-allocates the blocks is not for the
performance optimization. It's for safety. To guarantee -- as best as
possible -- that it won't get a write error when the time comes to
write to it. Especially to guarantee that the disk won't suddenly turn
out to be full.

It seems possible that some file systems would not protect you against
media errors nearly as well using it. It might take time to respond to
a media error and in a poorly written filesystem it might even be
reported to the application even though there's no need. But media
errors can occur any time, even after the initial write, so I don't
think this should be a blocker.  I think posix_fallocate is good
enough for us and I would support using it.

-- 
greg




Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-13 Thread Andres Freund
On 2013-05-13 16:03:11 +0100, Greg Stark wrote:
 On Mon, May 13, 2013 at 2:49 PM, Andres Freund and...@2ndquadrant.com wrote:
  Sure, the initial file creation will be faster. But are the actual
  individual wal writes (small, frequently fdatasync()ed) still faster?
  That's the critical path currently.
  Whether it is pretty much depends on how the filesystem manages
  allocated but not initialized blocks...
 
 In ext4 aIui it doesn't actually pick target blocks. It just adjusts
 the accounting so it knows that many blocks will be needed for this
 file and guarantees they'll be available. If you read from them it
 knows to provide 0s. So in theory the performance in the critical path
 would be worse but I think by an insignificant amount.
 
 The reason Postgres pre-allocates the blocks is not for the
 performance optimization. It's for safety. To guarantee -- as best as
 possible -- that it won't get a write error when the time comes to
 write to it. Especially to guarantee that the disk won't suddenly turn
 out to be full.

posix_fallocate() guarantees that. And if you fsync() the file
afterwards it's even supposed to still have enough space after the crash.

"After a successful call to posix_fallocate(), subsequent writes to
bytes in the specified range are guaranteed not to fail because of lack
of disk space."


 It seems possible that some file systems would not protect you against
 media errors nearly as well using it.

True. The same probably is true for modifications of existing files for
those fancy COW filesystems...

 I think posix_fallocate is good enough for us and I would support
 using it.

Me too, although this isn't the place where I'd be really excited to see
a patch implementing it properly ;)

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-13 Thread Simon Riggs
On 13 May 2013 14:45, Jon Nelson jnelson+pg...@jamponi.net wrote:

 I should not derail this thread any further. Perhaps, if interested
 parties would like to discuss the use of fallocate/posix_fallocate, a
 new thread might be more appropriate?

Sounds like a good idea. Always nice to see a fresh take on earlier ideas.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-12 Thread Jim Nasby

On 5/9/13 5:18 PM, Jeff Davis wrote:

On Thu, 2013-05-09 at 14:28 -0500, Jim Nasby wrote:

What about moving some critical data from the beginning of the WAL
record to the end? That would make it easier to detect that we don't
have a complete record. It wouldn't necessarily replace the CRC
though, so maybe that's not good enough.

Actually, what if we actually *duplicated* some of the same WAL header
info at the end of the record? Given a reasonable amount of data that
would damn-near ensure that a torn record was detected, because the
odds of having the exact same sequence of random bytes would be so
low. Potentially even just duplicating the LSN would suffice.


I think both of these ideas have some false positives and false
negatives.

If the corruption happens at the record boundary, and wipes out the
special information at the end of the record, then you might think it
was not fully flushed, and we're in the same position as today.

If the WAL record is large, and somehow the beginning and the end get
written to disk but not the middle, then it will look like corruption;
but really the WAL was just not completely flushed. This seems pretty
unlikely, but not impossible.

That being said, I like the idea of introducing some extra checks if a
perfect solution is not possible.


Yeah, I don't think a perfect solution is possible, short of attempting to tie 
directly into the filesystem (ie: on a journaling FS have some way to 
essentially treat the FS journal as WAL).

One additional step we might be able to take would be to scan forward looking 
for a record that would tell us when an fsync must have occurred (heck, maybe 
we should add an fsync WAL record...). If we find a corrupt WAL record followed 
by an fsync we know that we've now lost data. That closes some of the holes. 
Actually, that might handle all the holes...


On the separate write idea, if that could be controlled by a GUC I
think it'd be worth doing. Anyone that needs to worry about this
corner case probably has hardware that would support that.


It sounds pretty easy to do that naively. I'm just worried that the
performance will be so bad for so many users that it's not a very
reasonable choice.

Today, it would probably make more sense to just use sync rep. If the
master's WAL is corrupt, and it starts up too early, then that should be
obvious when you try to reconnect streaming replication. I haven't tried
it, but I'm assuming that it gives a useful error message.


I wonder if there are DW environments that are too large to keep a SR copy but 
would be able to afford the double-write overhead.

BTW, isn't performance what killed the double-buffer idea?
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-12 Thread Jim Nasby

On 5/10/13 1:06 PM, Jeff Janes wrote:

Of course the paranoid DBA could turn off restart_after_crash and do a manual 
investigation on every crash, but in that case the database would refuse to 
restart even in the case where it perfectly clear that all the following WAL 
belongs to the recycled file and not the current file.


Perhaps we should also allow for zeroing out WAL files before reuse (or just 
disable reuse). I know there's a performance hit there, but the reuse idea 
happened before we had bgWriter. Theoretically the overhead of creating a new file 
would always fall to bgWriter and therefore not be a big deal.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-12 Thread Jon Nelson
On Sun, May 12, 2013 at 3:46 PM, Jim Nasby j...@nasby.net wrote:
 On 5/10/13 1:06 PM, Jeff Janes wrote:

 Of course the paranoid DBA could turn off restart_after_crash and do a
 manual investigation on every crash, but in that case the database would
 refuse to restart even in the case where it perfectly clear that all the
 following WAL belongs to the recycled file and not the current file.


 Perhaps we should also allow for zeroing out WAL files before reuse (or just
 disable reuse). I know there's a performance hit there, but the reuse idea
 happened before we had bgWriter. Theoretically the overhead creating a new
 file would always fall to bgWriter and therefore not be a big deal.

For filesystems like btrfs, re-using a WAL file is suboptimal compared to
simply creating a new one and removing the old one when it's no longer
required. Using fallocate (or posix_fallocate) (I have a patch for
that!) to create a new one is - by my tests - 28 times faster than the
currently-used method.


--
Jon




Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-11 Thread Simon Riggs
On 10 May 2013 23:41, Jeff Davis pg...@j-davis.com wrote:
 On Fri, 2013-05-10 at 18:32 +0100, Simon Riggs wrote:
 We don't write() WAL except with an immediate sync(), so the chances
 of what you say happening are very low to impossible.

 Are you sure? An XLogwrtRqst contains a write and a flush pointer, so I
 assume they can be different.

I'm answering this just to complete the discussion.

Yes, the write and flush pointer can be different. The write/flush are
two actions; we do one first, then the other, very quickly, inside
XLogWrite().

But we cannot rely on that, since there are some times when we don't
do that, such as wal_buffer overflow and when background walwriter
writes are at the end of the ring buffer. Not common, but they do
exist and when they exist they write many blocks. Which is counter to
what I had said earlier.

This part of the proposal isn't necessary for us to get a good answer
95% of the time, so it is dropped.

As mentioned on other post, we can write
UpdateMinRecoveryPoint(LogwrtResult.Flush, false) every time we do
XLogBackgroundFlush() and some other rework to make that happen
correctly. I'll post a separate patch.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-10 Thread Simon Riggs
On 9 May 2013 23:13, Greg Stark st...@mit.edu wrote:
 On Thu, May 9, 2013 at 10:45 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On 9 May 2013 22:39, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 If the current WAL record is corrupt and the next WAL record is in
 every way valid, we can potentially continue.

 That seems like a seriously bad idea.

 I agree. But if you knew that were true, is stopping a better idea?

 Having one corrupt record followed by a valid record is not an
 abnormal situation. It could easily be the correct end of WAL.

I disagree, that *is* an abnormal situation and would not be the
correct end-of-WAL.

Each WAL record contains a prev pointer to the last WAL record. So
for the next record to be valid the prev pointer would need to be
exactly correct.
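
To illustrate the cross-check (a sketch only; the struct merely mirrors the
spirit of XLogRecord's xl_prev field, it is not the server's ReadRecord() code):

#include <stdbool.h>
#include <stdint.h>

typedef struct
{
    uint64_t    xl_prev;        /* start LSN of the previous record */
    uint32_t    xl_tot_len;     /* total record length */
    uint32_t    xl_crc;         /* CRC over the record */
} DemoRecordHeader;

/*
 * A "next" record only corroborates the current position if its back-pointer
 * lands exactly on the record just read; a stale record left over in a
 * recycled segment will normally fail this test.
 */
static bool
next_record_corroborates(const DemoRecordHeader *next, uint64_t this_record_lsn)
{
    return next->xl_prev == this_record_lsn;
}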


 However it is possible to reduce the window. Every time the
 transaction log is synced a different file can be updated with the a
 known minimum transaction log recovery point. Even if it's not synced
 consistently on every transaction commit or wal sync it would serve as
 a low water mark. Recovering to that point is not sufficient but is
 necessary for a consistent recovery. That file could be synced lazily,
 say, every 10s or something like that and would guarantee that any wal
 corruption would be caught except for the last 10s of wal traffic for
 example.

I think it would be easy enough to have the WALwriter update the
minRecoveryPoint once per cycle, after it has flushed WAL.

Given the importance of pg_control and its small size, it seems like
it would be a good idea to take a backup copy of it every checkpoint
to make sure we have that data safe. And have pg_resetxlog keep a copy
also every time that is run.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-10 Thread Greg Stark
On Fri, May 10, 2013 at 7:44 AM, Simon Riggs si...@2ndquadrant.com wrote:
 Having one corrupt record followed by a valid record is not an
 abnormal situation. It could easily be the correct end of WAL.

 I disagree, that *is* an abnormal situation and would not be the
 correct end-of-WAL.

 Each WAL record contains a prev pointer to the last WAL record. So
 for the next record to be valid the prev pointer would need to be
 exactly correct.

Well then you're wrong. The OS makes no guarantee that blocks are
written out in order. When the system crashes any random subset of the
blocks written but not synced might have been written to disk and
others not. There could be megabytes of correct WAL written with just
one block in the middle of the first record not written. If no xlog
sync had occurred (or one was in progress but not completed) then
that's the correct end of WAL.


-- 
greg




Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-10 Thread Amit Kapila
On Friday, May 10, 2013 6:09 PM Greg Stark wrote:
 On Fri, May 10, 2013 at 7:44 AM, Simon Riggs si...@2ndquadrant.com
 wrote:
  Having one corrupt record followed by a valid record is not an
  abnormal situation. It could easily be the correct end of WAL.
 
  I disagree, that *is* an abnormal situation and would not be the
  correct end-of-WAL.
 
  Each WAL record contains a prev pointer to the last WAL record. So
  for the next record to be valid the prev pointer would need to be
  exactly correct.
 
 Well then you're wrong. The OS makes no guarantee that blocks are
 written out in order. When the system crashes any random subset of the
 blocks written but not synced might have been written to disk and
 others not. There could be megabytes of correct WAL written with just
 one block in the middle of the first record not written. If no xlog
 sync had occurred (or one was in progress but not completed) then
 that's the correct end of WAL.

In the case where one block is missing, how can it even reach the next record
to check the prev pointer?
I think it is possible when one of the records is corrupt and the following
ones are okay, which I think is the
case in which it can proceed with a warning as suggested by Simon.

With Regards,
Amit Kapila.





Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-10 Thread Greg Stark
On Fri, May 10, 2013 at 5:31 PM, Amit Kapila amit.kap...@huawei.com wrote:
 In the case where one block is missing, how can it even reach to next record
 to check prev pointer.
 I think it can be possible when one of the record is corrupt and following
 are okay which I think is the
 case in which it can proceed with warning as suggested by Simon.

A single WAL record can be over 24kB. The checksum covers the entire
WAL record and if it reports corruption it can be because a chunk in
the middle wasn't flushed to disk before the system crashed. The
beginning of the WAL record with the length and checksum and the
entire following record with its prev pointer might have been flushed
but the missing block in the middle of this record means it can't be
replayed. This would be a normal situation in case of a system crash.

If you replayed the following record but not this record you would
have an inconsistent database. The following record could be an insert
for a child table with a foreign key reference to a tuple that was
inserted by the skipped record for example. Resulting in a database
that is logically inconsistent.

Or it could be an index insert for that tuple, which would result in a
physically inconsistent database with index pointers that point to
incorrect tuples. Index scans would return tuples that didn't match
the index or would miss tuples that should be returned.

-- 
greg




Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-10 Thread Tom Lane
Greg Stark st...@mit.edu writes:
 A single WAL record can be over 24kB.

<pedantic>
Actually, WAL records can run to megabytes.  Consider for example a
commit record for a transaction that dropped thousands of tables ---
there'll be info about each such table in the commit record, to cue
replay to remove those files.
</pedantic>

 If you replayed the following record but not this record you would
 have an inconsistent database. ...
 Or it could be an index insert for that tuple would would result in a
 physically inconsistent database with index pointers that point to
 incorrect tuples. Index scans would return tuples that didn't match
 the index  or would miss tuples that should be returned.

Skipping actions such as index page splits would lead to even more fun.

Even in simple cases such as successive inserts and deletions in the
same heap page, failing to replay some of the actions is going to be
disastrous.  The *best case* scenario for that is that WAL replay
PANICs when it notices that the action it's trying to replay is
inconsistent with the current state of the page, eg it's trying to
insert at a TID that already exists.

IMO we can't proceed past a broken WAL record.  The actually useful
suggestion upthread was that we try to notice whether there seem
to be valid WAL records past the broken one, so that we could warn
the DBA that some commits might have been lost.  I don't think we
can do much in the way of automatic data recovery, but we could give
the DBA a chance to do forensics rather than blindly starting up (and
promptly overwriting all the evidence).

regards, tom lane




Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-10 Thread Simon Riggs
On 10 May 2013 13:39, Greg Stark st...@mit.edu wrote:
 On Fri, May 10, 2013 at 7:44 AM, Simon Riggs si...@2ndquadrant.com wrote:
 Having one corrupt record followed by a valid record is not an
 abnormal situation. It could easily be the correct end of WAL.

 I disagree, that *is* an abnormal situation and would not be the
 correct end-of-WAL.

 Each WAL record contains a prev pointer to the last WAL record. So
 for the next record to be valid the prev pointer would need to be
 exactly correct.

 Well then you're wrong. The OS makes no guarantee that blocks are
 written out in order. When the system crashes any random subset of the
 blocks written but not synced might have been written to disk and
 others not. There could be megabytes of correct WAL written with just
 one block in the middle of the first record not written. If no xlog
 sync had occurred (or one was in progress but not completed) then
 that's the correct end of WAL.

I agree that the correct end of WAL is where the last sync occurred.
We don't write() WAL except with an immediate sync(), so the chances
of what you say happening are very low to impossible. The timing
window between the write and the sync is negligible and yet I/O would
need to occur in that window and also be out of order from the order
of the write, which is unlikely because an I/O elevator would either
not touch the order of writes at all, or would want to maintain
sequential order to avoid head movement, which is what we want. I
guess we should add here "...with disks, maybe not with SSDs".

In any case, what is more important is that your idea to make an
occasional write of the minRecoveryPoint and then use that to cross
check against current LSN would allow us to at least confirm that we
have a single corrupt record and report that situation accurately to
the user. That idea will cover 95+% of such problems anyway, since
what we care about is long sequences of WAL records, not just the last
few written at crash, which the above discussion focuses on.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-10 Thread Simon Riggs
On 10 May 2013 18:23, Tom Lane t...@sss.pgh.pa.us wrote:
 Greg Stark st...@mit.edu writes:
 A single WAL record can be over 24kB.

 pedantic
 Actually, WAL records can run to megabytes.  Consider for example a
 commit record for a transaction that dropped thousands of tables ---
 there'll be info about each such table in the commit record, to cue
 replay to remove those files.
 /pedantic

 If you replayed the following record but not this record you would
 have an inconsistent database. ...
 Or it could be an index insert for that tuple would would result in a
 physically inconsistent database with index pointers that point to
 incorrect tuples. Index scans would return tuples that didn't match
 the index  or would miss tuples that should be returned.

 Skipping actions such as index page splits would lead to even more fun.

 Even in simple cases such as successive inserts and deletions in the
 same heap page, failing to replay some of the actions is going to be
 disastrous.  The *best case* scenario for that is that WAL replay
 PANICs when it notices that the action it's trying to replay is
 inconsistent with the current state of the page, eg it's trying to
 insert at a TID that already exists.

Agreed

 IMO we can't proceed past a broken WAL record.  The actually useful
 suggestion upthread was that we try to notice whether there seem
 to be valid WAL records past the broken one, so that we could warn
 the DBA that some commits might have been lost.  I don't think we
 can do much in the way of automatic data recovery, but we could give
 the DBA a chance to do forensics rather than blindly starting up (and
 promptly overwriting all the evidence).

The usual answer is switchover to the standby, hoping that the WAL was
not corrupted before it got sent there also.

If all servers go down people will want a "carry on regardless" option
as well, since the fault can be investigated on a standby. Even with
all of the above caveats, lying on our backs with our feet in the air
'cos we lost a few blocks will not impress anyone. Doing that will
likely be a medium-high risk thing, but that will still be better than
the certainty of a down server. It would need to be a manually initiated
option.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-10 Thread Jeff Janes
On Fri, May 10, 2013 at 9:54 AM, Greg Stark st...@mit.edu wrote:

 On Fri, May 10, 2013 at 5:31 PM, Amit Kapila amit.kap...@huawei.com
 wrote:
  In the case where one block is missing, how can it even reach to next
 record
  to check prev pointer.
  I think it can be possible when one of the record is corrupt and
 following
  are okay which I think is the
  case in which it can proceed with warning as suggested by Simon.

 A single WAL record can be over 24kB. The checksum covers the entire
 WAL record and if it reports corruption it can be because a chunk in
 the middle wasn't flushed to disk before the system crashed. The
 beginning of the WAL record with the length and checksum and the
 entire following record with its prev pointer might have been flushed
 but the missing block in the middle of this record means it can't be
 replayed. This would be a normal situation in case of a system crash.

 If you replayed the following record but not this record you would
 have an inconsistent database.


I don't think we would ever want to *skip* the record and play the next
one.  But if it looks like the next record is valid, we might not want to
automatically open the database in a possibly inconsistent state and in the
process overwrite the only existing copy of those WAL records which would
be necessary to make it consistent.  Instead, could we present the DBA with
an explicit choice to either open the database, or try to reconstruct the
corrupted record via forensic inspection so that it can be played through
(I have no idea how likely it is that such an attempt would succeed), or to
copy the database for later inspection and then open it.

But based on your description, perhaps refusing to automatically restart
and forcing an explicit decision would happen a lot more often, during
normal crashes with no corruption, than I was thinking it would.

Of course the paranoid DBA could turn off restart_after_crash and do a
manual investigation on every crash, but in that case the database would
refuse to restart even in the case where it is perfectly clear that all the
following WAL belongs to the recycled file and not the current file.  They
would also have to turn off any startup scripts in init.d, to make sure a
rebooting server doesn't do recovery automatically and destroy evidence
that way.


Cheers,

Jeff


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-10 Thread Robert Haas
On Fri, May 10, 2013 at 2:06 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 But based on your description, perhaps refusing to automatically restart and
 forcing an explicit decision would happen a lot more often, during normal
 crashes with no corruption, than I was thinking it would.

I bet it would.  But I think Greg Stark's idea of periodically
flushing the current minimum recovery point to disk has a lot to
recommend it.  That would be massively more reliable than what we have
now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-10 Thread Jeff Davis
On Fri, 2013-05-10 at 18:32 +0100, Simon Riggs wrote:
 We don't write() WAL except with an immediate sync(), so the chances
 of what you say happening are very low to impossible.

Are you sure? An XLogwrtRqst contains a write and a flush pointer, so I
assume they can be different.

I agree that it sounds unlikely that blocks 100 and 102 would be
written, but not 101. But perhaps that's more likely in systems like ZFS
where the physical blocks might be in very different places.

Regards,
Jeff Davis






Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-10 Thread Greg Smith

On 5/10/13 1:32 PM, Simon Riggs wrote:

The timing
window between the write and the sync is negligible and yet I/O would
need to occur in that window and also be out of order from the order
of the write, which is unlikely because an I/O elevator would either
not touch the order of writes at all, or would want to maintain
sequential order to avoid head movement, which is what we want. I
guess we should add here ...with disks, maybe not with SSDs.


It's not really safe to make any assumptions about I/O elevators. 
Reordering gets done from the perspective of the last item written. 
When the previous write was at the logical end of the disk, it can just 
as easily re-order a queue of writes in the complete reverse order they 
were issued in.


The only way you can ever get a useful guarantee is when an fsync 
returns completion.  Writes can effectively go out in a completely 
random order until that point.  All you can rely on is throwing up a 
stop sign that says "tell me when all of them are done".  In between 
those, you have no idea of the ordering.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-10 Thread Amit kapila
On Friday, May 10, 2013 10:24 PM Greg Stark wrote:
On Fri, May 10, 2013 at 5:31 PM, Amit Kapila amit.kap...@huawei.com wrote:
 In the case where one block is missing, how can it even reach to next record
 to check prev pointer.
 I think it can be possible when one of the record is corrupt and following
 are okay which I think is the
 case in which it can proceed with warning as suggested by Simon.

A single WAL record can be over 24kB. The checksum covers the entire
WAL record and if it reports corruption it can be because a chunk in
the middle wasn't flushed to disk before the system crashed. The
beginning of the WAL record with the length and checksum and the
entire following record with its prev pointer might have been flushed
but the missing block in the middle of this record means it can't be
replayed. This would be a normal situation in case of a system crash.

The only point I wanted to make was that it can be only one such record;
the length can be large or small.


If you replayed the following record but not this record you would
have an inconsistent database. The following record could be an insert
for a child table with a foreign key reference to a tuple that was
inserted by the skipped record for example. Resulting in a database
that is logically inconsistent.

The corrupt record can be such that it leads to inconsistency in the database,
or it could be the commit record of a transaction which has performed only a
select operation.
It will be difficult or impossible to find such information during recovery,
but informing the DBA/user on such an occasion can be useful, along with an
optional way for him to continue (although I am not sure how the DBA can
decide in such a case; maybe some other information is needed as well).
The reason why it can be useful to allow DBA/user intervention in such cases
is that, in the case of ignoring data after one corrupt record, it can still
take time for the DBA/user to find which of the operations he needs to
perform again.

With Regards,
Amit Kapila. 



Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-09 Thread Jim Nasby

On 5/8/13 7:34 PM, Jeff Davis wrote:

On Wed, 2013-05-08 at 17:56 -0500, Jim Nasby wrote:

Apologies if this is a stupid question, but is this mostly an issue
due to torn pages? IOW, if we had a way to ensure we never see torn
pages, would that mean an invalid CRC on a WAL page indicated there
really was corruption on that page?

Maybe it's worth putting (yet more) thought into the torn page
issue... :/


Sort of. For data, a page is the logically-atomic unit that is expected
to be intact. For WAL, a record is the logically-atomic unit that is
expected to be intact.

So it might be better to say that the issue for the WAL is torn
records. A record might be larger than a page (it can hold up to three
full-page images in one record), but is often much smaller.

We use a CRC to validate that the WAL record is fully intact. The
concern is that, if it fails the CRC check, we *assume* that it's
because it wasn't completely flushed yet (i.e. a torn record). Based
on that assumption, neither that record nor any later record contains
committed transactions, so we can safely consider that the end of the
WAL (as of the crash) and bring the system up.

The problem is that the assumption is not always true: a CRC failure
could also indicate real corruption of WAL records that have been
previously flushed successfully, and may contain committed transactions.
That can mean we bring the system up way too early, corrupting the
database.

Unfortunately, it seems that doing any kind of validation to determine
that we have a valid end-of-the-WAL inherently requires some kind of
separate durable write somewhere. It would be a tiny amount of data (an
LSN and maybe some extra crosscheck information), so I could imagine
that would be just fine given the right hardware; but if we just write
to disk that would be pretty bad. Ideas welcome.


What about moving some critical data from the beginning of the WAL record to 
the end? That would make it easier to detect that we don't have a complete 
record. It wouldn't necessarily replace the CRC though, so maybe that's not 
good enough.

Actually, what if we actually *duplicated* some of the same WAL header info at 
the end of the record? Given a reasonable amount of data that would damn-near 
ensure that a torn record was detected, because the odds of having the exact 
same sequence of random bytes would be so low. Potentially even just 
duplicating the LSN would suffice.
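
A sketch of that layout (names and field sizes are assumptions, purely to make
the torn-record check concrete):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Assumed layout:  [header: lsn, tot_len, crc][payload ...][trailer: lsn again]
 * A torn write that loses the tail leaves the trailer disagreeing with the
 * header, which is cheap to detect even before the CRC is verified.
 */
static bool
record_tail_matches(const uint8_t *rec, uint32_t tot_len, uint64_t header_lsn)
{
    uint64_t    trailer_lsn;

    memcpy(&trailer_lsn, rec + tot_len - sizeof(trailer_lsn), sizeof(trailer_lsn));
    return trailer_lsn == header_lsn;
}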

On the separate write idea, if that could be controlled by a GUC I think it'd 
be worth doing. Anyone that needs to worry about this corner case probably has 
hardware that would support that.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-09 Thread Simon Riggs
On 9 May 2013 20:28, Jim Nasby j...@nasby.net wrote:

 Unfortunately, it seems that doing any kind of validation to determine
 that we have a valid end-of-the-WAL inherently requires some kind of
 separate durable write somewhere. It would be a tiny amount of data (an
 LSN and maybe some extra crosscheck information), so I could imagine
 that would be just fine given the right hardware; but if we just write
 to disk that would be pretty bad. Ideas welcome.

Not so sure.

If the WAL record length is intact, and it probably is, then we can
test whether the next WAL record is valid also.

If the current WAL record is corrupt and the next WAL record is
corrupt, then we have a problem.

If the current WAL record is corrupt and the next WAL record is in
every way valid, we can potentially continue. But we need to keep
track of accumulated errors to avoid getting into a worse situation.
Obviously, we would need to treat the next WAL record with complete
scepticism, but I have seen cases where only a single WAL record was
corrupt.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-09 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 If the current WAL record is corrupt and the next WAL record is in
 every way valid, we can potentially continue.

That seems like a seriously bad idea.

regards, tom lane




Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-09 Thread Simon Riggs
On 9 May 2013 22:39, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 If the current WAL record is corrupt and the next WAL record is in
 every way valid, we can potentially continue.

 That seems like a seriously bad idea.

I agree. But if you knew that were true, is stopping a better idea?

Any corrupt WAL record needs to halt recovery. What I'm thinking is to
check whether it might be possible to continue, and if all looks good,
offer the user the option to continue.

"Continuing could be dangerous and might cause deeper corruption of
your database. We suggest you take a backup of the server before
continuing."

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-09 Thread Greg Stark
On Thu, May 9, 2013 at 10:45 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On 9 May 2013 22:39, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 If the current WAL record is corrupt and the next WAL record is in
 every way valid, we can potentially continue.

 That seems like a seriously bad idea.

 I agree. But if you knew that were true, is stopping a better idea?

Having one corrupt record followed by a valid record is not an
abnormal situation. It could easily be the correct end of WAL.

I think it's not possible to protect 100% against this without giving
up the checksum optimization which implies doing two fsyncs per commit
instead of 1.

However it is possible to reduce the window. Every time the
transaction log is synced a different file can be updated with a
known minimum transaction log recovery point. Even if it's not synced
consistently on every transaction commit or wal sync it would serve as
a low water mark. Recovering to that point is not sufficient but is
necessary for a consistent recovery. That file could be synced lazily,
say, every 10s or something like that and would guarantee that any wal
corruption would be caught except for the last 10s of wal traffic for
example.
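
A minimal sketch of such a low water mark file, assuming a small standalone
file and a 10-second lazy sync; the file name and helper are hypothetical:

#include <fcntl.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

#define LOWWATER_FILE   "wal_low_water_mark"    /* hypothetical file name */
#define SYNC_INTERVAL   10                      /* seconds */

static time_t last_sync = 0;

/* Called after each WAL flush with the LSN that is now durably on disk. */
static void
update_low_water_mark(uint64_t flushed_lsn)
{
    int         fd = open(LOWWATER_FILE, O_CREAT | O_WRONLY, 0600);
    time_t      now;

    if (fd < 0)
        return;                 /* advisory only, so ignore errors here */

    /* Overwrite in place; the value only ever moves forward. */
    (void) pwrite(fd, &flushed_lsn, sizeof(flushed_lsn), 0);

    /* Sync lazily; corruption below the last synced LSN then becomes detectable. */
    now = time(NULL);
    if (now - last_sync >= SYNC_INTERVAL)
    {
        (void) fsync(fd);
        last_sync = now;
    }
    close(fd);
}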

If you're only interested in database consistency and not lost commits
then that file could be synced on buffer xlog flushes (making a
painful case even more painful). Off the top of my head that would be
sufficient to guarantee that a corrupt xlog that would create an
inconsistent database would not be missed. I may be missing cases
involving checkpoints or the like though.


-- 
greg




Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-09 Thread Jeff Davis
On Thu, 2013-05-09 at 14:28 -0500, Jim Nasby wrote:
 What about moving some critical data from the beginning of the WAL
 record to the end? That would make it easier to detect that we don't
 have a complete record. It wouldn't necessarily replace the CRC
 though, so maybe that's not good enough.
 
 Actually, what if we actually *duplicated* some of the same WAL header
 info at the end of the record? Given a reasonable amount of data that
 would damn-near ensure that a torn record was detected, because the
 odds of having the exact same sequence of random bytes would be so
 low. Potentially even just duplicating the LSN would suffice.

I think both of these ideas have some false positives and false
negatives.

If the corruption happens at the record boundary, and wipes out the
special information at the end of the record, then you might think it
was not fully flushed, and we're in the same position as today.

If the WAL record is large, and somehow the beginning and the end get
written to disk but not the middle, then it will look like corruption;
but really the WAL was just not completely flushed. This seems pretty
unlikely, but not impossible.

That being said, I like the idea of introducing some extra checks if a
perfect solution is not possible.

 On the separate write idea, if that could be controlled by a GUC I
 think it'd be worth doing. Anyone that needs to worry about this
 corner case probably has hardware that would support that.

It sounds pretty easy to do that naively. I'm just worried that the
performance will be so bad for so many users that it's not a very
reasonable choice.

Today, it would probably make more sense to just use sync rep. If the
master's WAL is corrupt, and it starts up too early, then that should be
obvious when you try to reconnect streaming replication. I haven't tried
it, but I'm assuming that it gives a useful error message.

Regards,
Jeff Davis






Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-09 Thread Jeff Davis
On Thu, 2013-05-09 at 23:13 +0100, Greg Stark wrote:
 However it is possible to reduce the window...

Sounds reasonable.

It's fairly limited though -- the window is already a checkpoint
(typically 5-30 minutes), and we'd bring that down an order of magnitude
(10s). I speculate that, if it got corrupted within 30 minutes, it
probably got corrupted at the time of being written (as happened in Jeff
Janes's case, due to a bug).

So, the question is: if the WAL is corrupted on write, does reducing the
window significantly increase the chances that the wal writer will hang
around long enough before a crash to flush this other file?

On the other hand, checkpoint hides any corrupt WAL records by not
replaying them, whereas your scheme would identify that there is a
problem.

I don't think this would have helped Jeff Janes's case because I think
the crashes were happening too quickly. But that is artificial, so it
may help in real cases.

I just had a thought: we don't necessarily need to flush the auxiliary
file each time; merely writing it to the kernel buffers would help a
lot. Maybe an extra write() of the auxiliary file during a WAL flush
isn't so bad; and combined with periodic fsync()s of the auxiliary file,
should offer a lot of coverage against problems.

Regards,
Jeff Davis






Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-08 Thread Jim Nasby

On 4/5/13 6:39 PM, Jeff Davis wrote:

On Fri, 2013-04-05 at 10:34 +0200, Florian Pflug wrote:

Maybe we could scan forward to check whether a corrupted WAL record is
followed by one or more valid ones with sensible LSNs. If it is,
chances are high that we haven't actually hit the end of the WAL. In
that case, we could either log a warning, or (better, probably) abort
crash recovery.


+1.


Corruption of fields which we require to scan past the record would
cause false negatives, i.e. not trigger an error even though we do
abort recovery mid-way through. There's a risk of false positives too,
but they require quite specific orderings of writes and thus seem
rather unlikely. (AFAICS, the OS would have to write some parts of
record N followed by the whole of record N+1 and then crash to cause a
false positive).


Does the xlp_pageaddr help solve this?

Also, we'd need to be a little careful when written-but-not-flushed WAL
data makes it to disk, which could cause a false positive and may be a
fairly common case.


Apologies if this is a stupid question, but is this mostly an issue due to torn 
pages? IOW, if we had a way to ensure we never see torn pages, would that mean 
an invalid CRC on a WAL page indicated there really was corruption on that page?

Maybe it's worth putting (yet more) thought into the torn page issue... :/
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-08 Thread Jeff Davis
On Wed, 2013-05-08 at 17:56 -0500, Jim Nasby wrote:
 Apologies if this is a stupid question, but is this mostly an issue
 due to torn pages? IOW, if we had a way to ensure we never see torn
 pages, would that mean an invalid CRC on a WAL page indicated there
 really was corruption on that page?
 
 Maybe it's worth putting (yet more) thought into the torn page
 issue... :/

Sort of. For data, a page is the logically-atomic unit that is expected
to be intact. For WAL, a record is the logically-atomic unit that is
expected to be intact.

So it might be better to say that the issue for the WAL is torn
records. A record might be larger than a page (it can hold up to three
full-page images in one record), but is often much smaller.

We use a CRC to validate that the WAL record is fully intact. The
concern is that, if it fails the CRC check, we *assume* that it's
because it wasn't completely flushed yet (i.e. a torn record). Based
on that assumption, neither that record nor any later record contains
committed transactions, so we can safely consider that the end of the
WAL (as of the crash) and bring the system up.

The problem is that the assumption is not always true: a CRC failure
could also indicate real corruption of WAL records that have been
previously flushed successfully, and may contain committed transactions.
That can mean we bring the system up way too early, corrupting the
database.

Unfortunately, it seems that doing any kind of validation to determine
that we have a valid end-of-the-WAL inherently requires some kind of
separate durable write somewhere. It would be a tiny amount of data (an
LSN and maybe some extra crosscheck information), so I could imagine
that would be just fine given the right hardware; but if we just write it
to an ordinary disk, the performance hit would be pretty bad. Ideas welcome.
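For scale, the crosscheck data could be as small as this (invented names,
just to illustrate the decision it would enable):

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct EndOfWalCrossCheck
    {
        uint64_t    last_flushed_lsn;   /* highest LSN known to be durably on disk */
        uint32_t    crc;                /* CRC over the field above, to detect a torn write */
    } EndOfWalCrossCheck;

    /* If a WAL record fails its CRC check *before* the known-flushed LSN, it
     * cannot be an innocent torn record at the end of WAL: treat it as corruption. */
    static bool
    crc_failure_is_corruption(uint64_t failed_lsn, const EndOfWalCrossCheck *xc)
    {
        return failed_lsn < xc->last_flushed_lsn;
    }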

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-07 Thread Robert Haas
On Mon, May 6, 2013 at 5:04 PM, Jeff Davis pg...@j-davis.com wrote:
 On Mon, 2013-05-06 at 15:31 -0400, Robert Haas wrote:
 On Wed, May 1, 2013 at 3:04 PM, Jeff Davis pg...@j-davis.com wrote:
  Regardless, you have a reasonable claim that my patch had effects that
  were not necessary. I have attached a draft patch to remedy that. Only
  rudimentary testing was done.

 This looks reasonable to me.

 Can you please explain the scenario that loses many VM bits at once
 during a crash, and results in a bunch of already-all-visible heap pages
 being dirtied for no reason?

Hmm.  Rereading your last email, I see your point: since we now have
HEAP_XLOG_VISIBLE, this is much less of an issue than it would have
been before.  I'm still not convinced that simplifying that code is a
good idea, but maybe it doesn't really hurt us much in practice.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-07 Thread Jeff Davis
On Tue, 2013-05-07 at 13:20 -0400, Robert Haas wrote:
 Hmm.  Rereading your last email, I see your point: since we now have
 HEAP_XLOG_VISIBLE, this is much less of an issue than it would have
 been before.  I'm still not convinced that simplifying that code is a
 good idea, but maybe it doesn't really hurt us much in practice.

Given that there's not a big impact one way or another, I don't mind
whether this particular patch is committed or not. Whichever you think
is more understandable and safer at this late hour. Also, to be clear,
the fact that I posted a patch was not meant to advocate either way;
merely to present the options.

Not sure exactly what you mean about the code simplification, but I
agree that anything more substantial than this patch should be left for
9.4.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-06 Thread Robert Haas
On Wed, May 1, 2013 at 3:04 PM, Jeff Davis pg...@j-davis.com wrote:
 Regardless, you have a reasonable claim that my patch had effects that
 were not necessary. I have attached a draft patch to remedy that. Only
 rudimentary testing was done.

This looks reasonable to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-06 Thread Jeff Davis
On Mon, 2013-05-06 at 15:31 -0400, Robert Haas wrote:
 On Wed, May 1, 2013 at 3:04 PM, Jeff Davis pg...@j-davis.com wrote:
  Regardless, you have a reasonable claim that my patch had effects that
  were not necessary. I have attached a draft patch to remedy that. Only
  rudimentary testing was done.
 
 This looks reasonable to me.

Can you please explain the scenario that loses many VM bits at once
during a crash, and results in a bunch of already-all-visible heap pages
being dirtied for no reason?

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-04 Thread Simon Riggs
On 3 May 2013 21:53, Jeff Davis pg...@j-davis.com wrote:

 At this point, I don't think more changes are required.

After detailed further analysis, I agree, no further changes are required.

I think the code in that area needs considerable refactoring to
improve things. I've looked for an easy way to avoid calling
PageSetLSN() twice, but I don't see one, which is the thing I thought
was a problem. I don't much like the nested critical sections either.
But overall, we do follow the requirements for WAL laid out in the
README, in that we dirty the buffer first, insert WAL, then set LSN,
all within a critical section and with buffer locking. So I don't see
any place where this will break with the current coding.

In the future I would hope to see that code simplified so that we do
just one WAL record per block, rather than the 3 (or more?) records
that can get written now: freeze, clean and visibility.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-03 Thread Simon Riggs
On 1 May 2013 20:40, Jeff Davis pg...@j-davis.com wrote:

 Looks easy. There is no additional logic for checksums, so there's no
 third complexity.

 So we either have
 * cleanup info with vismap setting info
 * cleanup info only

 which is the same number of WAL records as we have now, just that we
 never emit 2 records when one will do.

 What if we only set PD_ALL_VISIBLE and the VM bit, and make no other
 changes? Right now, that generates no WAL for the heap page (unless
 checksums are enabled, of course), which means no full page images for
 the heap pages.

The only place I see that code path is when we clear up a heap page
that is empty.

Is that what you meant?

Empty pages are pretty rare, so I guess I meant that we should just
get rid of the vismap update when we see that. Since we're adding the
block straight back into the FSM, it won't stay all-visible for very
long.



Bottom line is: is there a problem there, or not?
If there's not, I'm happy. If there is, then do we have another plan
than making this into one WAL record, so it is atomic?
And presumably y'all want me to fix it.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-03 Thread Jeff Davis
On Fri, 2013-05-03 at 19:52 +0100, Simon Riggs wrote:
 On 1 May 2013 20:40, Jeff Davis pg...@j-davis.com wrote:
 
  Looks easy. There is no additional logic for checksums, so there's no
  third complexity.
 
  So we either have
  * cleanup info with vismap setting info
  * cleanup info only
 
  which is the same number of WAL records as we have now, just that we
  never emit 2 records when one will do.
 
  What if we only set PD_ALL_VISIBLE and the VM bit, and make no other
  changes? Right now, that generates no WAL for the heap page (unless
  checksums are enabled, of course), which means no full page images for
  the heap pages.
 
 The only place I see that code path is when we clear up a heap page
 that is empty.

Or if there are no other changes to make to the heap page. For example,
if you do a fresh data load, then set the hint bits, and then do a
VACUUM. The only changes needed during VACUUM are setting PD_ALL_VISIBLE
and the VM bit.

Right now, that does not cause a WAL record to be written for the heap
page, so there are no full-page images for the heap page.

At this point, I don't think more changes are required. Robert seemed
concerned about dirtying extra pages after a crash, but I didn't
entirely understand his reasoning. I am inclined toward simplifying this
part of the code as you suggest, but maybe that's better left for 9.4
(which would give me a chance to reignite the discussion about whether
PD_ALL_VISIBLE is needed at all) unless there is an actual problem.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-01 Thread Simon Riggs
On 30 April 2013 22:54, Jeff Davis pg...@j-davis.com wrote:
 On Tue, 2013-04-30 at 08:34 -0400, Robert Haas wrote:
 Uh, wait a minute.  I think this is completely wrong.  The buffer is
 LOCKED for this entire sequence of operations.  For a checkpoint to
 happen, it's got to write every buffer, which it will not be able to
 do for so long as the buffer is locked.

 I went back and forth on this, so you could be right, but here was my
 reasoning:

 I was worried because SyncOneBuffer checks whether it needs writing
 without taking a content lock, so the exclusive lock doesn't help. That
 makes sense, because you don't want a checkpoint to have to get a
 content lock on every buffer in the buffer pool. But it also means we
 need to follow the rules laid out in transam/README and dirty the pages
 before writing WAL.

 The effect of the change to lazy_scan_heap is to force the buffer to
 be written even if we're only updating the visibility map page.
 That's a bad idea and should be reverted.

 The only time the VM and the data page are out of sync during vacuum is
 after a crash, right? If that's the case, I didn't think it was a big
 deal to dirty one extra page (should be extremely rare). Am I missing
 something?

 The reason I removed that special case was just code
 complexity/readability. I tried preserving the previous behavior, and
 it's not so bad, but it seemed unnecessarily ugly for the benefit of a
 rare case.

All of that makes perfect sense to me.

Waiting to hear back from Robert whether he still has an objection.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-01 Thread Robert Haas
On Tue, Apr 30, 2013 at 5:54 PM, Jeff Davis pg...@j-davis.com wrote:
 On Tue, 2013-04-30 at 08:34 -0400, Robert Haas wrote:
 Uh, wait a minute.  I think this is completely wrong.  The buffer is
 LOCKED for this entire sequence of operations.  For a checkpoint to
 happen, it's got to write every buffer, which it will not be able to
 do for so long as the buffer is locked.

 I went back and forth on this, so you could be right, but here was my
 reasoning:

 I was worried because SyncOneBuffer checks whether it needs writing
 without taking a content lock, so the exclusive lock doesn't help. That
 makes sense, because you don't want a checkpoint to have to get a
 content lock on every buffer in the buffer pool. But it also means we
 need to follow the rules laid out in transam/README and dirty the pages
 before writing WAL.

On further review, I think you're right about this.  We'd have a
problem if the following sequence of events were to occur:

1. vacuum writes the WAL record, but does not yet mark the buffer
dirty, and then goes into the tank
2. checkpointer decides where the checkpoint will begin
3. buffer pool is scanned as part of the checkpoint process, observing
target buffer not to be dirty
4. vacuum finally wakes up and marks the buffer dirty
5. crash, after WAL is flushed but before buffer is written

However, on looking at this a little more, shouldn't we be doing
log_heap_clean() *before* we do visibilitymap_set()?

 The effect of the change to lazy_scan_heap is to force the buffer to
 be written even if we're only updating the visibility map page.
 That's a bad idea and should be reverted.

 The only time the VM and the data page are out of sync during vacuum is
 after a crash, right? If that's the case, I didn't think it was a big
 deal to dirty one extra page (should be extremely rare). Am I missing
 something?

 The reason I removed that special case was just code
 complexity/readability. I tried preserving the previous behavior, and
 it's not so bad, but it seemed unnecessarily ugly for the benefit of a
 rare case.

Well, I don't find it that ugly, but then again

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-01 Thread Robert Haas
On Wed, May 1, 2013 at 11:29 AM, Robert Haas robertmh...@gmail.com wrote:
 I was worried because SyncOneBuffer checks whether it needs writing
 without taking a content lock, so the exclusive lock doesn't help. That
 makes sense, because you don't want a checkpoint to have to get a
 content lock on every buffer in the buffer pool. But it also means we
 need to follow the rules laid out in transam/README and dirty the pages
 before writing WAL.

 On further review, I think you're right about this.  We'd have a
 problem if the following sequence of events were to occur:

 1. vacuum writes the WAL record, but does not yet mark the buffer
 dirty, and then goes into the tank
 2. checkpointer decides where the checkpoint will begin
 3. buffer pool is scanned as part of the checkpoint process, observing
 target buffer not to be dirty
 4. vacuum finally wakes up and marks the buffer dirty
 5. crash, after WAL is flushed but before buffer is written

 However, on looking at this a little more, shouldn't we be doing
 log_heap_clean() *before* we do visibilitymap_set()?

Hit send too soon.

To finish that thought: right now the order of operations is...

1. Perform the cleanup operations on the buffer.
2. Set the visibility map bit.
3. Log the visibility map change.
4. Log the cleanup operations.

It seems to me that it would be better to do (4) as close to (1) as
possible.  Isn't it bad if the operations are replayed in an order
that differs from the order in which they were performed - or if, say,
the first WAL record were replayed but the second were not, for
whatever reason?
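In other words, something along these lines (locking, declarations and most
argument lists elided; only the ordering matters here):

    /* (1) perform the cleanup operations on the buffer ... */
    MarkBufferDirty(buffer);
    /* (4) ... and WAL-log them immediately afterwards */
    recptr = log_heap_clean(rel, buffer, ...);
    PageSetLSN(page, recptr);

    /* (2) + (3) only then set the visibility map bit, which logs its own change */
    PageSetAllVisible(page);
    visibilitymap_set(rel, blkno, buffer, ...);   /* remaining arguments elided */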

 The effect of the change to lazy_scan_heap is to force the buffer to
 be written even if we're only updating the visibility map page.
 That's a bad idea and should be reverted.

 The only time the VM and the data page are out of sync during vacuum is
 after a crash, right? If that's the case, I didn't think it was a big
 deal to dirty one extra page (should be extremely rare). Am I missing
 something?

 The reason I removed that special case was just code
 complexity/readability. I tried preserving the previous behavior, and
 it's not so bad, but it seemed unnecessarily ugly for the benefit of a
 rare case.

 Well, I don't find it that ugly, but then again

...I've done a fair amount of hacking on this code.  The scenario that
I find problematic here is that you could easily lose a couple of
visibility map pages in a crash.  Each one you lose, with this patch,
potentially involves half a gigabyte of unnecessary disk writes, if
the relation is being vacuumed at the time.  For the few extra lines
of code it takes to protect against that scenario, I think we should
protect against it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-01 Thread Simon Riggs
On 1 May 2013 16:33, Robert Haas robertmh...@gmail.com wrote:
 On Wed, May 1, 2013 at 11:29 AM, Robert Haas robertmh...@gmail.com wrote:
 I was worried because SyncOneBuffer checks whether it needs writing
 without taking a content lock, so the exclusive lock doesn't help. That
 makes sense, because you don't want a checkpoint to have to get a
 content lock on every buffer in the buffer pool. But it also means we
 need to follow the rules laid out in transam/README and dirty the pages
 before writing WAL.

 On further review, I think you're right about this.  We'd have a
 problem if the following sequence of events were to occur:

 1. vacuum writes the WAL record, but does not yet mark the buffer
 dirty, and then goes into the tank
 2. checkpointer decides where the checkpoint will begin
 3. buffer pool is scanned as part of the checkpoint process, observing
 target buffer not to be dirty
 4. vacuum finally wakes up and marks the buffer dirty
 5. crash, after WAL is flushed but before buffer is written

 However, on looking at this a little more, shouldn't we be doing
 log_heap_clean() *before* we do visibilitymap_set()?

 Hit send too soon.

 To finish that thought: right now the order of operations is...

 1. Perform the cleanup operations on the buffer.
 2. Set the visibility map bit.
 3. Log the visibility map change.
 4. Log the cleanup operations.

 It seems to me that it would be better to do (4) as close to (1) as
 possible.  Isn't it bad if the operations are replayed in an order
 that differs from the order in which they were performed - or if, say,
 the first WAL record were replayed but the second were not, for
 whatever reason?

I agree, but that was in the original coding wasn't it?

Why aren't we writing just one WAL record for this action? We use a
single WAL record in other places where we make changes to multiple
blocks with multiple full page writes, e.g. index block split. That
would make the action atomic and we'd just have this...

1. Perform the cleanup operations on the buffer.
2. Set the visibility map bit.
3. Log the cleanup operations and visibility map change.

which can then be replayed with correct sequence, locking etc.
and AFAICS would likely be faster also.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-01 Thread Robert Haas
On Wed, May 1, 2013 at 1:02 PM, Simon Riggs si...@2ndquadrant.com wrote:
 I agree, but that was in the original coding wasn't it?

I believe the problem was introduced by this commit:

commit fdf9e21196a6f58c6021c967dc5776a16190f295
Author: Heikki Linnakangas heikki.linnakan...@iki.fi
Date:   Wed Feb 13 17:46:23 2013 +0200

Update visibility map in the second phase of vacuum.

There's a high chance that a page becomes all-visible when the second phase
of vacuum removes all the dead tuples on it, so it makes sense to check for
that. Otherwise the visibility map won't get updated until the next vacuum.

Pavan Deolasee, reviewed by Jeff Janes.

 Why aren't we writing just one WAL record for this action? We use a
 single WAL record in other places where we make changes to multiple
 blocks with multiple full page writes, e.g. index block split. That
 would make the action atomic and we'd just have this...

 1. Perform the cleanup operations on the buffer.
 2. Set the visibility map bit.
 3. Log the cleanup operations and visibility map change.

 which can then be replayed with correct sequence, locking etc.
 and AFAICS would likely be faster also.

I thought about that, too.  It certainly seems like more than we want
to try to do for 9.3 at this point.  The other complication is that
there's a lot of conditional logic here.  We're definitely going to
emit a cleanup record.  We're going to emit a record to make the page
all-visible only sometimes, because it might not be all-visible yet:
it could have tuples on it that are deleted but not yet dead.  And
then there's additional logic to handle the checksum case.  Plus, the
all-visible marking can happen in other code paths, too, specifically
in phase 1 of vacuum.  So it might be possible to consolidate this,
but off-hand it looks messy to me out of proportion to the benefits.

Now that I'm looking at this, I'm a bit confused by the new logic in
visibilitymap_set().  When checksums are enabled, we set the page LSN,
which is described like this: we need to protect the heap page from
being torn.  But how does setting the page LSN do that?  Don't we
need to mark the buffer dirty or something like that?  Sorry if this
is a dumb question.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-01 Thread Jeff Davis
On Wed, 2013-05-01 at 11:33 -0400, Robert Haas wrote:
  The only time the VM and the data page are out of sync during vacuum is
  after a crash, right? If that's the case, I didn't think it was a big
  deal to dirty one extra page (should be extremely rare). Am I missing
  something?
 
  The reason I removed that special case was just code
  complexity/readability. I tried preserving the previous behavior, and
  it's not so bad, but it seemed unnecessarily ugly for the benefit of a
  rare case.
 
  Well, I don't find it that ugly, but then again
 
 ...I've done a fair amount of hacking on this code.  The scenario that
 I find problematic here is that you could easily lose a couple of
 visibility map pages in a crash.  Each one you lose, with this patch,
 potentially involves half a gigabyte of unnecessary disk writes, if
 the relation is being vacuumed at the time.  For the few extra lines
 of code it takes to protect against that scenario, I think we should
 protect against it.

I'm still unclear on how we'd lose so many VM bits at once, because they
are logged. I see how we can lose one if the heap page makes it to disk
before the VM bit's WAL is flushed, but that seems to be bounded to me.

Regardless, you have a reasonable claim that my patch had effects that
were not necessary. I have attached a draft patch to remedy that. Only
rudimentary testing was done.

Regards,
Jeff Davis

*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***
*** 5755,5762  log_heap_freeze(Relation reln, Buffer buffer,
   * corresponding visibility map block.	Both should have already been modified
   * and dirtied.
   *
!  * If checksums are enabled, we also add the heap_buffer to the chain to
!  * protect it from being torn.
   */
  XLogRecPtr
  log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
--- 5755,5762 
   * corresponding visibility map block.	Both should have already been modified
   * and dirtied.
   *
!  * If checksums are enabled and the heap buffer was changed (PD_ALL_VISIBLE
!  * set), we add it to the chain to protect it from being torn.
   */
  XLogRecPtr
  log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
***
*** 5784,5790  log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
  	rdata[1].buffer_std = false;
  	rdata[1].next = NULL;
  
! 	if (DataChecksumsEnabled())
  	{
  		rdata[1].next = &(rdata[2]);
  
--- 5784,5790 
  	rdata[1].buffer_std = false;
  	rdata[1].next = NULL;
  
! 	if (DataChecksumsEnabled() && BufferIsValid(heap_buffer))
  	{
  		rdata[1].next = &(rdata[2]);
  
*** a/src/backend/access/heap/visibilitymap.c
--- b/src/backend/access/heap/visibilitymap.c
***
*** 233,242  visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
   * marked all-visible; it is needed for Hot Standby, and can be
   * InvalidTransactionId if the page contains no tuples.
   *
!  * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
!  * this function. Except in recovery, caller should also pass the heap
!  * buffer. When checksums are enabled and we're not in recovery, we must add
!  * the heap buffer to the WAL chain to protect it from being torn.
   *
   * You must pass a buffer containing the correct map page to this function.
   * Call visibilitymap_pin first to pin the right one. This function doesn't do
--- 233,243 
   * marked all-visible; it is needed for Hot Standby, and can be
   * InvalidTransactionId if the page contains no tuples.
   *
!  * The caller should supply heapBuf if and only if it set PD_ALL_VISIBLE on the
!  * heap page, and we're not in recovery.  In that case, the caller should have
!  * already set PD_ALL_VISIBLE and marked the page dirty.  When checksums are
!  * enabled and heapBuf is valid, this function will add heapBuf to the WAL
!  * chain to protect it from being torn.
   *
   * You must pass a buffer containing the correct map page to this function.
   * Call visibilitymap_pin first to pin the right one. This function doesn't do
***
*** 257,263  visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
  #endif
  
  	Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
- 	Assert(InRecovery || BufferIsValid(heapBuf));
  
  	/* Check that we have the right heap page pinned, if present */
  	if (BufferIsValid(heapBuf) && BufferGetBlockNumber(heapBuf) != heapBlk)
--- 258,263 
***
*** 290,296  visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
   * If data checksums are enabled, we need to protect the heap
   * page from being torn.
   */
! if (DataChecksumsEnabled())
  {
  	Page heapPage = BufferGetPage(heapBuf);
  
--- 290,296 
   * If data checksums are enabled, we need to protect the heap
   * page from being torn.
   */
! if (DataChecksumsEnabled() && BufferIsValid(heapBuf))
  {
  	Page heapPage = 

Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-01 Thread Simon Riggs
On 1 May 2013 19:16, Robert Haas robertmh...@gmail.com wrote:
 On Wed, May 1, 2013 at 1:02 PM, Simon Riggs si...@2ndquadrant.com wrote:
 I agree, but that was in the original coding wasn't it?

 I believe the problem was introduced by this commit:

 commit fdf9e21196a6f58c6021c967dc5776a16190f295
 Author: Heikki Linnakangas heikki.linnakan...@iki.fi
 Date:   Wed Feb 13 17:46:23 2013 +0200

 Update visibility map in the second phase of vacuum.

 There's a high chance that a page becomes all-visible when the second 
 phase
 of vacuum removes all the dead tuples on it, so it makes sense to check 
 for
 that. Otherwise the visibility map won't get updated until the next 
 vacuum.

 Pavan Deolasee, reviewed by Jeff Janes.

 Why aren't we writing just one WAL record for this action? We use a
 single WAL record in other places where we make changes to multiple
 blocks with multiple full page writes, e.g. index block split. That
 would make the action atomic and we'd just have this...

 1. Perform the cleanup operations on the buffer.
 2. Set the visibility map bit.
 3. Log the cleanup operations and visibility map change.

 which can then be replayed with correct sequence, locking etc.
 and AFAICS would likely be faster also.

 I thought about that, too.  It certainly seems like more than we want
 to try to do for 9.3 at this point.  The other complication is that
 there's a lot of conditional logic here.  We're definitely going to
 emit a cleanup record.  We're going to emit a record to make the page
 all-visible only sometimes, because it might not be all-visible yet:
 it could have tuples on it that are deleted but not yet dead.  And
 then there's additional logic to handle the checksum case.  Plus, the
 all-visible marking can happen in other code paths, too, specifically
 in phase 1 of vacuum.  So it might be possible to consolidate this,
 but off-hand it looks messy to me out of proportion to the benefits.

Looks easy. There is no additional logic for checksums, so there's no
third complexity.

So we either have
* cleanup info with vismap setting info
* cleanup info only

which is the same number of WAL records as we have now, just that we
never emit 2 records when one will do.

 Now that I'm looking at this, I'm a bit confused by the new logic in
 visibilitymap_set().  When checksums are enabled, we set the page LSN,
 which is described like this: we need to protect the heap page from
 being torn.  But how does setting the page LSN do that?

It doesn't.

 Don't we
 need to mark the buffer dirty or something like that?

We do.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-01 Thread Jeff Davis
On Wed, 2013-05-01 at 14:16 -0400, Robert Haas wrote:
 Now that I'm looking at this, I'm a bit confused by the new logic in
 visibilitymap_set().  When checksums are enabled, we set the page LSN,
 which is described like this: we need to protect the heap page from
 being torn.  But how does setting the page LSN do that?  Don't we
 need to mark the buffer dirty or something like that?  Sorry if this
 is a dumb question.

The caller is supposed to dirty the heap page when setting
PD_ALL_VISIBLE. I thought I said that explicitly in the comments
somewhere, but now I don't see it.

It is slightly awkward to put the comment about the page being torn just
before setting the LSN. Feel free to move/remove/reword if you can think
of something better.

Because setting PD_ALL_VISIBLE does not write WAL by itself, my design
was to have it piggyback on the VM WAL record if checksums are turned
on. It would be nice to just use MarkBufferDirtyHint, but that does not
guarantee that the page will actually be marked dirty (see comments);
and we really need it to be properly marked dirty (otherwise we risk the
VM bit making it to disk while the heap page does not).

I also explored the idea of passing more options to MarkBufferDirty so
that we could force it to do what we want in each case, but decided
against it:

http://www.postgresql.org/message-id/1357798000.5162.38.camel@jdavis-laptop

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-05-01 Thread Jeff Davis
On Wed, 2013-05-01 at 20:06 +0100, Simon Riggs wrote:
  Why aren't we writing just one WAL record for this action?

...

 
  I thought about that, too.  It certainly seems like more than we want
  to try to do for 9.3 at this point.  The other complication is that
  there's a lot of conditional logic here.

...

 Looks easy. There is no additional logic for checksums, so there's no
 third complexity.
 
 So we either have
 * cleanup info with vismap setting info
 * cleanup info only
 
 which is the same number of WAL records as we have now, just that we
 never emit 2 records when one will do.

What if we only set PD_ALL_VISIBLE and the VM bit, and make no other
changes? Right now, that generates no WAL for the heap page (unless
checksums are enabled, of course), which means no full page images for
the heap pages.

I think we have to special-case that somehow, and I believe that's the
conditional logic that Robert is referring to.

That being said, there may be a simpler way to achieve the same result,
and that may be what you have in mind.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-04-30 Thread Simon Riggs
On 9 April 2013 08:36, Jeff Davis pg...@j-davis.com wrote:

 1. I believe that the issue I brought up at the end of this email:

 http://www.postgresql.org/message-id/1365035537.7580.380.camel@sussancws0025

 is a real issue. In lazy_vacuum_page(), the following sequence can
 happen when checksums are on:

a. PageSetAllVisible
b. Pass heap page to visibilitymap_set
c. visibilitymap_set logs the heap page and sets the LSN
d. MarkBufferDirty

 If a checkpoint happens between (c) and (d), then we have a problem. The
 fix is easy: just mark the heap buffer dirty first. There's another call
 site that looks like a potential problem, but I don't think it is. I
 simplified the code there to make it (hopefully) more clearly correct.


Applied

 2. A cleanup patch to pass the buffer_std flag down through
 MarkBufferDirtyHint. This is a matter of preference and purely cosmetic,
 so it might not be wanted. The reason I thought it was useful is that a
 future caller that sets a hint on a non-standard buffer might easily
 miss the assumption that we have a standard buffer.

Skipped that for now. Do we really need it? Can you set a hint on a
non-standard buffer?

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-04-30 Thread Robert Haas
On Tue, Apr 30, 2013 at 6:58 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On 9 April 2013 08:36, Jeff Davis pg...@j-davis.com wrote:

 1. I believe that the issue I brought up at the end of this email:

 http://www.postgresql.org/message-id/1365035537.7580.380.camel@sussancws0025

 is a real issue. In lazy_vacuum_page(), the following sequence can
 happen when checksums are on:

a. PageSetAllVisible
b. Pass heap page to visibilitymap_set
c. visibilitymap_set logs the heap page and sets the LSN
d. MarkBufferDirty

 If a checkpoint happens between (c) and (d), then we have a problem. The
 fix is easy: just mark the heap buffer dirty first. There's another call
 site that looks like a potential problem, but I don't think it is. I
 simplified the code there to make it (hopefully) more clearly correct.

 Applied

Uh, wait a minute.  I think this is completely wrong.  The buffer is
LOCKED for this entire sequence of operations.  For a checkpoint to
happen, it's got to write every buffer, which it will not be able to
do for so long as the buffer is locked.

The effect of the change to lazy_scan_heap is to force the buffer to
be written even if we're only updating the visibility map page.
That's a bad idea and should be reverted.  The change to
lazy_vacuum_page is harmless, but the comment is wrong.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-04-30 Thread Simon Riggs
On 30 April 2013 13:34, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Apr 30, 2013 at 6:58 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On 9 April 2013 08:36, Jeff Davis pg...@j-davis.com wrote:

 1. I believe that the issue I brought up at the end of this email:

 http://www.postgresql.org/message-id/1365035537.7580.380.camel@sussancws0025

 is a real issue. In lazy_vacuum_page(), the following sequence can
 happen when checksums are on:

a. PageSetAllVisible
b. Pass heap page to visibilitymap_set
c. visibilitymap_set logs the heap page and sets the LSN
d. MarkBufferDirty

 If a checkpoint happens between (c) and (d), then we have a problem. The
 fix is easy: just mark the heap buffer dirty first. There's another call
 site that looks like a potential problem, but I don't think it is. I
 simplified the code there to make it (hopefully) more clearly correct.

 Applied

 Uh, wait a minute.  I think this is completely wrong.

Thanks for your review.

 The buffer is
 LOCKED for this entire sequence of operations.  For a checkpoint to
 happen, it's got to write every buffer, which it will not be able to
 do for so long as the buffer is locked.

I was following the logic we use for WAL: mark the buffer dirty, then write WAL.

The important thing is for us to signal that the buffer should be
written, which we do by marking it dirty. The actual writing of the
buffer comes later, often minutes later.

 The effect of the change to lazy_scan_heap is to force the buffer to
 be written even if we're only updating the visibility map page.
 That's a bad idea and should be reverted.

OK, agreed. Will wait for this discussion to end before acting though,
so I'm sure exactly what you mean.

 The change to
 lazy_vacuum_page is harmless, but the comment is wrong.

In what way, wrong? What should it say?

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-04-30 Thread Jeff Davis
On Tue, 2013-04-30 at 08:34 -0400, Robert Haas wrote:
 Uh, wait a minute.  I think this is completely wrong.  The buffer is
 LOCKED for this entire sequence of operations.  For a checkpoint to
 happen, it's got to write every buffer, which it will not be able to
 do for so long as the buffer is locked.

I went back and forth on this, so you could be right, but here was my
reasoning:

I was worried because SyncOneBuffer checks whether it needs writing
without taking a content lock, so the exclusive lock doesn't help. That
makes sense, because you don't want a checkpoint to have to get a
content lock on every buffer in the buffer pool. But it also means we
need to follow the rules laid out in transam/README and dirty the pages
before writing WAL.
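For reference, the ordering that README prescribes is roughly this
(declarations and rdata setup elided; just a sketch, not the actual code):

    LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
    START_CRIT_SECTION();
    /* ... apply the change to the page ... */
    MarkBufferDirty(buffer);                    /* must happen before XLogInsert() */
    recptr = XLogInsert(rmid, info, rdata);
    PageSetLSN(BufferGetPage(buffer), recptr);
    END_CRIT_SECTION();
    LockBuffer(buffer, BUFFER_LOCK_UNLOCK);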

 The effect of the change to lazy_scan_heap is to force the buffer to
 be written even if we're only updating the visibility map page.
 That's a bad idea and should be reverted.

The only time the VM and the data page are out of sync during vacuum is
after a crash, right? If that's the case, I didn't think it was a big
deal to dirty one extra page (should be extremely rare). Am I missing
something?

The reason I removed that special case was just code
complexity/readability. I tried preserving the previous behavior, and
it's not so bad, but it seemed unnecessarily ugly for the benefit of a
rare case.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-04-11 Thread Simon Riggs
On 11 April 2013 00:37, Robert Haas robertmh...@gmail.com wrote:

 On Sat, Apr 6, 2013 at 10:44 AM, Andres Freund and...@2ndquadrant.com
 wrote:
  I feel pretty strongly that we shouldn't add any such complications to
  XLogInsert() itself, its complicated enough already and it should be
  made simpler, not more complicated.

 +1, emphatically.  XLogInsert is a really nasty scalability
 bottleneck.  We need to move as much logic out of that function as
 possible, and particularly out from under WALInsertLock.


Andres' patch was applied, so not sure what you mean by +1ing a comment
made in relation to that patch.

I'm aware that WALInsertLock is a bottleneck and am not going to be making
that worse.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-04-10 Thread Robert Haas
On Sat, Apr 6, 2013 at 10:44 AM, Andres Freund and...@2ndquadrant.com wrote:
 I feel pretty strongly that we shouldn't add any such complications to
 XLogInsert() itself, its complicated enough already and it should be
 made simpler, not more complicated.

+1, emphatically.  XLogInsert is a really nasty scalability
bottleneck.  We need to move as much logic out of that function as
possible, and particularly out from under WALInsertLock.

...Robert


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-04-09 Thread Jeff Davis
On Sat, 2013-04-06 at 16:44 +0200, Andres Freund wrote:
 I think we can just make up the rule that changing full page writes also
 requires SpinLockAcquire(xlogctl->info_lck);. Then it's easy enough. And
 it can hardly be a performance bottleneck given how infrequently it's
 modified.

That seems like a good idea to me. As it stands, checksums basically
force full page writes to be on; so we should either fix that or
document it.

 In retrospect I think making up the rule that full_page_writes changes
 imply a checkpoint would have made things easier performance and
 codewise.

I don't even see why we allow changing that while the server is running.
Either the I/O system requires it for safety or not, right?

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-04-09 Thread Jeff Davis
On Mon, 2013-04-08 at 09:19 +0100, Simon Riggs wrote:

 Applied, with this as the only code change.
 
 
 Thanks everybody for good research and coding and fast testing.
 
 
 We're in good shape now.

Thank you.

I have attached two more patches:

1. I believe that the issue I brought up at the end of this email:

http://www.postgresql.org/message-id/1365035537.7580.380.camel@sussancws0025

is a real issue. In lazy_vacuum_page(), the following sequence can
happen when checksums are on:

   a. PageSetAllVisible
   b. Pass heap page to visibilitymap_set
   c. visibilitymap_set logs the heap page and sets the LSN
   d. MarkBufferDirty

If a checkpoint happens between (c) and (d), then we have a problem. The
fix is easy: just mark the heap buffer dirty first. There's another call
site that looks like a potential problem, but I don't think it is. I
simplified the code there to make it (hopefully) more clearly correct.
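In code form, the fix is just to reorder the steps above (variable names are
approximate and visibilitymap_set()'s trailing arguments are elided):

    PageSetAllVisible(page);
    MarkBufferDirty(buffer);               /* (d) moved up: dirty the heap buffer first */
    visibilitymap_set(rel, blkno, buffer, ...);   /* (b)+(c): logs the heap page and sets its LSN */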

2. A cleanup patch to pass the buffer_std flag down through
MarkBufferDirtyHint. This is a matter of preference and purely cosmetic,
so it might not be wanted. The reason I thought it was useful is that a
future caller that sets a hint on a non-standard buffer might easily
miss the assumption that we have a standard buffer.

Regards,
Jeff Davis
*** a/src/backend/access/hash/hash.c
--- b/src/backend/access/hash/hash.c
***
*** 287,293  hashgettuple(PG_FUNCTION_ARGS)
  			/*
  			 * Since this can be redone later if needed, mark as a hint.
  			 */
! 			MarkBufferDirtyHint(buf);
  		}
  
  		/*
--- 287,293 
  			/*
  			 * Since this can be redone later if needed, mark as a hint.
  			 */
! 			MarkBufferDirtyHint(buf, true);
  		}
  
  		/*
*** a/src/backend/access/heap/pruneheap.c
--- b/src/backend/access/heap/pruneheap.c
***
*** 262,268  heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
  		{
  			((PageHeader) page)->pd_prune_xid = prstate.new_prune_xid;
  			PageClearFull(page);
! 			MarkBufferDirtyHint(buffer);
  		}
  	}
  
--- 262,268 
  		{
  			((PageHeader) page)->pd_prune_xid = prstate.new_prune_xid;
  			PageClearFull(page);
! 			MarkBufferDirtyHint(buffer, true);
  		}
  	}
  
*** a/src/backend/access/nbtree/nbtinsert.c
--- b/src/backend/access/nbtree/nbtinsert.c
***
*** 413,421  _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
  	 * crucial. Be sure to mark the proper buffer dirty.
  	 */
  	if (nbuf != InvalidBuffer)
! 		MarkBufferDirtyHint(nbuf);
  	else
! 		MarkBufferDirtyHint(buf);
  }
  			}
  		}
--- 413,421 
  	 * crucial. Be sure to mark the proper buffer dirty.
  	 */
  	if (nbuf != InvalidBuffer)
! 		MarkBufferDirtyHint(nbuf, true);
  	else
! 		MarkBufferDirtyHint(buf, true);
  }
  			}
  		}
*** a/src/backend/access/nbtree/nbtree.c
--- b/src/backend/access/nbtree/nbtree.c
***
*** 1052,1058  restart:
  opaque->btpo_cycleid == vstate->cycleid)
  			{
  opaque->btpo_cycleid = 0;
! MarkBufferDirtyHint(buf);
  			}
  		}
  
--- 1052,1058 
  opaque->btpo_cycleid == vstate->cycleid)
  			{
  opaque->btpo_cycleid = 0;
! MarkBufferDirtyHint(buf, true);
  			}
  		}
  
*** a/src/backend/access/nbtree/nbtutils.c
--- b/src/backend/access/nbtree/nbtutils.c
***
*** 1789,1795  _bt_killitems(IndexScanDesc scan, bool haveLock)
  	if (killedsomething)
  	{
  		opaque->btpo_flags |= BTP_HAS_GARBAGE;
! 		MarkBufferDirtyHint(so->currPos.buf);
  	}
  
  	if (!haveLock)
--- 1789,1795 
  	if (killedsomething)
  	{
  		opaque->btpo_flags |= BTP_HAS_GARBAGE;
! 		MarkBufferDirtyHint(so->currPos.buf, true);
  	}
  
  	if (!haveLock)
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***
*** 7661,7667  XLogRestorePoint(const char *rpName)
   * i.e. those for which buffer_std == true
   */
  XLogRecPtr
! XLogSaveBufferForHint(Buffer buffer)
  {
  	XLogRecPtr recptr = InvalidXLogRecPtr;
  	XLogRecPtr lsn;
--- 7661,7667 
   * i.e. those for which buffer_std == true
   */
  XLogRecPtr
! XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
  {
  	XLogRecPtr recptr = InvalidXLogRecPtr;
  	XLogRecPtr lsn;
***
*** 7683,7689  XLogSaveBufferForHint(Buffer buffer)
  	 * We reuse and reset rdata for any actual WAL record insert.
  	 */
  	rdata[0].buffer = buffer;
! 	rdata[0].buffer_std = true;
  
  	/*
  	 * Check buffer while not holding an exclusive lock.
--- 7683,7689 
  	 * We reuse and reset rdata for any actual WAL record insert.
  	 */
  	rdata[0].buffer = buffer;
! 	rdata[0].buffer_std = buffer_std;
  
  	/*
  	 * Check buffer while not holding an exclusive lock.
*** a/src/backend/commands/sequence.c
--- b/src/backend/commands/sequence.c
***
*** 1118,1124  read_seq_tuple(SeqTable elm, Relation rel, Buffer *buf, HeapTuple seqtuple)
  		HeapTupleHeaderSetXmax(seqtuple->t_data, InvalidTransactionId);
  		

Re: [HACKERS] corrupt pages detected by enabling checksums

2013-04-08 Thread Simon Riggs
On 6 April 2013 15:44, Andres Freund and...@2ndquadrant.com wrote:


  * In xlog_redo, it seemed slightly awkward to call XLogRecGetData twice.
  Merely a matter of preference but I thought I would mention it.

 You're absolutely right, memcpy should have gotten passed 'data', not
 XLogRecGetData().


Applied, with this as the only code change.

Thanks everybody for good research and coding and fast testing.

We're in good shape now.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-04-07 Thread Jaime Casanova
On Sat, Apr 6, 2013 at 1:36 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Fri, Apr 5, 2013 at 6:09 AM, Andres Freund and...@2ndquadrant.com
 wrote:


 How does the attached version look? I verified that it survives
 recovery, but not more.

 Jeff, any chance you can run this for a round with your suite?



 I've run it for a while now and have found no problems.


FWIW, I have run installcheck (serial and parallel) and isolationtest,
also a combination of those (one installcheck, one isolationtest) at the
same time, while executing vacuum full, reindex database and manual
checkpoint...

I also checked that the bgwriter was doing some work.

I did all of this on a master node in a cluster with a standby and a
cascading standby that were later promoted... and I have had no problem with
checksums at all, so I would say that the combination of Jeff's and
Andres' patches fixed the problems we have seen so far.

--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Phone: +593 4 5107566 Cell: +593 987171157


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-04-06 Thread Andres Freund
On 2013-04-05 16:29:47 -0700, Jeff Davis wrote:
 On Fri, 2013-04-05 at 15:09 +0200, Andres Freund wrote:
  How does the attached version look? I verified that it survives
  recovery, but not more.
 
 Comments:
 
 * Regarding full page writes, we can:
   - always write full pages (as in your current patch), regardless of
 the current settings
   - take WALInsertLock briefly to get the current settings from XLogCtl
   - always calculate the full page record, but in XLogInsert, if it
 happens to be a HINT record, and full page writes are not necessary,
 then discard it (right now I somewhat favor this option but I haven't
 looked at the details)

I feel pretty strongly that we shouldn't add any such complications to
XLogInsert() itself, its complicated enough already and it should be
made simpler, not more complicated.

I think we can just make up the rule that changing full page writes also
requires SpinLockAcquire(xlogctl->info_lck);. Then it's easy enough. And
it can hardly be a performance bottleneck given how infrequently it's
modified.
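Concretely, readers of the flag would do something like this (field names
from memory, so treat them as approximate; this is only a sketch of the
proposed rule, not current code):

    {
        /* use volatile pointer to prevent code rearrangement */
        volatile XLogCtlData *xlogctl = XLogCtl;
        bool        doPageWrites;

        SpinLockAcquire(&xlogctl->info_lck);
        doPageWrites = xlogctl->Insert.fullPageWrites || xlogctl->Insert.forcePageWrites;
        SpinLockRelease(&xlogctl->info_lck);
        /* ... use doPageWrites to decide whether a backup block is needed ... */
    }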

In retrospect I think making up the rule that full_page_writes changes
imply a checkpoint would have made things easier performance and
codewise.

 * typo in Backup blocks are not used in **xlog xlog** records

That's just me being funny after some long days ;). See, it's an 'xlog'
'xlog record'.

 * To get the appropriate setting for buffer_std, we should pass it down
 through MarkBufferDirty as an extra flag, I think.

Or just declare as part of the API that only std buffers are allowed to
be passed down. After a quick look it looks like all callers use enough
of the standard page layout to make that viable. But that really was
just a quick look.

 * I'm a bit worried that we'll need a cleanup lock when restoring an FSM
 page. What do you think?

I don't yet see why; while recovery is ongoing, nobody but the startup
process itself should be doing anything to it concurrently.

 * In xlog_redo, it seemed slightly awkward to call XLogRecGetData twice.
 Merely a matter of preference but I thought I would mention it.

You're absolutely right, memcpy should have gotten passed 'data', not
XLogRecGetData().

  Jeff, any chance you can run this for a round with your suite?
 
 Yes. I don't have a rigorous test suite, but I'll do some manual tests
 and walk through it with gdb.

Heh, in this and only this I was talking to Jeff Janes. Strangely, I had
never noticed that you share the same first name ;)

Thanks!

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-04-06 Thread Jeff Janes
On Fri, Apr 5, 2013 at 6:09 AM, Andres Freund and...@2ndquadrant.comwrote:


 How does the attached version look? I verified that it survives
 recovery, but not more.

 Jeff, any chance you can run this for a round with your suite?



I've run it for a while now and have found no problems.

Thanks,

Jeff


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-04-05 Thread Florian Pflug
On Apr4, 2013, at 23:21 , Jeff Janes jeff.ja...@gmail.com wrote:
 This brings up a pretty frightening possibility to me, unrelated to data 
 checksums.  If a bit gets twiddled in the WAL file due to a hardware issue or 
 a cosmic ray, and then a crash happens, automatic recovery will stop early 
 at the record with the failed WAL checksum, with an innocuous-looking message.  The system 
 will start up but will be invisibly inconsistent, and will proceed to 
 overwrite that portion of the WAL file which contains the old data (real data 
 that would have been necessary to reconstruct, once the corruption is finally 
 realized) with an end-of-recovery checkpoint record and continue to chew up 
 real data from there.

Maybe we could scan forward to check whether a corrupted WAL record is followed 
by one or more valid ones with sensible LSNs. If it is, chances are high that 
we haven't actually hit the end of the WAL. In that case, we could either log a 
warning, or (better, probably) abort crash recovery. The user would then need 
to either restore the broken WAL segment from backup, or override the check by 
e.g. setting recovery_target_record=invalid_record. (The default would be 
recovery_target_record=last_record. The name of the GUC tries to be 
consistent with existing recovery.conf settings, even though it affects crash 
recovery, not archive recovery.)
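A sketch of that scan-forward check, with invented helper names (nothing like
this exists today):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t Lsn;

    /* Stand-ins for illustration only. */
    extern Lsn  guess_next_record(Lsn failed_lsn);
    extern bool read_record_with_crc_check(Lsn lsn, Lsn *next_lsn);

    /*
     * Return true if one or more CRC-valid records follow the record that failed
     * its CRC check -- i.e. we probably did NOT hit the real end of WAL, and crash
     * recovery should abort (or warn) rather than silently start up.
     */
    static bool
    wal_continues_after(Lsn failed_lsn, int max_lookahead)
    {
        Lsn     lsn = guess_next_record(failed_lsn);
        int     nvalid = 0;

        while (nvalid < max_lookahead && read_record_with_crc_check(lsn, &lsn))
            nvalid++;

        return nvalid > 0;
    }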

Corruption of fields which we require to scan past the record would cause false 
negatives, i.e. not trigger an error even though we do abort recovery mid-way 
through. There's a risk of false positives too, but they require quite specific 
orderings of writes and thus seem rather unlikely. (AFAICS, the OS would have 
to write some parts of record N followed by the whole of record N+1 and then 
crash to cause a false positive).

best regards,
Florian Pflug



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-04-05 Thread Andres Freund
On 2013-04-04 17:39:16 -0700, Jeff Davis wrote:
 On Thu, 2013-04-04 at 22:39 +0200, Andres Freund wrote:
  I don't think it's really slower. Earlier the code took WalInsertLock
  every time, even if we ended up not logging anything. That's far more
  expensive than a single spinlock. And the copy should also only be taken
  in the case we need to log. So I think we end up ahead of the current
  state.
 
 Good point.
 
   The code looks good to me except that we should be consistent about the
   page hole -- XLogCheckBuffer is calculating it, but then we copy the
   entire page. I don't think anything can change the size of the page hole
   while we have a shared lock on the buffer, so it seems OK to skip the
   page hole during the copy.
  
  I don't think it can change either, but I doubt that there's a
  performance advantage by not copying the hole. I'd guess the simpler
  code ends up faster.
 
 I was thinking more about the WAL size, but I don't have a strong
 opinion.

I was just a bit dense. No idea what I missed there.

How does the attached version look? I verified that it survives
recovery, but not more.

Jeff, any chance you can run this for a round with your suite?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From 3c086daffb994665b23bc65a1017cb71abef17bf Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Fri, 5 Apr 2013 15:02:35 +0200
Subject: [PATCH] checksums: Log hint bit writes in a concurrency safe manner

The previous manner was racy since share locked pages - which
XLogSaveBufferForHint() gets passed - may still be modified. Namely further
hint bit updates and PageSetLSN() can be performed. That could lead to checksum
failures in WAL since the page changed between checksum computation and the
actual write to WAL.

Instead, and only if required, write a copy of the page as a normal
record. struct BkpBlock is still used so we can use most of the usual backup
block restoration code.
---
 src/backend/access/rmgrdesc/xlogdesc.c |6 +-
 src/backend/access/transam/xlog.c  |  233 ++--
 src/include/catalog/catversion.h   |2 +-
 3 files changed, 136 insertions(+), 105 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 52cf759..4c68b6a 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -17,6 +17,7 @@
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "catalog/pg_control.h"
+#include "common/relpath.h"
 #include "utils/guc.h"
 #include "utils/timestamp.h"
 
@@ -83,7 +84,10 @@ xlog_desc(StringInfo buf, uint8 xl_info, char *rec)
 	}
 	else if (info == XLOG_HINT)
 	{
-		appendStringInfo(buf, "page hint");
+		BkpBlock *bkp = (BkpBlock *) rec;
+		appendStringInfo(buf, "page hint: %s block %u",
+						 relpathperm(bkp->node, bkp->fork),
+						 bkp->block);
 	}
 	else if (info == XLOG_BACKUP_END)
 	{
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4a9462e..3e59f94 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -648,6 +648,8 @@ static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo);
 
 static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
 XLogRecPtr *lsn, BkpBlock *bkpb);
+static Buffer RestoreBackupBlockContents(XLogRecPtr lsn, BkpBlock bkpb,
+char *blk, bool get_cleanup_lock, bool keep_buffer);
 static bool AdvanceXLInsertBuffer(bool new_segment);
 static bool XLogCheckpointNeeded(XLogSegNo new_segno);
 static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch);
@@ -731,7 +733,6 @@ XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
 	bool		updrqst;
 	bool		doPageWrites;
 	bool		isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
-	bool		isHint = (rmid == RM_XLOG_ID && info == XLOG_HINT);
 	uint8		info_orig = info;
 	static XLogRecord *rechdr;
 
@@ -1002,18 +1003,6 @@ begin:;
 	}
 
 	/*
-	 * If this is a hint record and we don't need a backup block then
-	 * we have no more work to do and can exit quickly without inserting
-	 * a WAL record at all. In that case return InvalidXLogRecPtr.
-	 */
-	if (isHint && !(info & XLR_BKP_BLOCK_MASK))
-	{
-		LWLockRelease(WALInsertLock);
-		END_CRIT_SECTION();
-		return InvalidXLogRecPtr;
-	}
-
-	/*
 	 * If the current page is completely full, the record goes to the next
 	 * page, right after the page header.
 	 */
@@ -3158,8 +3147,6 @@ Buffer
 RestoreBackupBlock(XLogRecPtr lsn, XLogRecord *record, int block_index,
    bool get_cleanup_lock, bool keep_buffer)
 {
-	Buffer		buffer;
-	Page		page;
 	BkpBlock	bkpb;
 	char	   *blk;
 	int			i;
@@ -3177,42 +3164,8 @@ RestoreBackupBlock(XLogRecPtr lsn, XLogRecord *record, int block_index,
 		if (i == block_index)
 		{
 			/* Found it, apply the update */
-			buffer = XLogReadBufferExtended(bkpb.node, bkpb.fork, 

Re: [HACKERS] corrupt pages detected by enabling checksums

2013-04-05 Thread Jeff Davis
On Fri, 2013-04-05 at 15:09 +0200, Andres Freund wrote:
 How does the attached version look? I verified that it survives
 recovery, but not more.

Comments:

* Regarding full page writes, we can:
  - always write full pages (as in your current patch), regardless of
the current settings
  - take WALInsertLock briefly to get the current settings from XLogCtl
  - always calculate the full page record, but in XLogInsert, if it
happens to be a HINT record, and full page writes are not necessary,
then discard it (right now I somewhat favor this option but I haven't
looked at the details)

* typo in Backup blocks are not used in **xlog xlog** records

* To get the appropriate setting for buffer_std, we should pass it down
through MarkBufferDirty as an extra flag, I think.

* I'm a bit worried that we'll need a cleanup lock when restoring an FSM
page. What do you think?

* In xlog_redo, it seemed slightly awkward to call XLogRecGetData twice.
Merely a matter of preference but I thought I would mention it.

 Jeff, any chance you can run this for a round with your suite?

Yes. I don't have a rigorous test suite, but I'll do some manual tests
and walk through it with gdb.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-04-05 Thread Jeff Davis
On Fri, 2013-04-05 at 10:34 +0200, Florian Pflug wrote:
 Maybe we could scan forward to check whether a corrupted WAL record is
 followed by one or more valid ones with sensible LSNs. If it is,
 chances are high that we haven't actually hit the end of the WAL. In
 that case, we could either log a warning, or (better, probably) abort
 crash recovery.

+1.

 Corruption of fields which we require to scan past the record would
 cause false negatives, i.e. no trigger an error even though we do
 abort recovery mid-way through. There's a risk of false positives too,
 but they require quite specific orderings of writes and thus seem
 rather unlikely. (AFAICS, the OS would have to write some parts of
 record N followed by the whole of record N+1 and then crash to cause a
 false positive).

Does the xlp_pageaddr help solve this?

Also, we'd need to be a little careful when written-but-not-flushed WAL
data makes it to disk, which could cause a false positive and may be a
fairly common case.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-04-05 Thread Jaime Casanova
On Fri, Apr 5, 2013 at 8:09 AM, Andres Freund and...@2ndquadrant.com wrote:

 How does the attached version look? I verified that it survives
 recovery, but not more.


I still got errors when executing make installcheck in a just-compiled
9.3devel + this_patch; this is when setting wal_level higher than
minimal.
Attached regression.diffs

--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Phone: +593 4 5107566 Cell: +593 987171157


regression.diffs
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-04-05 Thread Jeff Davis
On Fri, 2013-04-05 at 19:22 -0500, Jaime Casanova wrote:
 On Fri, Apr 5, 2013 at 8:09 AM, Andres Freund and...@2ndquadrant.com wrote:
 
  How does the attached version look? I verified that it survives
  recovery, but not more.
 
 
 I still got errors when executing make installcheck in a just compiled
 9.3devel + this_patch, this is when setting wal_level higher than
 minimal.
 Attached regression.diffs

Did you also apply the other patch here:

http://www.postgresql.org/message-id/1364340203.21411.143.camel@sussancws0025

?

I think that is a completely separate issue. Simon, should that patch be
applied or are you waiting on the issue Jeff Janes raised to be
resolved, as well?

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-04-05 Thread Jaime Casanova
On Fri, Apr 5, 2013 at 7:39 PM, Jeff Davis pg...@j-davis.com wrote:
 On Fri, 2013-04-05 at 19:22 -0500, Jaime Casanova wrote:
 On Fri, Apr 5, 2013 at 8:09 AM, Andres Freund and...@2ndquadrant.com wrote:
 
  How does the attached version look? I verified that it survives
  recovery, but not more.
 

 I still got errors when executing make installcheck in a just compiled
 9.3devel + this_patch, this is when setting wal_level higher than
 minimal.
 Attached regression.diffs

 Did you also apply the other patch here:

 http://www.postgresql.org/message-id/1364340203.21411.143.camel@sussancws0025


I missed that one... with both applied, the first problems disappear...
will try some more tests later.

--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Phone: +593 4 5107566 Cell: +593 987171157


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-04-04 Thread Simon Riggs
On 4 April 2013 02:39, Andres Freund and...@2ndquadrant.com wrote:

 Ok, I think I see the bug. And I think its been introduced in the
 checkpoints patch.

Well spotted. (I think you mean checksums patch).

 If by now the first backend has proceeded to PageSetLSN() we are writing
 different data to disk than the one we computed the checksum of
 before. Boom.

Right, so nothing else we were doing was wrong, that's why we couldn't
spot a bug. The problem is that we aren't replaying enough WAL because
the checksum on the WAL record is broken.

 I think the whole locking interactions in MarkBufferDirtyHint() need to
 be thought over pretty carefully.

When we write out a buffer with checksums enabled, we take a copy of
the buffer so that the checksum is consistent, even while other
backends may be writing hints to the same buffer.

I missed out on doing that with XLOG_HINT records, so the WAL CRC can
be incorrect because the data is scanned twice; normally that would be
OK because we have an exclusive lock on the block, but with hints we
only have share lock. So what we need to do is take a copy of the
buffer before we do XLogInsert().

Simple patch to do this attached for discussion. (Not tested).

We might also do this by modifying the WAL record to take the whole
block and bypass the BkpBlock mechanism entirely. But that's more work
and doesn't seem like it would be any cleaner. I figure let's solve the
problem first, then discuss which approach is best.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


copy_before_XLOG_HINT.v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-04-04 Thread Andres Freund
On 2013-04-04 13:30:40 +0100, Simon Riggs wrote:
 On 4 April 2013 02:39, Andres Freund and...@2ndquadrant.com wrote:

  Ok, I think I see the bug. And I think its been introduced in the
  checkpoints patch.

 Well spotted. (I think you mean checksums patch).

Heh, yes. I was slightly tired at that point ;)

  If by now the first backend has proceeded to PageSetLSN() we are writing
  different data to disk than the one we computed the checksum of
  before. Boom.

 Right, so nothing else we were doing was wrong, that's why we couldn't
 spot a bug. The problem is that we aren't replaying enough WAL because
 the checksum on the WAL record is broke.

Well, there might be other bugs we can't see yet... But let's hope not.

  I think the whole locking interactions in MarkBufferDirtyHint() need to
  be thought over pretty carefully.

 When we write out a buffer with checksums enabled, we take a copy of
 the buffer so that the checksum is consistent, even while other
 backends may be writing hints to the same bufer.

 I missed out on doing that with XLOG_HINT records, so the WAL CRC can
 be incorrect because the data is scanned twice; normally that would be
 OK because we have an exclusive lock on the block, but with hints we
 only have share lock. So what we need to do is take a copy of the
 buffer before we do XLogInsert().

 Simple patch to do this attached for discussion. (Not tested).

 We might also do this by modifying the WAL record to take the whole
 block and bypass the BkpBlock mechanism entirely. But that's more work
 and doesn't seem like it would be any cleaner. I figure lets solve the
 problem first then discuss which approach is best.

Unfortunately I find that approach unacceptably ugly. I don't think it's
ok to make an already *very* complicated function (XLogInsert()) in one
of the most complicated files (xlog.c) even more complicated. It
literally took me years to understand a large percentage of it.
I even think the XLOG_HINT escape hatch should be taken out and be
replaced by something outside of XLogInsert().

Don't think that approach would work with as few lines anyway, since you
need to properly pass it into XLogCheckBuffer() et al which checks the
buffer tags and such - which it needs to properly compute the struct
BkpBlock. Also we iterate over rdata & page for the CRC computation.

I think the route you quickly sketched is more realistic. That would
remove all knowledge about XLOG_HINT from generic code, which is a very
good thing. I spent like 15 minutes yesterday wondering whether the early
return in there might be the cause of the bug...

Something like:

XLogRecPtr
XLogSaveBufferForHint(Buffer buffer)
{
	XLogRecPtr	recptr = InvalidXLogRecPtr;
	XLogRecPtr	lsn;
	XLogRecData rdata[2];
	BkpBlock	bkp;

	/*
	 * make sure no checkpoint can happen while we decide about logging
	 * something or not, since we don't hold any lock preventing that
	 * otherwise.
	 */
	Assert(MyPgXact->delayChkpt);

	/* update RedoRecPtr */
	GetRedoRecPtr();

	/* setup phony rdata element */
	rdata[0].buffer = buffer;
	rdata[0].buffer_std = true; /* is that actually ok? */

	/* force full pages on, otherwise checksums won't work? */
	if (XLogCheckBuffer(rdata, true, &lsn, &bkp))
	{
		char		copied_buffer[BLCKSZ];

		/*
		 * copy buffer so we don't have to worry about concurrent hint bit
		 * or lsn updates. We assume pd_lower/upper cannot be changed
		 * without an exclusive lock, so the contents & bkp are not racy.
		 */
		memcpy(copied_buffer, BufferGetBlock(buffer), BLCKSZ);

		rdata[0].data = (char *) &bkp;
		rdata[0].len = sizeof(BkpBlock);
		rdata[0].buffer = InvalidBuffer;
		rdata[0].buffer_std = false;
		rdata[0].next = &(rdata[1]);

		rdata[1].data = copied_buffer;
		rdata[1].len = BLCKSZ;
		rdata[1].buffer = InvalidBuffer;
		rdata[1].buffer_std = false;
		rdata[1].next = NULL;

		recptr = XLogInsert(RM_XLOG_ID, XLOG_HINT, rdata);
	}

	return recptr;
}

That should work?

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-04-04 Thread Simon Riggs
On 4 April 2013 15:53, Andres Freund and...@2ndquadrant.com wrote:

 Unfortunately I find that approach unacceptably ugly.

Yeh. If we can confirm it's a fix we can discuss a cleaner patch and
that is much better.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-04-04 Thread Jeff Davis
Andres,

Thank you for diagnosing this problem!

On Thu, 2013-04-04 at 16:53 +0200, Andres Freund wrote:
 I think the route you quickly sketched is more realistic. That would
 remove all knowledge obout XLOG_HINT from generic code hich is a very
 good thing, I spent like 15minutes yesterday wondering whether the early
 return in there might be the cause of the bug...

I like this approach. It may have some performance impact though,
because there are a couple extra spinlocks taken, and an extra memcpy.
The code looks good to me except that we should be consistent about the
page hole -- XLogCheckBuffer is calculating it, but then we copy the
entire page. I don't think anything can change the size of the page hole
while we have a shared lock on the buffer, so it seems OK to skip the
page hole during the copy.

Another possible approach is to drop the lock on the buffer and
re-acquire it in exclusive mode after we find out we'll need to do
XLogInsert. It means that MarkBufferDirtyHint may end up blocking for
longer, but would avoid the memcpy. I haven't really thought through the
details.
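
Very roughly, the shape would be something like the sketch below. It glosses
over the re-checks of the buffer state that would be needed after the lock
dance, and it reuses the XLogSaveBufferForHint() idea from elsewhere in this
thread, so treat it as a sketch under those assumptions rather than a
worked-out patch:

	/* caller holds only a share lock on the buffer content */
	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);

	/*
	 * Somebody else may have dirtied or WAL-logged the page while we held
	 * no lock at all, so a real patch would have to re-check the buffer
	 * flags here before logging anything.
	 */
	lsn = XLogSaveBufferForHint(buffer);
	if (!XLogRecPtrIsInvalid(lsn))
		PageSetLSN(BufferGetPage(buffer), lsn);

	/* put the lock back the way the caller expects it */
	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
	LockBuffer(buffer, BUFFER_LOCK_SHARE);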

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-04-04 Thread Andres Freund
On 2013-04-04 12:59:36 -0700, Jeff Davis wrote:
 Andres,
 
 Thank you for diagnosing this problem!
 
 On Thu, 2013-04-04 at 16:53 +0200, Andres Freund wrote:
  I think the route you quickly sketched is more realistic. That would
  remove all knowledge obout XLOG_HINT from generic code hich is a very
  good thing, I spent like 15minutes yesterday wondering whether the early
  return in there might be the cause of the bug...
 
 I like this approach. It may have some performance impact though,
 because there are a couple extra spinlocks taken, and an extra memcpy.

I don't think it's really slower. Earlier the code took WALInsertLock
every time, even if we ended up not logging anything. That's far more
expensive than a single spinlock. And the copy should also only be taken
in the case we need to log. So I think we end up ahead of the current
state.

 The code looks good to me except that we should be consistent about the
 page hole -- XLogCheckBuffer is calculating it, but then we copy the
 entire page. I don't think anything can change the size of the page hole
 while we have a shared lock on the buffer, so it seems OK to skip the
 page hole during the copy.

I don't think it can change either, but I doubt that there's a
performance advantage by not copying the hole. I'd guess the simpler
code ends up faster.

 Another possible approach is to drop the lock on the buffer and
 re-acquire it in exclusive mode after we find out we'll need to do
 XLogInsert. It means that MarkBufferDirtyHint may end up blocking for
 longer, but would avoid the memcpy. I haven't really thought through the
 details.

That sounds like it would be prone to deadlocks. So I would dislike to
go there.

Will write up a patch tomorrow.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-04-04 Thread Jeff Janes
On Thu, Apr 4, 2013 at 5:30 AM, Simon Riggs si...@2ndquadrant.com wrote:

 On 4 April 2013 02:39, Andres Freund and...@2ndquadrant.com wrote:

  If by now the first backend has proceeded to PageSetLSN() we are writing
  different data to disk than the one we computed the checksum of
  before. Boom.

 Right, so nothing else we were doing was wrong, that's why we couldn't
 spot a bug. The problem is that we aren't replaying enough WAL because
 the checksum on the WAL record is broke.


This brings up a pretty frightening possibility to me, unrelated to data
checksums.  If a bit gets twiddled in the WAL file due to a hardware issue
or a cosmic ray, and then a crash happens, automatic recovery will stop
early at the failed WAL checksum with an innocuous-looking message.  The
system will start up but will be invisibly inconsistent, and will proceed
to overwrite that portion of the WAL file which contains the old data (real
data that would have been necessary to reconstruct, once the corruption is
finally realized ) with an end-of-recovery checkpoint record and continue
to chew up real data from there.

I don't know a solution here, though, other than trusting your hardware.
 Changing timelines upon ordinary crash recovery, not just media recovery,
seems excessive but also seems to be exactly what timelines were invented
for, right?

Back to the main topic here, Jeff Davis mentioned earlier "You'd still
think this would cause incorrect results, but ..." I didn't realize the
significance of that until now.  It does produce incorrect query results.
 I was just bailing out before detecting them.  Once I specify
ignore_checksum_failure=1
my test harness complains bitterly about the data not being consistent with
what the Perl program knows it is supposed to be.


 I missed out on doing that with XLOG_HINT records, so the WAL CRC can
 be incorrect because the data is scanned twice; normally that would be
 OK because we have an exclusive lock on the block, but with hints we
 only have share lock. So what we need to do is take a copy of the
 buffer before we do XLogInsert().

 Simple patch to do this attached for discussion. (Not tested).


 We might also do this by modifying the WAL record to take the whole
 block and bypass the BkpBlock mechanism entirely. But that's more work
 and doesn't seem like it would be any cleaner. I figure lets solve the
 problem first then discuss which approach is best.



I've tested your patch and it seems to do the job.  It has survived far
longer than unpatched ever did, with neither checksum warnings, nor
complaints of inconsistent data.  (I haven't analyzed the code much, just
the results, and leave the discussion of the best approach to others)


 Thanks,

Jeff


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-04-04 Thread Jeff Davis
On Thu, 2013-04-04 at 14:21 -0700, Jeff Janes wrote:

 This brings up a pretty frightening possibility to me, unrelated to
 data checksums.  If a bit gets twiddled in the WAL file due to a
 hardware issue or a cosmic ray, and then a crash happens, automatic
 recovery will stop early with the failed WAL checksum with
 an innocuous looking message.  The system will start up but will be
 invisibly inconsistent, and will proceed to overwrite that portion of
 the WAL file which contains the old data (real data that would have
 been necessary to reconstruct, once the corruption is finally realized
 ) with an end-of-recovery checkpoint record and continue to chew up
 real data from there.

I've been worried about that for a while, and I may have even seen
something like this happen before. We could perhaps do some checks, but
in general it seems hard to solve without writing & flushing some data to
two different places. For example, you could flush WAL, and then update
an LSN stored somewhere else indicating how far the WAL has been
written. Recovery could complain if it gets an error in the WAL before
that point.

But obviously, that makes WAL flushes expensive (in many cases, about
twice as expensive).
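
In toy form the bookkeeping would amount to something like the following.
The side-file name and both helper functions are invented for the example;
a real version would hang off XLogFlush() and use the usual fsync machinery:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

#define TRUSTED_LSN_FILE "pg_trusted_lsn"	/* hypothetical side file */

/* After successfully flushing WAL up to 'flushed', remember that fact. */
static bool
record_flushed_lsn(XLogRecPtr flushed)
{
	FILE	   *f = fopen(TRUSTED_LSN_FILE, "wb");

	if (f == NULL)
		return false;
	if (fwrite(&flushed, sizeof(flushed), 1, f) != 1)
	{
		fclose(f);
		return false;
	}
	fflush(f);				/* a real implementation would fsync() here */
	fclose(f);
	return true;
}

/*
 * During crash recovery: a CRC failure strictly before the remembered LSN
 * means the WAL itself was damaged, so complain loudly instead of treating
 * it as the normal end of WAL.
 */
static bool
crc_failure_is_alarming(XLogRecPtr failure_lsn)
{
	XLogRecPtr	trusted = 0;
	FILE	   *f = fopen(TRUSTED_LSN_FILE, "rb");

	if (f == NULL)
		return false;		/* no side file: fall back to today's behaviour */
	if (fread(&trusted, sizeof(trusted), 1, f) != 1)
		trusted = 0;
	fclose(f);
	return failure_lsn < trusted;
}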

Maybe it's not out of the question to offer that as an option if nobody
has a better idea. Performance-conscious users could place the extra LSN
on an SSD or NVRAM or something; or maybe use commit_delay or async
commits. It would only need to store a few bytes.

Streaming replication mitigates the problem somewhat, by being a second
place to write WAL.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-04-04 Thread Jeff Davis
On Thu, 2013-04-04 at 22:39 +0200, Andres Freund wrote:
 I don't think its really slower. Earlier the code took WalInsertLock
 everytime, even if we ended up not logging anything. Thats far more
 epensive than a single spinlock. And the copy should also only be taken
 in the case we need to log. So I think we end up ahead of the current
 state.

Good point.

  The code looks good to me except that we should be consistent about the
  page hole -- XLogCheckBuffer is calculating it, but then we copy the
  entire page. I don't think anything can change the size of the page hole
  while we have a shared lock on the buffer, so it seems OK to skip the
  page hole during the copy.
 
 I don't think it can change either, but I doubt that there's a
 performance advantage by not copying the hole. I'd guess the simpler
 code ends up faster.

I was thinking more about the WAL size, but I don't have a strong
opinion.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-04-04 Thread Tom Lane
Jeff Davis pg...@j-davis.com writes:
 On Thu, 2013-04-04 at 14:21 -0700, Jeff Janes wrote:
 This brings up a pretty frightening possibility to me, unrelated to
 data checksums.  If a bit gets twiddled in the WAL file due to a
 hardware issue or a cosmic ray, and then a crash happens, automatic
 recovery will stop early with the failed WAL checksum with
 an innocuous looking message.  The system will start up but will be
 invisibly inconsistent, and will proceed to overwrite that portion of
 the WAL file which contains the old data (real data that would have
 been necessary to reconstruct, once the corruption is finally realized
 ) with an end-of-recovery checkpoint record and continue to chew up
 real data from there.

 I've been worried about that for a while, and I may have even seen
 something like this happen before. We could perhaps do some checks, but
 in general it seems hard to solve without writing flushing some data to
 two different places. For example, you could flush WAL, and then update
 an LSN stored somewhere else indicating how far the WAL has been
 written. Recovery could complain if it gets an error in the WAL before
 that point.

 But obviously, that makes WAL flushes expensive (in many cases, about
 twice as expensive).

 Maybe it's not out of the question to offer that as an option if nobody
 has a better idea. Performance-conscious users could place the extra LSN
 on an SSD or NVRAM or something; or maybe use commit_delay or async
 commits. It would only need to store a few bytes.

At least on traditional rotating media, this is only going to perform
well if you dedicate two drives to the purpose.  At which point you
might as well just say let's write two copies of WAL.  Or maybe three
copies, so that you can take a vote when they disagree.  While this is
not so unreasonable on its face for ultra-high-reliability requirements,
I can't escape the feeling that we'd just be reinventing software RAID.
There's no reason to think that we can deal with this class of problems
better than the storage system can.

 Streaming replication mitigates the problem somewhat, by being a second
 place to write WAL.

Yeah, if you're going to do this at all it makes more sense for the
redundant copies to be on other machines.  So the questions that that
leads to are how smart is our SR code about dealing with a master that
tries to re-send WAL regions it already sent, and what a slave ought to
do in such a situation if the new data doesn't match.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-04-04 Thread Jeff Davis
On Thu, 2013-04-04 at 21:06 -0400, Tom Lane wrote:
 I can't escape the feeling that we'd just be reinventing software RAID.
 There's no reason to think that we can deal with this class of problems
 better than the storage system can.

The goal would be to reliably detect a situation where WAL that has been
flushed successfully was later corrupted; not necessarily to recover
when we find it. Unless it's something like ZFS, I don't think most
software RAID will reliably detect corruption.

A side benefit is that it would also help catch bugs like this in the
future.

I'm not advocating for this particular solution, but I do concur with
Jeff Janes that it's a little bit scary and that we can entertain some
ideas about how to mitigate it.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-04-03 Thread Andres Freund
On 2013-04-03 15:57:49 -0700, Jeff Janes wrote:
 I've changed the subject from "regression test failed when enabling
 checksum" because I now know they are totally unrelated.
 
 My test case didn't need to depend on archiving being on, and so with a
 simple tweak I rendered the two issues orthogonal.
 
 
 On Wed, Apr 3, 2013 at 12:15 PM, Jeff Davis pg...@j-davis.com wrote:
 
  On Mon, 2013-04-01 at 19:51 -0700, Jeff Janes wrote:
 
   What I would probably really want is the data as it existed after the
   crash but before recovery started, but since the postmaster
   immediately starts recovery after the crash, I don't know of a good
   way to capture this.
 
  Can you just turn off the restart_after_crash GUC? I had a chance to
  look at this, and seeing the block before and after recovery would be
  nice. I didn't see a log file in the data directory, but it didn't go
  through recovery, so I assume it already did that.
 
 
 You don't know that the cluster is in the bad state until after it goes
 through recovery because most crashes recover perfectly fine.  So it would
 have to make a side-copy of the cluster after the crash, then recover the
 original and see how things go, then either retain or delete the side-copy.
  Unfortunately my testing harness can't do this at the moment, because the
 perl script storing the consistency info needs to survive over the crash
 and recovery.   It might take me awhile to figure out how to make it do
 this.
 
 I have the server log just go to stderr, where it gets mingled together
 with messages from my testing harness.  I had not uploaded that file.
 
 Here is a new upload. It contains both a data_dir tarball including xlog,
 and the mingled stderr (do_new.out)
 
 https://drive.google.com/folderview?id=0Bzqrh1SO9FcEQmVzSjlmdWZvUHcusp=sharing
 
 The other 3 files in it constitute the harness.  It is a quite a mess, with
 some hard-coded paths.  The just-posted fix for wal_keep_segments will also
 have to be applied.
 
 
 
 
  The block is corrupt as far as I can tell. The first third is written,
  and the remainder is all zeros. The header looks like this:
 
 
 Yes, that part is by my design.  Why it didn't get fixed from a FPI is not
  by my design, of course.

There are no FPIs (if you mean a full page image with that) afaics:

Your logs tell us about recovery:
27692  2013-04-03 10:09:15.621 PDT:LOG:  redo starts at 1/3190
27692  2013-04-03 10:09:15.647 PDT:LOG:  incorrect resource manager data 
checksum in record at 1/31169A68
27692  2013-04-03 10:09:15.647 PDT:LOG:  redo done at 1/31169A38

And then the error:

27698 SELECT 2013-04-03 10:09:16.688 PDT:WARNING:  page verification failed, 
calculated checksum 16296 but expected 49517
27698 SELECT 2013-04-03 10:09:16.688 PDT:ERROR:  invalid page in block 90 of 
relation base/16384/835589


looking at xlog from that time, the first records are:
rmgr: XLOGlen (rec/tot): 72/   104, tx:  0, lsn: 
1/3128, prev 1/3090, bkp: , desc: checkpoint: redo 1/3128; tli 
1; prev tli 1; fpw true; xid 0/26254997; oid 835589; multi 1; offset 0; oldest 
xid 1799 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; online
rmgr: XLOGlen (rec/tot):  4/36, tx:  0, lsn: 
1/3190, prev 1/3128, bkp: , desc: nextOid: 843781
rmgr: Storage len (rec/tot): 16/48, tx:  0, lsn: 
1/31B8, prev 1/3190, bkp: , desc: file create: base/16384/835589
rmgr: Heaplen (rec/tot): 37/  7905, tx:   26254997, lsn: 
1/31E8, prev 1/31B8, bkp: 1000, desc: hot_update: rel 1663/16384/12660; 
tid 0/47 xmax 26254997 ; new tid 0/33 xmax 0

So the file was created after the last checkpoint, so no full pages need
to be logged. And we theoretically should be able to recover all things
like torn pages from the WAL since that should cover everything that
happened to that file.

The bkp block 1 indicated in the 4th record above is the only one in
that period of WAL btw.

Looking a bit more, the last page that's covered by WAL is:
rmgr: Heaplen (rec/tot): 35/67, tx:   26254998, lsn: 
1/311699A8, prev 1/31169960, bkp: , desc: insert: rel 1663/16384/835589; 
tid 44/56

which is far earlier than block 90, where the error is found, unless I
misunderstand something. Which is strange, since non-zero content to
those pages shouldn't go to disk until the respective WAL is
flushed. Huh, are we missing something here?
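
To spell out the rule being appealed to: a buffer must not be written out
before WAL has been flushed up to that page's LSN. The ModelPage struct and
the model_* functions below are only a stand-in for what FlushBuffer() does,
not the actual bufmgr code:

#include <stdint.h>

typedef uint64_t XLogRecPtr;

typedef struct ModelPage
{
	XLogRecPtr	pd_lsn;		/* LSN of the last WAL record touching the page */
	/* ... tuple data ... */
} ModelPage;

static XLogRecPtr wal_flushed_upto = 0;

/* stand-in for XLogFlush(): make WAL durable at least up to 'upto' */
static void
model_xlog_flush(XLogRecPtr upto)
{
	if (upto > wal_flushed_upto)
		wal_flushed_upto = upto;	/* the real code fsyncs pg_xlog here */
}

/* stand-in for FlushBuffer(): WAL first, then the data page */
static void
model_flush_buffer(const ModelPage *page)
{
	model_xlog_flush(page->pd_lsn);	/* WAL-before-data rule */
	/* ... then hand the page to smgrwrite() ... */
}

So a heap page carrying LSN 1/321d1790 should never have reached disk while
WAL was only flushed up to 1/31169A38.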

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-04-03 Thread Andres Freund
On 2013-04-04 01:52:41 +0200, Andres Freund wrote:
 On 2013-04-03 15:57:49 -0700, Jeff Janes wrote:
  I've changed the subject from regression test failed when enabling
  checksum because I now know they are totally unrelated.
  
  My test case didn't need to depend on archiving being on, and so with a
  simple tweak I rendered the two issues orthogonal.
  
  
  On Wed, Apr 3, 2013 at 12:15 PM, Jeff Davis pg...@j-davis.com wrote:
  
   On Mon, 2013-04-01 at 19:51 -0700, Jeff Janes wrote:
  
What I would probably really want is the data as it existed after the
crash but before recovery started, but since the postmaster
immediately starts recovery after the crash, I don't know of a good
way to capture this.
  
   Can you just turn off the restart_after_crash GUC? I had a chance to
   look at this, and seeing the block before and after recovery would be
   nice. I didn't see a log file in the data directory, but it didn't go
   through recovery, so I assume it already did that.
  
  
  You don't know that the cluster is in the bad state until after it goes
  through recovery because most crashes recover perfectly fine.  So it would
  have to make a side-copy of the cluster after the crash, then recover the
  original and see how things go, then either retain or delete the side-copy.
   Unfortunately my testing harness can't do this at the moment, because the
  perl script storing the consistency info needs to survive over the crash
  and recovery.   It might take me awhile to figure out how to make it do
  this.
  
  I have the server log just go to stderr, where it gets mingled together
  with messages from my testing harness.  I had not uploaded that file.
  
  Here is a new upload. It contains both a data_dir tarball including xlog,
  and the mingled stderr (do_new.out)
  
  https://drive.google.com/folderview?id=0Bzqrh1SO9FcEQmVzSjlmdWZvUHcusp=sharing
  
  The other 3 files in it constitute the harness.  It is a quite a mess, with
  some hard-coded paths.  The just-posted fix for wal_keep_segments will also
  have to be applied.
  
  
  
  
   The block is corrupt as far as I can tell. The first third is written,
   and the remainder is all zeros. The header looks like this:
  
  
  Yes, that part is by my design.  Why it didn't get fixed from a FPI is not
  by my design, or course.
 
 There are no FPIs (if you mean a full page image with that) afaics:
 
 Your logs tell us about recovery:
 27692  2013-04-03 10:09:15.621 PDT:LOG:  redo starts at 1/3190
 27692  2013-04-03 10:09:15.647 PDT:LOG:  incorrect resource manager data 
 checksum in record at 1/31169A68
 27692  2013-04-03 10:09:15.647 PDT:LOG:  redo done at 1/31169A38
 
 And then the error:
 
 27698 SELECT 2013-04-03 10:09:16.688 PDT:WARNING:  page verification failed, 
 calculated checksum 16296 but expected 49517
 27698 SELECT 2013-04-03 10:09:16.688 PDT:ERROR:  invalid page in block 90 of 
 relation base/16384/835589
 
 
 looking at xlog from that time, the first records are:
 rmgr: XLOGlen (rec/tot): 72/   104, tx:  0, lsn: 
 1/3128, prev 1/3090, bkp: , desc: checkpoint: redo 1/3128; 
 tli 1; prev tli 1; fpw true; xid 0/26254997; oid 835589; multi 1; offset 0; 
 oldest xid 1799 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; online
 rmgr: XLOGlen (rec/tot):  4/36, tx:  0, lsn: 
 1/3190, prev 1/3128, bkp: , desc: nextOid: 843781
 rmgr: Storage len (rec/tot): 16/48, tx:  0, lsn: 
 1/31B8, prev 1/3190, bkp: , desc: file create: base/16384/835589
 rmgr: Heaplen (rec/tot): 37/  7905, tx:   26254997, lsn: 
 1/31E8, prev 1/31B8, bkp: 1000, desc: hot_update: rel 
 1663/16384/12660; tid 0/47 xmax 26254997 ; new tid 0/33 xmax 0
 
 So the file was created after the last checkpoint, so no full pages need
 to be logged. And we theoretically should be able to recovery all things
 like torn pages from the WAL since that should cover everything that
 happened to that file.
 
 The bkp block 1 indicated in the 4th record above is the only one in
 that period of WAL btw.
 
 Looking a bit more, the last page that's covered by WAL is:
 rmgr: Heaplen (rec/tot): 35/67, tx:   26254998, lsn: 
 1/311699A8, prev 1/31169960, bkp: , desc: insert: rel 1663/16384/835589; 
 tid 44/56
 
 which is far earlier than the block 90 the error is found in unless i
 misunderstand something. Which is strange, since non-zero content to
 those pages shouldn't go to disk until the respective WAL is
 flushed. Huh, are we missing something here?

Looking at the page lsn's with dd I noticed something peculiar:

page 0:
01 00 00 00 18 c2 00 31 = 1/3100C218
page 1:
01 00 00 00 80 44 01 31 = 1/31014480
page 10:
01 00 00 00 60 ce 05 31 = 1/3105ce60
page 43:
01 00 00 00 58 7a 16 31 = 1/31167a58
page 44:
01 00 00 00 f0 99 16 31 = 1/311699f0
page 45:
00 00 00 00 00 00 00 00 = 0/0
page 90:
01 00 00 00 90 17 1d 32 = 1/321d1790
page 
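
For anyone decoding these by hand: the eight bytes are the page's pd_lsn,
stored on this little-endian box as a 32-bit xlogid followed by a 32-bit
xrecoff.  A throwaway decoder, purely for illustration:

#include <stdint.h>
#include <stdio.h>

/* Turn the 8 raw pd_lsn bytes from dd into the usual X/X notation. */
static void
print_page_lsn(const unsigned char b[8])
{
	uint32_t	xlogid = (uint32_t) b[0] | ((uint32_t) b[1] << 8) |
		((uint32_t) b[2] << 16) | ((uint32_t) b[3] << 24);
	uint32_t	xrecoff = (uint32_t) b[4] | ((uint32_t) b[5] << 8) |
		((uint32_t) b[6] << 16) | ((uint32_t) b[7] << 24);

	printf("%X/%08X\n", xlogid, xrecoff);
}

int
main(void)
{
	/* page 90 from the list above */
	const unsigned char page90[8] = {0x01, 0x00, 0x00, 0x00,
									 0x90, 0x17, 0x1d, 0x32};

	print_page_lsn(page90);		/* prints 1/321D1790 */
	return 0;
}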

Re: [HACKERS] corrupt pages detected by enabling checksums

2013-04-03 Thread Jeff Davis
 On Wed, 2013-04-03 at 15:57 -0700, Jeff Janes wrote:



 You don't know that the cluster is in the bad state until after it
 goes through recovery because most crashes recover perfectly fine.  So
 it would have to make a side-copy of the cluster after the crash, then
 recover the original and see how things go, then either retain or
 delete the side-copy.  Unfortunately my testing harness can't do this
 at the moment, because the perl script storing the consistency info
 needs to survive over the crash and recovery.   It might take
 me awhile to figure out how to make it do this.

Hmm... you're right, that is difficult to integrate into a test.

Another approach is to expand PageHeaderIsValid in a pre-checksums
version to do some extra checks (perhaps the same checks as pg_filedump
does). It might be able to at least detect simple cases, like all zeros
between pd_upper and pd_special (when pd_upper < pd_special). Still not
easy, but maybe more practical than saving the directory like that. We
may even want a GUC for that to aid future debugging (obviously it would
have a performance impact).
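
To illustrate the shape of that check, here is a sketch against a cut-down
header rather than the real PageHeaderData (the ModelPageHeader type and
MODEL_BLCKSZ are made up for the example): on a sane page the area between
pd_upper and pd_special holds tuple data, so finding it entirely zeroed is a
strong hint that the tail of the page was lost.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_BLCKSZ 8192

typedef struct ModelPageHeader
{
	uint16_t	pd_lower;
	uint16_t	pd_upper;
	uint16_t	pd_special;
} ModelPageHeader;

/* Return true if the tuple area looks suspiciously like it was zeroed. */
static bool
page_tail_looks_zeroed(const unsigned char *page)
{
	const ModelPageHeader *hdr = (const ModelPageHeader *) page;
	size_t		i;

	if (hdr->pd_upper >= hdr->pd_special || hdr->pd_special > MODEL_BLCKSZ)
		return false;		/* nothing to check, or the header itself is bogus */

	for (i = hdr->pd_upper; i < hdr->pd_special; i++)
	{
		if (page[i] != 0)
			return false;	/* real data present, page looks fine */
	}
	return true;			/* entire tuple area zeroed: suspicious */
}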

 The other 3 files in it constitute the harness.  It is a quite a mess,
 with some hard-coded paths.  The just-posted fix for wal_keep_segments
 will also have to be applied.

I'm still trying to reproduce the problem myself. I keep trying simpler
versions of your test and I don't see the problem. It's a little awkward
to try to run the full version of your test (and I'm jumping between
code inspection and various other ideas).

 I'll work on it, but it will take awhile to figure out exactly how to
 use pg_filedump to do that, and how to automate that process and
 incorporate it into the harness.

It might be more productive to try to expand PageHeaderIsValid as I
suggested above.

 In the meantime, I tested the checksum commit itself (96ef3b8ff1c) and
 the problem occurs there, so if the problem is not the checksums
 themselves (and I agree it probably isn't) it must have been
 introduced before that rather than after.

Thank you for all of your work on this. I have also attached another
test patch that disables most of the complex areas of the checksums
patch (VM bits, hint bits, etc.). If you apply different portions of the
patch (or the whole thing), it will help eliminate possibilities. In
particular, eliminating the calls to PageSetAllVisible+visibilitymap_set
could tell us a lot.

Also, the first data directory you posted had a problem with a page that
appeared to be a new page (otherwise I would have expected either old or
new data at the end). Do you think this might be a problem only for new
pages? Or do you think your test just makes it likely that many writes
will happen on new pages?

I did find one potential problem in the checksums code (aside from my
previous patch involving wal_level=archive): visibilitymap_set is called
sometimes before the heap page is marked dirty. I will examine a little
more closely, but I expect that to require another patch. Not sure if
that could explain this problem or not.

Regards,
Jeff Davis

*** a/src/backend/access/heap/pruneheap.c
--- b/src/backend/access/heap/pruneheap.c
***
*** 257,264  heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
  		 * point in repeating the prune/defrag process until something else
  		 * happens to the page.
  		 */
! 		if (((PageHeader) page)->pd_prune_xid != prstate.new_prune_xid ||
! 			PageIsFull(page))
  		{
  			((PageHeader) page)->pd_prune_xid = prstate.new_prune_xid;
  			PageClearFull(page);
--- 257,264 
  		 * point in repeating the prune/defrag process until something else
  		 * happens to the page.
  		 */
! 		if (false && (((PageHeader) page)->pd_prune_xid != prstate.new_prune_xid ||
! 					  PageIsFull(page)))
  		{
  			((PageHeader) page)->pd_prune_xid = prstate.new_prune_xid;
  			PageClearFull(page);
*** a/src/backend/commands/vacuumlazy.c
--- b/src/backend/commands/vacuumlazy.c
***
*** 668,674  lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
  			freespace = PageGetHeapFreeSpace(page);
  
  			/* empty pages are always all-visible */
! 			if (!PageIsAllVisible(page))
  			{
  PageSetAllVisible(page);
  MarkBufferDirty(buf);
--- 668,674 
  			freespace = PageGetHeapFreeSpace(page);
  
  			/* empty pages are always all-visible */
! 			if (false && !PageIsAllVisible(page))
  			{
  PageSetAllVisible(page);
  MarkBufferDirty(buf);
***
*** 901,907  lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
  		freespace = PageGetHeapFreeSpace(page);
  
  		/* mark page all-visible, if appropriate */
! 		if (all_visible)
  		{
  			if (!PageIsAllVisible(page))
  			{
--- 901,907 
  		freespace = PageGetHeapFreeSpace(page);
  
  		/* mark page all-visible, if appropriate */
! 		if (false && all_visible)
  		{
  			if (!PageIsAllVisible(page))
  			{
***
*** 1149,1155  lazy_vacuum_page(Relation onerel, 

Re: [HACKERS] corrupt pages detected by enabling checksums

2013-04-03 Thread Andres Freund
On 2013-04-04 02:28:32 +0200, Andres Freund wrote:
 On 2013-04-04 01:52:41 +0200, Andres Freund wrote:
  On 2013-04-03 15:57:49 -0700, Jeff Janes wrote:
   I've changed the subject from regression test failed when enabling
   checksum because I now know they are totally unrelated.
   
   My test case didn't need to depend on archiving being on, and so with a
   simple tweak I rendered the two issues orthogonal.
   
   
   On Wed, Apr 3, 2013 at 12:15 PM, Jeff Davis pg...@j-davis.com wrote:
   
On Mon, 2013-04-01 at 19:51 -0700, Jeff Janes wrote:
   
 What I would probably really want is the data as it existed after the
 crash but before recovery started, but since the postmaster
 immediately starts recovery after the crash, I don't know of a good
 way to capture this.
   
Can you just turn off the restart_after_crash GUC? I had a chance to
look at this, and seeing the block before and after recovery would be
nice. I didn't see a log file in the data directory, but it didn't go
through recovery, so I assume it already did that.
   
   
   You don't know that the cluster is in the bad state until after it goes
   through recovery because most crashes recover perfectly fine.  So it would
   have to make a side-copy of the cluster after the crash, then recover the
   original and see how things go, then either retain or delete the 
   side-copy.
Unfortunately my testing harness can't do this at the moment, because the
   perl script storing the consistency info needs to survive over the crash
   and recovery.   It might take me awhile to figure out how to make it do
   this.
   
   I have the server log just go to stderr, where it gets mingled together
   with messages from my testing harness.  I had not uploaded that file.
   
   Here is a new upload. It contains both a data_dir tarball including xlog,
   and the mingled stderr (do_new.out)
   
   https://drive.google.com/folderview?id=0Bzqrh1SO9FcEQmVzSjlmdWZvUHcusp=sharing
   
   The other 3 files in it constitute the harness.  It is a quite a mess, 
   with
   some hard-coded paths.  The just-posted fix for wal_keep_segments will 
   also
   have to be applied.
   
   
   
   
The block is corrupt as far as I can tell. The first third is written,
and the remainder is all zeros. The header looks like this:
   
   
   Yes, that part is by my design.  Why it didn't get fixed from a FPI is not
   by my design, or course.
  
  There are no FPIs (if you mean a full page image with that) afaics:
  
  Your logs tell us about recovery:
  27692  2013-04-03 10:09:15.621 PDT:LOG:  redo starts at 1/3190
  27692  2013-04-03 10:09:15.647 PDT:LOG:  incorrect resource manager data 
  checksum in record at 1/31169A68
  27692  2013-04-03 10:09:15.647 PDT:LOG:  redo done at 1/31169A38

 Looking at the page lsn's with dd I noticed something peculiar:
 
 page 0:
 01 00 00 00 18 c2 00 31 = 1/3100C218
 page 1:
 01 00 00 00 80 44 01 31 = 1/31014480
 page 10:
 01 00 00 00 60 ce 05 31 = 1/3105ce60
 page 43:
 01 00 00 00 58 7a 16 31 = 1/31167a58
 page 44:
 01 00 00 00 f0 99 16 31 = 1/311699f0
 page 45:
 00 00 00 00 00 00 00 00 = 0/0
 page 90:
 01 00 00 00 90 17 1d 32 = 1/321d1790
 page 91:
 01 00 00 00 38 ef 1b 32 = 1/321bef38
 
 So we have written out pages that are after pages without a LSN that
 have an LSN thats *beyond* the point XLOG has successfully been written
 to disk (1/31169A38)?

By now I am pretty sure the issue is that the end-of-wal is detected too
early. Above the end is detected at 1/31169A38, but the next WAL file
contains valid data:
pg_xlogdump /tmp/tmp/data2/pg_xlog/000100010032 -n 10
rmgr: Heap2   len (rec/tot): 26/58, tx:  0, lsn: 
1/3230, prev 1/31D8, bkp: , desc: clean: rel 1663/16384/835589; blk 
122 remxid 26328926
rmgr: Heaplen (rec/tot): 51/83, tx:   26328964, lsn: 
1/3270, prev 1/3230, bkp: , desc: update: rel 1663/16384/835589; 
tid 122/191 xmax 26328964 ; new tid 129/140 xmax 0
rmgr: Heap2   len (rec/tot): 30/62, tx:  0, lsn: 
1/32C8, prev 1/3270, bkp: , desc: clean: rel 1663/16384/835589; blk 
63 remxid 26328803
rmgr: Btree   len (rec/tot): 34/66, tx:   26328964, lsn: 
1/32000108, prev 1/32C8, bkp: , desc: insert: rel 1663/16384/835590; 
tid 25/14

That would explain exactly those symptoms, wouldn't it? What's causing
that - I am not sure yet.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-04-03 Thread Tom Lane
and...@anarazel.de (Andres Freund) writes:
 Looking at the page lsn's with dd I noticed something peculiar:

 page 0:
 01 00 00 00 18 c2 00 31 = 1/3100C218
 page 1:
 01 00 00 00 80 44 01 31 = 1/31014480
 page 10:
 01 00 00 00 60 ce 05 31 = 1/3105ce60
 page 43:
 01 00 00 00 58 7a 16 31 = 1/31167a58
 page 44:
 01 00 00 00 f0 99 16 31 = 1/311699f0
 page 45:
 00 00 00 00 00 00 00 00 = 0/0
 page 90:
 01 00 00 00 90 17 1d 32 = 1/321d1790
 page 91:
 01 00 00 00 38 ef 1b 32 = 1/321bef38

 So we have written out pages that are after pages without a LSN that
 have an LSN thats *beyond* the point XLOG has successfully been written
 to disk (1/31169A38)?

If you're looking into the FPIs, those would contain the page's older
LSN, not the one assigned by the current WAL record.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-04-03 Thread Andres Freund
On 2013-04-03 20:45:51 -0400, Tom Lane wrote:
 and...@anarazel.de (Andres Freund) writes:
  Looking at the page lsn's with dd I noticed something peculiar:
 
  page 0:
  01 00 00 00 18 c2 00 31 = 1/3100C218
  page 1:
  01 00 00 00 80 44 01 31 = 1/31014480
  page 10:
  01 00 00 00 60 ce 05 31 = 1/3105ce60
  page 43:
  01 00 00 00 58 7a 16 31 = 1/31167a58
  page 44:
  01 00 00 00 f0 99 16 31 = 1/311699f0
  page 45:
  00 00 00 00 00 00 00 00 = 0/0
  page 90:
  01 00 00 00 90 17 1d 32 = 1/321d1790
  page 91:
  01 00 00 00 38 ef 1b 32 = 1/321bef38
 
  So we have written out pages that are after pages without a LSN that
  have an LSN thats *beyond* the point XLOG has successfully been written
  to disk (1/31169A38)?
 
 If you're looking into the FPIs, those would contain the page's older
 LSN, not the one assigned by the current WAL record.

Nope, that's from the heap, and the LSNs are *newer* than what startup
recovered to. I am pretty sure by now we are missing out on valid WAL, I
am just not sure why.

Unfortunately we can't easily diagnose what happened at:
27692  2013-04-03 10:09:15.647 PDT:LOG:  incorrect resource manager data 
checksum in record at 1/31169A68
since the startup process wrote its end of recovery checkpoint there:
rmgr: XLOGlen (rec/tot): 72/   104, tx:  0, lsn: 
1/31169A68, prev 1/31169A38, bkp: , desc: checkpoint: redo 1/31169A68; tli 
1; prev tli 1; fpw true; xid 0/26254999; oid 843781; multi 1; offset 0; oldest 
xid 1799 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; shutdown

Starting from a some blocks in that wal segments later:
pg_xlogdump /tmp/tmp/data2/pg_xlog/000100010031 -s 1/3116c000 -n 10
first record is after 1/3116C000, at 1/3116D9D8, skipping over 6616 bytes
rmgr: Heaplen (rec/tot): 51/83, tx:   26254999, lsn: 
1/3116D9D8, prev 1/3116BA20, bkp: , desc: update: rel 1663/16384/835589; 
tid 38/148 xmax 26254999 ; new tid 44/57 xmax 0
rmgr: Btree   len (rec/tot): 34/66, tx:   26254999, lsn: 
1/3116DA30, prev 1/3116D9D8, bkp: , desc: insert: rel 1663/16384/835590; 
tid 25/319
rmgr: Heaplen (rec/tot): 51/83, tx:   26255000, lsn: 
1/3116DA78, prev 1/3116DA30, bkp: , desc: update: rel 1663/16384/835589; 
tid 19/214 xmax 26255000 ; new tid 44/58 xmax 0

the records continue again.

Greetings,


Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] corrupt pages detected by enabling checksums

2013-04-03 Thread Andres Freund
On 2013-04-04 02:58:43 +0200, Andres Freund wrote:
 On 2013-04-03 20:45:51 -0400, Tom Lane wrote:
  and...@anarazel.de (Andres Freund) writes:
   Looking at the page lsn's with dd I noticed something peculiar:
 
   page 0:
   01 00 00 00 18 c2 00 31 = 1/3100C218
   page 1:
   01 00 00 00 80 44 01 31 = 1/31014480
   page 10:
   01 00 00 00 60 ce 05 31 = 1/3105ce60
   page 43:
   01 00 00 00 58 7a 16 31 = 1/31167a58
   page 44:
   01 00 00 00 f0 99 16 31 = 1/311699f0
   page 45:
   00 00 00 00 00 00 00 00 = 0/0
   page 90:
   01 00 00 00 90 17 1d 32 = 1/321d1790
   page 91:
   01 00 00 00 38 ef 1b 32 = 1/321bef38
 
   So we have written out pages that are after pages without a LSN that
   have an LSN thats *beyond* the point XLOG has successfully been written
   to disk (1/31169A38)?
 
  If you're looking into the FPIs, those would contain the page's older
  LSN, not the one assigned by the current WAL record.

 Nope, thats from the heap, and the LSNs are *newer* than what startup
 recovered to. I am pretty sure by now we are missing out on valid WAL, I
 am just not sure why.

Ok, I think I see the bug. And I think it's been introduced in the
checkpoints patch.

Remember, as explained in other mails, I am pretty sure the end of wal
was erroneously detected, resulting in the following error:
27692  2013-04-03 10:09:15.647 PDT:LOG:  incorrect resource manager data 
checksum in record at 1/31169A68

Note that without a *hardware* crash the end of WAL is normally found first
by the checks for wrong LSNs or wrong record lengths. Not here.

My theory is that a page that was full-page written changed while it was
being inserted into the WAL, which caused the error. A possible scenario
for that would be e.g. two concurrent calls to MarkBufferDirtyHint().
Note:
 * 2. The caller might have only share-lock instead of exclusive-lock on the
 *	  buffer's content lock.

if ((bufHdr->flags & (BM_DIRTY | BM_JUST_DIRTIED)) !=
	(BM_DIRTY | BM_JUST_DIRTIED))
{
	if (DataChecksumsEnabled() && (bufHdr->flags & BM_PERMANENT))
	{
		MyPgXact->delayChkpt = delayChkpt = true;
		lsn = XLogSaveBufferForHint(buffer); /* PART 1 */
	}
}

LockBufHdr(bufHdr);
Assert(bufHdr->refcount > 0);

if (!(bufHdr->flags & BM_DIRTY))
{
	dirtied = true;		/* Means will be dirtied by this action */

	/*
	 * Set the page LSN if we wrote a backup block. We aren't supposed to
	 * set this when only holding a share lock but as long as we serialise
	 * it somehow we're OK. We choose to set LSN while holding the buffer
	 * header lock, which causes any reader of an LSN who holds only a
	 * share lock to also obtain a buffer header lock before using
	 * PageGetLSN(). Fortunately, thats not too many places.
	 *
	 * If checksums are enabled, you might think we should reset the
	 * checksum here. That will happen when the page is written sometime
	 * later in this checkpoint cycle.
	 */
	if (!XLogRecPtrIsInvalid(lsn))
		PageSetLSN(page, lsn); /* PART 2 */
}

So XLogSaveBufferForHint(buffer) is called for all buffers which are not
yet dirty. Without any locks. It just does:

XLogSaveBufferForHint(Buffer buffer)
{
rdata[0].data = (char *) (&watermark);
rdata[0].len = sizeof(int);
rdata[0].next = &(rdata[1]);

rdata[1].data = NULL;
rdata[1].len = 0;
rdata[1].buffer = buffer;
rdata[1].buffer_std = true;
rdata[1].next = NULL;

return XLogInsert(RM_XLOG_ID, XLOG_HINT, rdata);
}

I think what happens is that one backend grabs WALInsertLock first; it
computes a full page write for the buffer, including a CRC, and returns a
valid LSN. Then it continues towards the PageSetLSN() after LockBufHdr().
All the while the second backend notices that WALInsertLock is free again
and continues towards the WALInsertLock-protected parts of XLogInsert().

Only that it already has done that part:
/*
 * Calculate CRC of the data, including all the backup blocks
 *
 * Note that the record header isn't added into the CRC initially since we
 * don't know the prev-link yet.  Thus, the CRC will represent the CRC of
 * the whole record in the order: rdata, then backup blocks, then record
 * header.
 */
INIT_CRC32(rdata_crc);
for (rdt = rdata; rdt != NULL; rdt = rdt->next)
	COMP_CRC32(rdata_crc, rdt->data, rdt->len);

It goes on to write all the data to the