Re: [HACKERS] Checksums, state of play

2012-03-08 Thread Peter Geoghegan
On 7 March 2012 20:56, Bruce Momjian br...@momjian.us wrote:
 Yep, good summary.  Giving ourselves a few months to think about it and
 consider other failure cases will make this a great 9.3 addition.

Recent Intel processors that support SSE 4.2, including those based on
the core microarchitecture, can calculate a CRC-32C in hardware using
a higher level instruction, similar to the AES related instructions
that Intel chips have had for some time now. Perhaps we should
consider using hardware acceleration where available.

Some interesting details are available from here:

http://lwn.net/Articles/292984/

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Checksums, state of play

2012-03-08 Thread Thom Brown
On 8 March 2012 20:55, Peter Geoghegan pe...@2ndquadrant.com wrote:
 On 7 March 2012 20:56, Bruce Momjian br...@momjian.us wrote:
 Yep, good summary.  Giving ourselves a few months to think about it and
 consider other failure cases will make this a great 9.3 addition.

 Recent Intel processors that support SSE 4.2, including those based on
 the core microarchitecture, can calculate a CRC-32C in hardware using
 a higher level instruction, similar to the AES related instructions
 that Intel chips have had for some time now. Perhaps we should
 consider using hardware acceleration where available.

 Some interesting details are available from here:

 http://lwn.net/Articles/292984/

That's what Facebook did to speed up MySQL.
-- 
Thom

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Checksums, state of play

2012-03-07 Thread Robert Haas
On Tue, Mar 6, 2012 at 2:27 PM, Bruce Momjian br...@momjian.us wrote:
 The feature is no where near complete, and we should not be designing
 features at this stage.

I agree, on both counts.  Although Simon did a good job pulling
together something that basically works in a short amount of time, the
edge cases still need a lot more thought, and work.  Yesterday's
discussion was mostly about turning the feature on and off, which
certainly seems to be the most significant problem with the patch as
it stands.  But there are also a number of other things that have been
discussed and not fully resolved, such as the performance impact of
WAL-logging hint bit changes, the exact way we're going to sandwhich
this into the page header, and the right way to handle the necessary
buffer locking.  I think all of those issues can be resolved but it's
not going to happen in a day, and even once it does there will still
be other, smaller things that need to be cleaned up here and there.
Really measuring and fixing all of these issues will be a matter of
months, not weeks.

Simon seems to be proposing that, in lieu of spending too much more
time fixing this, we just commit it and document the known
limitations.  I don't agree with that.  In particular, I think the
idea of committing a checksum patch that can produce false positives
in the event of a torn page situation is a really bad idea.  The whole
point of the patch is to distinguish between hardware failure and
software failure; if we can't reliably do that, I don't see this as
being much of an advance over the status quo.  I think we're going to
find that the cost of WAL-logging hints is bad enough that people are
only going to do it when they already suspect a problem and want
confirmation.  If they can't rely on that confirmation being real, as
opposed to an outgrowth of a known limitation of the feature, I don't
see the point.  I'd much rather see this feature wait for 9.3 than
ship something that's unreliable in this regard.

So I think it's time to push this one out to 9.3.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Checksums, state of play

2012-03-07 Thread Simon Riggs
On Wed, Mar 7, 2012 at 5:28 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Mar 6, 2012 at 2:27 PM, Bruce Momjian br...@momjian.us wrote:
 The feature is no where near complete, and we should not be designing
 features at this stage.

 I agree, on both counts.  Although Simon did a good job pulling
 together something that basically works in a short amount of time, the
 edge cases still need a lot more thought, and work.  Yesterday's
 discussion was mostly about turning the feature on and off, which
 certainly seems to be the most significant problem with the patch as
 it stands.  But there are also a number of other things that have been
 discussed and not fully resolved, such as the performance impact of
 WAL-logging hint bit changes,

I think saying there are some performance doubts doesn't actually
make it so. We know it will impact performance, and how. I don't see
any big mystery. It's not pretty, but thats what it is. Prepared
transactions are slow, but they're still there.

 the exact way we're going to sand which
 this into the page header,

I'm OK with either pd_tli or pg_pagesize_version. We just need 16
contiguous bytes.

 and the right way to handle the necessary
 buffer locking.

That is resolved, AFAIK

 Simon seems to be proposing that, in lieu of spending too much more
 time fixing this, we just commit it and document the known
 limitations.  I don't agree with that.

Neither do I. It's pretty clear from our last discussion that the
fix proposed doesn't actually work fully so I don't think its going
to be either more robust or more certain to give low false positives.
So I don't think more time fixing this will actually improve the
situation. I'm not suggesting I skip any work, I think the extra work
is pointless.

I do understand the issue of risk that exists, I just think there's
other ways of ameliorating that other than heaping more software onto
the problem.

  In particular, I think the
 idea of committing a checksum patch that can produce false positives
 in the event of a torn page situation is a really bad idea.

Again, that isn't a correct description of the issue, so yes, if that
were the case I would agree it would be a really bad idea.

If we keep misdescribing risk situations of course people will sense
high risk and want to hold back any such patch. So we need to be
balanced and accurate about the risks. Since people have been pretty
negative about this patch for some time I'm not really surprised they
now feel its a high risk decision to accept it. If I knew nothing, I
would think that also based upon what has been said.

 The whole
 point of the patch is to distinguish between hardware failure and
 software failure; if we can't reliably do that, I don't see this as
 being much of an advance over the status quo.  I think we're going to
 find that the cost of WAL-logging hints is bad enough that people are
 only going to do it when they already suspect a problem and want
 confirmation.  If they can't rely on that confirmation being real, as
 opposed to an outgrowth of a known limitation of the feature, I don't
 see the point.  I'd much rather see this feature wait for 9.3 than
 ship something that's unreliable in this regard.

 So I think it's time to push this one out to 9.3.

I accept this and was not expecting to commit anything now.

I think this decision was actually made quite some time previously and
reviewing these points again is just a further waste of time at this
point.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Checksums, state of play

2012-03-07 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 So I think it's time to push this one out to 9.3.

Frankly, I thought it was probably too late for 9.2 when it was first
proposed, and the number of issues that have been identified since then
just confirm that feeling.

There is another point beyond the ones in your summary: if we squeeze
checksums into the page header without increasing its size, however we
do it there will be no going back, because we will certainly have used
up all the remaining wiggle room there.  That means we'll be absolutely
committed to this implementation, permanently (or at least until we
invent the page format upgrade code that no one seems to want to build).
I think we need to be a lot more certain that we're happy with the
design and can't improve on it before we make that commitment.

What I'd like to see is for Simon to keep hacking on the identified
issues, with the hope of having a solid patch ready to commit early
in the 9.3 cycle.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Checksums, state of play

2012-03-07 Thread Robert Haas
On Wed, Mar 7, 2012 at 3:09 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Neither do I. It's pretty clear from our last discussion that the
 fix proposed doesn't actually work fully so I don't think its going
 to be either more robust or more certain to give low false positives.
 So I don't think more time fixing this will actually improve the
 situation.

I hope that's not true, and I certainly don't think it's true.  Like
Tom, I'd like to see you keep working on this (or maybe someone else
will pick it up) for 9.3.  I agree that our most recent discussing
left off with a somewhat depressing conclusion, but I don't think that
means we should give up; I think it just means that we need a better
idea than the ones we've had so far.  I guess it's possible that there
is no better idea out there, but I think it's more likely that we just
haven't thought of it yet.  I feel like we are close to unraveling it,
and just not quite there yet.  I might be wrong.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Checksums, state of play

2012-03-07 Thread Bruce Momjian
On Wed, Mar 07, 2012 at 03:33:34PM -0500, Robert Haas wrote:
 On Wed, Mar 7, 2012 at 3:09 PM, Simon Riggs si...@2ndquadrant.com wrote:
  Neither do I. It's pretty clear from our last discussion that the
  fix proposed doesn't actually work fully so I don't think its going
  to be either more robust or more certain to give low false positives.
  So I don't think more time fixing this will actually improve the
  situation.
 
 I hope that's not true, and I certainly don't think it's true.  Like
 Tom, I'd like to see you keep working on this (or maybe someone else
 will pick it up) for 9.3.  I agree that our most recent discussing
 left off with a somewhat depressing conclusion, but I don't think that
 means we should give up; I think it just means that we need a better
 idea than the ones we've had so far.  I guess it's possible that there
 is no better idea out there, but I think it's more likely that we just
 haven't thought of it yet.  I feel like we are close to unraveling it,
 and just not quite there yet.  I might be wrong.

Yep, good summary.  Giving ourselves a few months to think about it and
consider other failure cases will make this a great 9.3 addition.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Checksums, state of play

2012-03-06 Thread Simon Riggs
On Mon, Mar 5, 2012 at 11:29 PM, Josh Berkus j...@agliodbs.com wrote:

 3. Pages with checksums set need to have a version marking to show
 that they are a later version of the page layout. That version number
 needs to be extensible to many later versions. Pages of multiple
 versions need to exist within the server to allow simple upgrades and
 migration.

 This is a statement of a problem; do you have a proposed solution for it?

Yes. (3) and (4) are in some ways related, so the solution was
presented further down the page.

 (3) and (4) are in conflict with each other, but there is a solution.
 We mark the block with a version number, but we don't make the
 checking dependant upon the version number. We simply avoid making any
 checks until the command to scan all blocks is complete, per point
 (2). That way we need to use 1 flag bit to mark the new version and
 zero flag bits to indicate checks should happen.

The proposed solution is extensible and currently would allow many new versions.

Thank you for your comments.

Are all the other aspects of the proposal acceptable, or are there
other options anybody would like to explore? We've discussed a range
of possible options, so the proposal can be varied as needed.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Checksums, state of play

2012-03-06 Thread Robert Haas
On Mon, Mar 5, 2012 at 10:03 AM, Simon Riggs si...@2ndquadrant.com wrote:
 1. We don't need them because there will be something better in a
 later release. I don't think anybody disagrees that a better solution
 is possible in the future; doubts have been expressed as to what will
 be required and when that is likely to happen. Opinions differ. We can
 and should do something now unless there is reason not to.

I don't think there's really a serious problem here.  Everyone seems
to agree that checksums are useful, and nobody even seems terribly
unhappy with the fact that this is only a 16-bit checksum.  On the
flip side I think it's quite likely that we will never wish to support
multiple checksum implementations, so for that reason I do believe
that whatever we commit first in this area needs to be something we
can live with for a very long time.  It can have limitations of
course, but we should be comfortable with the on-disk format and basic
design of the feature.

 2. Turning checksums on/off/on/off in rapid succession can cause false
 positive reports of checksum failure if crashes occur and are ignored.
 That may lead to the feature and PostgreSQL being held in disrepute.

This I do think is a problem, although not for precisely the reason
stated here.  In my experience, in data corruption situations, the
first thing customers do is blame PostgreSQL: they don't believe it's
the hardware; they accuse us of having bugs in our code.  Having a
checksum feature would be valuable, because, first, we'd perhaps
detect problems sooner and, second, people understand what checksums
are and that checksum failures really shouldn't happen unless the
hardware is bad.  More generally, one of the purposes of checksums is
to distinguish hardware failure from other possible causes of data
corruption problems.  If there are code paths where checksum failures
can happy despite the hardware being good, I think that the patch will
fail to accomplish its goal of giving us confidence that the hardware
is bad.

 This can be resolved, if desired, by having having a two-stage
 enabling process where we must issue a command that scans every block
 in the database before checksum checking begins. VACUUM is easily
 modified to the task, we just need to agree that is suitable and agree
 syntax.
 A suggestion is VACUUM ENABLE CHECKSUMS; others are possible.

If we're going to extend the syntax of VACUUM, we should use the
extensible-options syntax there - i.e. something like VACUUM
(CHECKSUMS ON).  But overall I think this proposal lacks in detail.
It seems to me that there are four states that a database (or table,
if we made it more fine-grained) could be in:

1. No blocks have checksums.

2. Checksums are being added, but some blocks may not yet have them.
Thus, it's not an error for a block to have no checksum, but if a
block appears to have a checksum, it should be correct.   All blocks
are written with checksums.

3. Checksums are fully enabled.   Checksum failures are still an
error, and apparent lack of a checksum is now an error as well, since
it must indicate something's been corrupted.  All blocks are written
with checksums.

4. Checksums are being removed, but some blocks may still have them.
Thus, it's not an error for a block to have no checksum, but any
still-remaining checksums should be correct (though possibly we ought
not to complain if they aren't, to provide a recovery path for users
who are turning checksums off because they're getting errors they
don't want).  Any block that's written is written without checksums.

I think we need to be clear about how the system transitions between
these states.  In the current patch, AIUI, you can effectively go from
1-2 or 4-2 by setting page_checksums=on and from 2-4 by setting
page_checksums=off, but there's no easy way to ensure that you've
reached state 3 or that you've gotten back to state 1.  Some variant
of VACUUM seems like a good way of doing that, but it doesn't make
sense for example to have page_checksums=off and do VACUUM (CHECKSUMS
ON), or to have page_checksums=on and do VACUUM (CHECKSUMS OFF).  I
guess we could just reject those as error cases, but it would be
nicer, I think, to have an interface with a higher degree of
orthogonality.

There's probably more than one way to do that, but my personal
preference, as previously noted, is to make this a table-level option,
rather than a GUC.  Then, VACUUM (CHECKSUMS ON) can first change the
pg_class entry to indicate that checksums are enabling-in-progress
(i.e. 1-2), then scan the table, adding checksums, and then mark
checksums as fully enabled (i.e. 2-3).  VACUUM (CHECKSUMS OFF) can
proceed in symmetric fashion, marking checksums as
disabling-in-progress (3-4), then scanning the table and getting rid
of them, and then marking them fully disabled (4-1).  If a crash
happens in the middle somewhere, the state of the table can get left
as enabling-in-progress or disabling-in-progress, but a new VACUUM
(CHECKSUMS 

Re: [HACKERS] Checksums, state of play

2012-03-06 Thread Simon Riggs
On Tue, Mar 6, 2012 at 2:25 PM, Robert Haas robertmh...@gmail.com wrote:

 4. Checksums are being removed, but some blocks may still have them.
 Thus, it's not an error for a block to have no checksum, but any
 still-remaining checksums should be correct (though possibly we ought
 not to complain if they aren't, to provide a recovery path for users
 who are turning checksums off because they're getting errors they
 don't want).  Any block that's written is written without checksums.

I agree its advantageous to have a means of removing pagesums from
data blocks as a 4th state.

I think we need a 5th state - pagesums disabled. Which allows an
emergency disabling of the feature without rewriting the blocks.
Obviously if the database is damaged and I/O devices going bad, trying
to rescan database is likely to cause further problems.

 I think we need to be clear about how the system transitions between
 these states.  In the current patch, AIUI, you can effectively go from
 1-2 or 4-2 by setting page_checksums=on and from 2-4 by setting
 page_checksums=off, but there's no easy way to ensure that you've
 reached state 3 or that you've gotten back to state 1.  Some variant
 of VACUUM seems like a good way of doing that, but it doesn't make
 sense for example to have page_checksums=off and do VACUUM (CHECKSUMS
 ON), or to have page_checksums=on and do VACUUM (CHECKSUMS OFF).  I
 guess we could just reject those as error cases, but it would be
 nicer, I think, to have an interface with a higher degree of
 orthogonality.

Right, a misunderstanding I think.

If we have states set at database level then we'd not have a GUC as well.

 There's probably more than one way to do that, but my personal
 preference, as previously noted, is to make this a table-level option,
 rather than a GUC.  Then, VACUUM (CHECKSUMS ON) can first change the
 pg_class entry to indicate that checksums are enabling-in-progress
 (i.e. 1-2), then scan the table, adding checksums, and then mark
 checksums as fully enabled (i.e. 2-3).  VACUUM (CHECKSUMS OFF) can
 proceed in symmetric fashion, marking checksums as
 disabling-in-progress (3-4), then scanning the table and getting rid
 of them, and then marking them fully disabled (4-1).  If a crash
 happens in the middle somewhere, the state of the table can get left
 as enabling-in-progress or disabling-in-progress, but a new VACUUM
 (CHECKSUMS X) can be used to finish the process, and we always know
 exactly where we're at.

Any in-progress state needs to have checksums removed first, then re-added.

I'll keep an open mind for now about database/table level. I'm not
sure how possible/desirable each is.

 I generally agree with this outline, though I think that in lieu of a
 version number we could simply set a new pd_flags bit indicating that
 checksums are enabled.  If we haven't fully enabled checksums yet,
 then the fact that this bit isn't set is not an error; but if
 checksums are fully enabled, then every page must have that bit set,
 and any page that doesn't is ipso facto corrupt.

Whether to have it or not, if a corruption occurs during
checksum-enabling then we could get a false reading. If we have a bit
then the bit can be set wrong, so we could either make a check when it
wasn't due, or skip a check we should have made. If we don't have a
bit and so skip checksum checking during enabling process then we can
get an error that isn't spotted by the checksum process.

Given it can happen both ways, we should have a bit/ or not depending
upon which is the least likely to be wrong. I would say having the bit
provides the least likely way to get false readings.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Checksums, state of play

2012-03-06 Thread Simon Riggs
On Tue, Mar 6, 2012 at 2:25 PM, Robert Haas robertmh...@gmail.com wrote:

 For the reasons stated above, I believe pd_tli is less useful than
 pd_pagesize_version.  I fear that if we repurpose pd_pagesize_version,
 we're going to make things very difficult for people who want to write
 block-inspection tools, like pg_filedump or pageinspect.  Right now,
 it's possible to look at that offset within the block and know for
 certain what page version you're dealing with.  If we repurpose it to
 hold checksum data, that will no longer be possible.  Unlike pd_tli,
 pd_pagesize_version is validated every time we read a block.

We've not changed the page format in 5 years. I really can't see what
the value of having a constant stored on every data block, especially
since you're now saying that we *shouldn't* bump the constant for this
change. Surely if we are keeping the pd_pagesize_version field its
obvious that we should increment it? If not, why the insistence on
keeping the field if we aren't using it for its stated purpose?

Do you know of any PostgreSQL variant that can set this byte range to
different values? If so, I'd suggest we just declare the field user
defined or some such so that others can use it for different things
as well and then use pd_tli.

IMHO if we keep use pd_tli but pd_pagesize_version then we should increment it.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Checksums, state of play

2012-03-06 Thread Simon Riggs
On Tue, Mar 6, 2012 at 3:31 PM, Simon Riggs si...@2ndquadrant.com wrote:

 I'll keep an open mind for now about database/table level. I'm not
 sure how possible/desirable each is.

Table level sounds great, but how will it work with recovery? We don't
have a relcache in Startup process.

So either database or tablespace level seems doable.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Checksums, state of play

2012-03-06 Thread Robert Haas
On Tue, Mar 6, 2012 at 10:31 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Tue, Mar 6, 2012 at 2:25 PM, Robert Haas robertmh...@gmail.com wrote:
 4. Checksums are being removed, but some blocks may still have them.
 Thus, it's not an error for a block to have no checksum, but any
 still-remaining checksums should be correct (though possibly we ought
 not to complain if they aren't, to provide a recovery path for users
 who are turning checksums off because they're getting errors they
 don't want).  Any block that's written is written without checksums.

 I agree its advantageous to have a means of removing pagesums from
 data blocks as a 4th state.

 I think we need a 5th state - pagesums disabled. Which allows an
 emergency disabling of the feature without rewriting the blocks.
 Obviously if the database is damaged and I/O devices going bad, trying
 to rescan database is likely to cause further problems.

Maybe we should consider making this a separate flag that controls
what we do when we detect a checksum failure -- ERROR, LOG, or just
count it in the stats -- rather than an extra state.

 I think we need to be clear about how the system transitions between
 these states.  In the current patch, AIUI, you can effectively go from
 1-2 or 4-2 by setting page_checksums=on and from 2-4 by setting
 page_checksums=off, but there's no easy way to ensure that you've
 reached state 3 or that you've gotten back to state 1.  Some variant
 of VACUUM seems like a good way of doing that, but it doesn't make
 sense for example to have page_checksums=off and do VACUUM (CHECKSUMS
 ON), or to have page_checksums=on and do VACUUM (CHECKSUMS OFF).  I
 guess we could just reject those as error cases, but it would be
 nicer, I think, to have an interface with a higher degree of
 orthogonality.

 Right, a misunderstanding I think.

 If we have states set at database level then we'd not have a GUC as well.

OK.  If the flag is set at database level rather than table or cluster
level, then shared catalogs become a bit of a wart.  Maybe managably
so, but it needs thought, at the least.

 There's probably more than one way to do that, but my personal
 preference, as previously noted, is to make this a table-level option,
 rather than a GUC.  Then, VACUUM (CHECKSUMS ON) can first change the
 pg_class entry to indicate that checksums are enabling-in-progress
 (i.e. 1-2), then scan the table, adding checksums, and then mark
 checksums as fully enabled (i.e. 2-3).  VACUUM (CHECKSUMS OFF) can
 proceed in symmetric fashion, marking checksums as
 disabling-in-progress (3-4), then scanning the table and getting rid
 of them, and then marking them fully disabled (4-1).  If a crash
 happens in the middle somewhere, the state of the table can get left
 as enabling-in-progress or disabling-in-progress, but a new VACUUM
 (CHECKSUMS X) can be used to finish the process, and we always know
 exactly where we're at.

 Any in-progress state needs to have checksums removed first, then re-added.

I don't think that's necessarily true.  If the user begins disabling
checksums (i.e. they are in state #4) and they change their mind, they
should be able to switch back to enabling mode.

 I'll keep an open mind for now about database/table level. I'm not
 sure how possible/desirable each is.

I think that per-table is clearly more desirable, but I haven't
studied how invasive the code changes are.  I suspect it is manageable
but I might be building castles in the air.

 I generally agree with this outline, though I think that in lieu of a
 version number we could simply set a new pd_flags bit indicating that
 checksums are enabled.  If we haven't fully enabled checksums yet,
 then the fact that this bit isn't set is not an error; but if
 checksums are fully enabled, then every page must have that bit set,
 and any page that doesn't is ipso facto corrupt.

 Whether to have it or not, if a corruption occurs during
 checksum-enabling then we could get a false reading.

That is true, and I think unavoidable.  While checksums are being
enabled, we have to assume that a not-checksummed page just hasn't
been updated yet.  There's no way to distinguish those cases.

 If we have a bit
 then the bit can be set wrong, so we could either make a check when it
 wasn't due, or skip a check we should have made. If we don't have a
 bit and so skip checksum checking during enabling process then we can
 get an error that isn't spotted by the checksum process.

 Given it can happen both ways, we should have a bit/ or not depending
 upon which is the least likely to be wrong. I would say having the bit
 provides the least likely way to get false readings.

I agree that having a bit is good.  It's not really necessary, but it
makes it easier to understand the logic, and allows us to catch and
distinguish error cases better.  Once checksums are fully enabled, all
checksums must match, bit or no bit.  But it's easier to understand
what the intermediate states look like with 

Re: [HACKERS] Checksums, state of play

2012-03-06 Thread Robert Haas
On Tue, Mar 6, 2012 at 10:40 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Tue, Mar 6, 2012 at 2:25 PM, Robert Haas robertmh...@gmail.com wrote:
 For the reasons stated above, I believe pd_tli is less useful than
 pd_pagesize_version.  I fear that if we repurpose pd_pagesize_version,
 we're going to make things very difficult for people who want to write
 block-inspection tools, like pg_filedump or pageinspect.  Right now,
 it's possible to look at that offset within the block and know for
 certain what page version you're dealing with.  If we repurpose it to
 hold checksum data, that will no longer be possible.  Unlike pd_tli,
 pd_pagesize_version is validated every time we read a block.

 We've not changed the page format in 5 years. I really can't see what
 the value of having a constant stored on every data block, especially
 since you're now saying that we *shouldn't* bump the constant for this
 change. Surely if we are keeping the pd_pagesize_version field its
 obvious that we should increment it? If not, why the insistence on
 keeping the field if we aren't using it for its stated purpose?

 Do you know of any PostgreSQL variant that can set this byte range to
 different values? If so, I'd suggest we just declare the field user
 defined or some such so that others can use it for different things
 as well and then use pd_tli.

 IMHO if we keep use pd_tli but pd_pagesize_version then we should increment 
 it.

The fact that we haven't changed the page format in 5 years is a good
thing, and I hope that we won't change it very often because it will
require whole-cluster rewrites to take full advantage of whatever
features are made available by the version bump, which is darn
painful.  However, I'm pretty sure that eventually we're going to want
to bump it.  Aside from checksums, the most imminent thing I can think
of that might cause us to do that is this idea regarding XID
wraparound:

http://archives.postgresql.org/message-id/4f2fa541.8040...@enterprisedb.com

However, even if we don't do that or we find some way to do it without
bumping the page version, it seems likely to me that something else
will come up eventually.  The size of our page header doesn't thrill
me, but the one byte we've allocated to storing the version is only a
minor contributor and pretty cheap insurance against future needs.

As to whether we should increment pd_pagesize_version, I'm not sure
quite what you were saying about that (I think you may have an extra
or missing word there), but I don't think it's necessary here.  I
believe we feel free to assign new flag bits without bumping the page
size version, so we could define PD_HAS_CHECKSUM without doing so.
Maybe your point is that we're changing the meaning of pd_tli and it
seems ugly to do that without the bumping the page version, but I
guess my point is that we're not changing it incompatibly.  We really
only need to bump the page version for changes where a newer version
of the server would otherwise misinterpret an older page, which isn't
a problem in this case because pd_tli is basically dead already.

And, on a more practical level, Tom argued on the other thread that if
we have a page upgrade facility, then we ought to store the minimum
page version for every relation in a pg_class column, so we can keep
track of when all pages of the older format are gone.  That's
infrastructure that this patch doesn't really need, and we can avoid
having to build it by steering clear of the page versioning issue
altogether, viewing this instead as an enhancement of the existing
page format that doesn't break compatibility with older releases.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Checksums, state of play

2012-03-06 Thread Simon Riggs
On Tue, Mar 6, 2012 at 4:06 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Tue, Mar 6, 2012 at 3:31 PM, Simon Riggs si...@2ndquadrant.com wrote:

 I'll keep an open mind for now about database/table level. I'm not
 sure how possible/desirable each is.

 Table level sounds great, but how will it work with recovery? We don't
 have a relcache in Startup process.

 So either database or tablespace level seems doable.

Even db or ts level is problematic.

Options

(1) Recovery ignores checksums until db in consistent state

(2) Recovery ignores checksums until all databases are enabled, when
we set flag in pg_control

(3) Recovery checks blocks marked as having a checksum, no matter the
overall state

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Checksums, state of play

2012-03-06 Thread Simon Riggs
On Tue, Mar 6, 2012 at 4:42 PM, Robert Haas robertmh...@gmail.com wrote:

 As to whether we should increment pd_pagesize_version, I'm not sure
 quite what you were saying about that (I think you may have an extra
 or missing word there), but I don't think it's necessary here.

I said this...

On Tue, Mar 6, 2012 at 3:40 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Do you know of any PostgreSQL variant that can set this byte range to
 different values?

Not sure what the missing word is there, so I'll ask again.

Has EDB or anybody else you know of has used the pd_pagesize_version
field for something else, so you'd rather I didn't touch that?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Checksums, state of play

2012-03-06 Thread Robert Haas
On Tue, Mar 6, 2012 at 12:00 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Tue, Mar 6, 2012 at 4:42 PM, Robert Haas robertmh...@gmail.com wrote:

 As to whether we should increment pd_pagesize_version, I'm not sure
 quite what you were saying about that (I think you may have an extra
 or missing word there), but I don't think it's necessary here.

 I said this...

 On Tue, Mar 6, 2012 at 3:40 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Do you know of any PostgreSQL variant that can set this byte range to
 different values?

 Not sure what the missing word is there, so I'll ask again.

 Has EDB or anybody else you know of has used the pd_pagesize_version
 field for something else, so you'd rather I didn't touch that?

To my knowledge, EDB has not ever stamped pages with a different
pd_pagesize_version, or reused that field for any other purpose, but
I'm not sure about Greenplum or others.  I have seen at least one
pg_filedump output where every page was stamped with the same very odd
value there; I don't remember what the value was, and I don't know
where the output I was sent came from, but it was enough to make me
wonder if somebody's done that.

That's not why I want to leave that field alone, though: I want to
leave that field alone for backward and forward compatibility, so that
any version of community PostgreSQL ever released - and any page
inspection tools, current or future - can look at the low-order byte
of that field and know what page version they've got.  If we didn't
have some other bytes in the page header that seem relatively useless
(like pd_tli and the high bits of pd_pagesize_version), I'd be arguing
for extending the page header rather than clobbering the version
number.  It just seems to me that the page version number is
absolutely the most basic piece of information on the page, which you
must have before you can interpret the rest of the page contents.  For
it to have any value in identifying past page format changes, future
page format changes, or format changes by forks of our main code base,
it's got to be present in every release and have the same meaning in
each one.  If we do a release where 2 can mean the checksum is 2
rather than the page was written by PostgreSQL 8.0 then we've
forever lost the ability to decide, without the use of heuristics,
what kind of page we've got.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Checksums, state of play

2012-03-06 Thread Heikki Linnakangas

On 06.03.2012 19:00, Simon Riggs wrote:

On Tue, Mar 6, 2012 at 4:42 PM, Robert Haasrobertmh...@gmail.com  wrote:


As to whether we should increment pd_pagesize_version, I'm not sure
quite what you were saying about that (I think you may have an extra
or missing word there), but I don't think it's necessary here.


I said this...

On Tue, Mar 6, 2012 at 3:40 PM, Simon Riggssi...@2ndquadrant.com  wrote:

Do you know of any PostgreSQL variant that can set this byte range to
different values?


Not sure what the missing word is there, so I'll ask again.

Has EDB or anybody else you know of has used the pd_pagesize_version
field for something else, so you'd rather I didn't touch that?


The EDB page format is exactly the same as the community one. Thanks for 
asking.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Checksums, state of play

2012-03-06 Thread Robert Haas
On Tue, Mar 6, 2012 at 11:50 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Tue, Mar 6, 2012 at 4:06 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Tue, Mar 6, 2012 at 3:31 PM, Simon Riggs si...@2ndquadrant.com wrote:

 I'll keep an open mind for now about database/table level. I'm not
 sure how possible/desirable each is.

 Table level sounds great, but how will it work with recovery? We don't
 have a relcache in Startup process.

 So either database or tablespace level seems doable.

 Even db or ts level is problematic.

 Options

 (1) Recovery ignores checksums until db in consistent state

 (2) Recovery ignores checksums until all databases are enabled, when
 we set flag in pg_control

 (3) Recovery checks blocks marked as having a checksum, no matter the
 overall state

How about combining #1 and #3?  If the database isn't consistent yet
(and thus we can't look at pg_database) then we rely on the blocks
themselves to tell us whether they have checksums.  Once we reach
consistency we can do better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Checksums, state of play

2012-03-06 Thread Simon Riggs
On Tue, Mar 6, 2012 at 5:14 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

 Has EDB or anybody else you know of has used the pd_pagesize_version
 field for something else, so you'd rather I didn't touch that?


 The EDB page format is exactly the same as the community one. Thanks for
 asking.

No problem, anytime. Just wanted to make sure downstream wasn't affected.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Checksums, state of play

2012-03-06 Thread Simon Riggs
On Tue, Mar 6, 2012 at 5:14 PM, Robert Haas robertmh...@gmail.com wrote:

 Options

 (1) Recovery ignores checksums until db in consistent state

 (2) Recovery ignores checksums until all databases are enabled, when
 we set flag in pg_control

 (3) Recovery checks blocks marked as having a checksum, no matter the
 overall state

 How about combining #1 and #3?  If the database isn't consistent yet
 (and thus we can't look at pg_database) then we rely on the blocks
 themselves to tell us whether they have checksums.  Once we reach
 consistency we can do better.

We can change state then, but to what? We don't have a relcache.

Maybe that puts us back at Square #1. Will think

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Checksums, state of play

2012-03-06 Thread Robert Haas
On Tue, Mar 6, 2012 at 12:23 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Tue, Mar 6, 2012 at 5:14 PM, Robert Haas robertmh...@gmail.com wrote:
 Options

 (1) Recovery ignores checksums until db in consistent state

 (2) Recovery ignores checksums until all databases are enabled, when
 we set flag in pg_control

 (3) Recovery checks blocks marked as having a checksum, no matter the
 overall state

 How about combining #1 and #3?  If the database isn't consistent yet
 (and thus we can't look at pg_database) then we rely on the blocks
 themselves to tell us whether they have checksums.  Once we reach
 consistency we can do better.

 We can change state then, but to what? We don't have a relcache.

If the state is per-database or per-tablespace, you can read
pg_database at that point and see what the flag says.  If it's
per-relation, then I agree: you still don't have enough information.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Checksums, state of play

2012-03-06 Thread Bruce Momjian
On Tue, Mar 06, 2012 at 09:25:17AM -0500, Robert Haas wrote:
  2. Turning checksums on/off/on/off in rapid succession can cause false
  positive reports of checksum failure if crashes occur and are ignored.
  That may lead to the feature and PostgreSQL being held in disrepute.
 
 This I do think is a problem, although not for precisely the reason
 stated here.  In my experience, in data corruption situations, the
 first thing customers do is blame PostgreSQL: they don't believe it's
 the hardware; they accuse us of having bugs in our code.  Having a
 checksum feature would be valuable, because, first, we'd perhaps
 detect problems sooner and, second, people understand what checksums
 are and that checksum failures really shouldn't happen unless the
 hardware is bad.  More generally, one of the purposes of checksums is
 to distinguish hardware failure from other possible causes of data
 corruption problems.  If there are code paths where checksum failures
 can happy despite the hardware being good, I think that the patch will
 fail to accomplish its goal of giving us confidence that the hardware
 is bad.

I think the turning checksums on/off/on/off is really a killer
problem, and obviously many of the actions needed to make it safe make
the checksum feature itself less useful.  

One crazy idea would be to have a checksum _version_ number somewhere on
the page and in pg_controldata.  When you turn on checksums, you
increment that value, and all new checksum pages get that checksum
version;  if you turn off checksums, we just don't check them anymore,
but they might get incorrect due to a hint bit write and a crash.  When
you turn on checksums again, you increment the checksum version again,
and only check pages having the _new_ checksum version.

Yes, this does add additional storage requirements for the checksum, but
I don't see another clean option.  If you can spare one byte, that gives
you 255 times to turn on checksums;   after that, you have to
dump/reload to use the checksum feature.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Checksums, state of play

2012-03-06 Thread Bruce Momjian
On Mon, Mar 05, 2012 at 03:03:18PM +, Simon Riggs wrote:
 To avoid any confusion as to where this proposed feature is now, I'd
 like to summarise my understanding, make proposals and also request
 clear feedback on them.
 
 Checksums have a number of objections to them outstanding.
 
 1. We don't need them because there will be something better in a
 later release. I don't think anybody disagrees that a better solution
 is possible in the future; doubts have been expressed as to what will
 be required and when that is likely to happen. Opinions differ. We can
 and should do something now unless there is reason not to.

Obviously, one reason not to do this now is that we are way past time to
be designing any feature.  As much as I like how it has progressed and
how it handles pg_upgrade issues, I don't think anyone can state that
this feature is ready to go, and considering how far we are into the
last commit-fest, I think we can fairly say this patch has gotten good
review and return it with feedback.  We can keep discussing it (and I
just posted some ideas myself), but I don't think we can any longer
pretend that this is going into Postgres 9.2.

 2. Turning checksums on/off/on/off in rapid succession can cause false
 positive reports of checksum failure if crashes occur and are ignored.
 That may lead to the feature and PostgreSQL being held in disrepute.
 This can be resolved, if desired, by having having a two-stage
 enabling process where we must issue a command that scans every block
 in the database before checksum checking begins. VACUUM is easily
 modified to the task, we just need to agree that is suitable and agree
 syntax.
 A suggestion is VACUUM ENABLE CHECKSUMS; others are possible.

There is no question that if this feature is done badly, it will make
Postgres look bad, and no one wants that.  As nice as this features is,
making Postgres look bad can't justify it as it currently sits.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Checksums, state of play

2012-03-06 Thread Simon Riggs
On Tue, Mar 6, 2012 at 5:50 PM, Bruce Momjian br...@momjian.us wrote:

 One crazy idea would be to have a checksum _version_ number somewhere on
 the page and in pg_controldata.  When you turn on checksums, you
 increment that value, and all new checksum pages get that checksum
 version;  if you turn off checksums, we just don't check them anymore,
 but they might get incorrect due to a hint bit write and a crash.  When
 you turn on checksums again, you increment the checksum version again,
 and only check pages having the _new_ checksum version.

 Yes, this does add additional storage requirements for the checksum, but
 I don't see another clean option.  If you can spare one byte, that gives
 you 255 times to turn on checksums;   after that, you have to
 dump/reload to use the checksum feature.

I like the idea very much actually. But I'll let you argue the case
for using pd_pagesize_version for that with your esteemed colleagues.

It would be pretty safe to just let it wrap.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Checksums, state of play

2012-03-06 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 That's not why I want to leave that field alone, though: I want to
 leave that field alone for backward and forward compatibility, so that
 any version of community PostgreSQL ever released - and any page
 inspection tools, current or future - can look at the low-order byte
 of that field and know what page version they've got.

I've not been following this thread very closely, but FWIW I find the
above argument extremely compelling.  We could get away with relocating
the version identifier in the narrow context of an upgrade from PG 9.x
to 9.y, but the side effects for external tools such as pg_filedump
would be disastrous.

(And yeah, as maintainer for pg_filedump I'm rather biased.)

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Checksums, state of play

2012-03-06 Thread Bruce Momjian
On Tue, Mar 06, 2012 at 06:00:13PM +, Simon Riggs wrote:
 On Tue, Mar 6, 2012 at 5:50 PM, Bruce Momjian br...@momjian.us wrote:
 
  One crazy idea would be to have a checksum _version_ number somewhere on
  the page and in pg_controldata.  When you turn on checksums, you
  increment that value, and all new checksum pages get that checksum
  version;  if you turn off checksums, we just don't check them anymore,
  but they might get incorrect due to a hint bit write and a crash.  When
  you turn on checksums again, you increment the checksum version again,
  and only check pages having the _new_ checksum version.
 
  Yes, this does add additional storage requirements for the checksum, but
  I don't see another clean option.  If you can spare one byte, that gives
  you 255 times to turn on checksums;   after that, you have to
  dump/reload to use the checksum feature.
 
 I like the idea very much actually. But I'll let you argue the case
 for using pd_pagesize_version for that with your esteemed colleagues.
 
 It would be pretty safe to just let it wrap.

How would we know there are not old unwritten pages sitting around?

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Checksums, state of play

2012-03-06 Thread Simon Riggs
On Tue, Mar 6, 2012 at 5:56 PM, Bruce Momjian br...@momjian.us wrote:
 On Mon, Mar 05, 2012 at 03:03:18PM +, Simon Riggs wrote:
 To avoid any confusion as to where this proposed feature is now, I'd
 like to summarise my understanding, make proposals and also request
 clear feedback on them.

 Checksums have a number of objections to them outstanding.

 1. We don't need them because there will be something better in a
 later release. I don't think anybody disagrees that a better solution
 is possible in the future; doubts have been expressed as to what will
 be required and when that is likely to happen. Opinions differ. We can
 and should do something now unless there is reason not to.

 Obviously, one reason not to do this now is that we are way past time to
 be designing any feature.  As much as I like how it has progressed and
 how it handles pg_upgrade issues, I don't think anyone can state that
 this feature is ready to go, and considering how far we are into the
 last commit-fest, I think we can fairly say this patch has gotten good
 review and return it with feedback.  We can keep discussing it (and I
 just posted some ideas myself), but I don't think we can any longer
 pretend that this is going into Postgres 9.2.

Are you calling time on all patches in this CF, or just this one?

From what you say, this seems to be the reason for your thinking:

On Tue, Mar 6, 2012 at 5:50 PM, Bruce Momjian br...@momjian.us wrote:

 I think the turning checksums on/off/on/off is really a killer
 problem, and obviously many of the actions needed to make it safe make
 the checksum feature itself less useful.

The problem is actually on/off/crash/on in quick succession which is
much less likely.

In the current patch that can be resolved by running a VACUUM to
remove checksums if you want to turn them off completely and safely.
That is easily documented as a procedure for people to follow, like
creating types or setting up replication.

The resolution suggested to that problem is to force a VACUUM to turn
on or turn off. There's very little difference between those two
things other than us forcing the user, which as you point out, being
forced to do that could be worse than the problem we're trying to
solve. From this discussion, ISTM that the force route has
sufficient complexity that we could easily make it less robust and
much more likely to receive criticism.

You can break replication by doing on/off/on as well, for example.

It's simply not a killer issue. Not if you have a reasonable grading
of issues. It's a possibility of a usage annoyance, much less annoying
than having an XML type that doesn't actually allow indexes without
knowledge of C, a hash index that doesn't ever recover or an unlogged
table whose data suddenly disappears after a crash, for example.

ISTM pretty reasonable to document the issue clearly, inform people
how it works and let them choose whether to use it or not. Stopping
features from going out because some people might misuse them isn't a
great plan.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Checksums, state of play

2012-03-06 Thread Bruce Momjian
On Tue, Mar 06, 2012 at 01:52:31PM -0500, Bruce Momjian wrote:
 On Tue, Mar 06, 2012 at 06:00:13PM +, Simon Riggs wrote:
  On Tue, Mar 6, 2012 at 5:50 PM, Bruce Momjian br...@momjian.us wrote:
  
   One crazy idea would be to have a checksum _version_ number somewhere on
   the page and in pg_controldata.  When you turn on checksums, you
   increment that value, and all new checksum pages get that checksum
   version;  if you turn off checksums, we just don't check them anymore,
   but they might get incorrect due to a hint bit write and a crash.  When
   you turn on checksums again, you increment the checksum version again,
   and only check pages having the _new_ checksum version.
  
   Yes, this does add additional storage requirements for the checksum, but
   I don't see another clean option.  If you can spare one byte, that gives
   you 255 times to turn on checksums;   after that, you have to
   dump/reload to use the checksum feature.
  
  I like the idea very much actually. But I'll let you argue the case
  for using pd_pagesize_version for that with your esteemed colleagues.
  
  It would be pretty safe to just let it wrap.
 
 How would we know there are not old unwritten pages sitting around?

Perhaps a full xid wrap-around would allow us to re-use checksum
counters.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Checksums, state of play

2012-03-06 Thread Bruce Momjian
On Tue, Mar 06, 2012 at 07:09:23PM +, Simon Riggs wrote:
 On Tue, Mar 6, 2012 at 5:56 PM, Bruce Momjian br...@momjian.us wrote:
  On Mon, Mar 05, 2012 at 03:03:18PM +, Simon Riggs wrote:
  To avoid any confusion as to where this proposed feature is now, I'd
  like to summarise my understanding, make proposals and also request
  clear feedback on them.
 
  Checksums have a number of objections to them outstanding.
 
  1. We don't need them because there will be something better in a
  later release. I don't think anybody disagrees that a better solution
  is possible in the future; doubts have been expressed as to what will
  be required and when that is likely to happen. Opinions differ. We can
  and should do something now unless there is reason not to.
 
  Obviously, one reason not to do this now is that we are way past time to
  be designing any feature.  As much as I like how it has progressed and
  how it handles pg_upgrade issues, I don't think anyone can state that
  this feature is ready to go, and considering how far we are into the
  last commit-fest, I think we can fairly say this patch has gotten good
  review and return it with feedback.  We can keep discussing it (and I
  just posted some ideas myself), but I don't think we can any longer
  pretend that this is going into Postgres 9.2.
 
 Are you calling time on all patches in this CF, or just this one?
 
 From what you say, this seems to be the reason for your thinking:

I am calling time on any CF patch that requires redesign.

  I think the turning checksums on/off/on/off is really a killer
  problem, and obviously many of the actions needed to make it safe make
  the checksum feature itself less useful.
 
 The problem is actually on/off/crash/on in quick succession which is
 much less likely.

True.

 In the current patch that can be resolved by running a VACUUM to
 remove checksums if you want to turn them off completely and safely.
 That is easily documented as a procedure for people to follow, like
 creating types or setting up replication.
 
 The resolution suggested to that problem is to force a VACUUM to turn
 on or turn off. There's very little difference between those two
 things other than us forcing the user, which as you point out, being
 forced to do that could be worse than the problem we're trying to
 solve. From this discussion, ISTM that the force route has
 sufficient complexity that we could easily make it less robust and
 much more likely to receive criticism.
 
 You can break replication by doing on/off/on as well, for example.
 
 It's simply not a killer issue. Not if you have a reasonable grading
 of issues. It's a possibility of a usage annoyance, much less annoying
 than having an XML type that doesn't actually allow indexes without
 knowledge of C, a hash index that doesn't ever recover or an unlogged
 table whose data suddenly disappears after a crash, for example.
 
 ISTM pretty reasonable to document the issue clearly, inform people
 how it works and let them choose whether to use it or not. Stopping
 features from going out because some people might misuse them isn't a
 great plan.

The feature is no where near complete, and we should not be designing
features at this stage.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Checksums, state of play

2012-03-06 Thread Robert Haas
On Tue, Mar 6, 2012 at 12:50 PM, Bruce Momjian br...@momjian.us wrote:
 I think the turning checksums on/off/on/off is really a killer
 problem, and obviously many of the actions needed to make it safe make
 the checksum feature itself less useful.

 One crazy idea would be to have a checksum _version_ number somewhere on
 the page and in pg_controldata.  When you turn on checksums, you
 increment that value, and all new checksum pages get that checksum
 version;  if you turn off checksums, we just don't check them anymore,
 but they might get incorrect due to a hint bit write and a crash.  When
 you turn on checksums again, you increment the checksum version again,
 and only check pages having the _new_ checksum version.

 Yes, this does add additional storage requirements for the checksum, but
 I don't see another clean option.  If you can spare one byte, that gives
 you 255 times to turn on checksums;   after that, you have to
 dump/reload to use the checksum feature.

I don't see what problem that solves.  It's just taking the problem we
already have and a new failure mode (out of checksum versions) on top
of it.  If you see a page with checksum version 153 and the current
checksum version is 152, then you know that either (a) it is the
result of a previous iteration of turning checksums on or (b) the
checksum version number got corrupted.  This is the exact same problem
we have with using a PD_HAS_CHECKSUM bit.  If the bit is not set, then
you know that either (a) it hasn't been checksummed yet or (b) the bit
got corrupted.   In either case, a single poorly placed bit-flip gives
rise to the exact same confusion.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Checksums, state of play

2012-03-06 Thread Marcin Mańk
On Tue, Mar 06, 2012 at 07:09:23PM +, Simon Riggs wrote:
 The problem is actually on/off/crash/on in quick succession which is
 much less likely.

I must be missing something, but how about:
if (!has_checksums  page_loses_checksum_due_to_hint_bit_write)
 wal_log_the_hint_bit_write();

Greetings
Marcin Mańk

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Checksums, state of play

2012-03-05 Thread Simon Riggs
To avoid any confusion as to where this proposed feature is now, I'd
like to summarise my understanding, make proposals and also request
clear feedback on them.

Checksums have a number of objections to them outstanding.

1. We don't need them because there will be something better in a
later release. I don't think anybody disagrees that a better solution
is possible in the future; doubts have been expressed as to what will
be required and when that is likely to happen. Opinions differ. We can
and should do something now unless there is reason not to.

2. Turning checksums on/off/on/off in rapid succession can cause false
positive reports of checksum failure if crashes occur and are ignored.
That may lead to the feature and PostgreSQL being held in disrepute.
This can be resolved, if desired, by having having a two-stage
enabling process where we must issue a command that scans every block
in the database before checksum checking begins. VACUUM is easily
modified to the task, we just need to agree that is suitable and agree
syntax.
A suggestion is VACUUM ENABLE CHECKSUMS; others are possible.

3. Pages with checksums set need to have a version marking to show
that they are a later version of the page layout. That version number
needs to be extensible to many later versions. Pages of multiple
versions need to exist within the server to allow simple upgrades and
migration.

4. Checksums that are dependent upon a bit setting on the block are
somewhat fragile. Requests have been made to add bits in certain
positions and also to remove them again. No set of bits seems to
please everyone.

(3) and (4) are in conflict with each other, but there is a solution.
We mark the block with a version number, but we don't make the
checking dependant upon the version number. We simply avoid making any
checks until the command to scan all blocks is complete, per point
(2). That way we need to use 1 flag bit to mark the new version and
zero flag bits to indicate checks should happen.

(Various other permutations of solutions for (2), (3), (4) have been
discussed and may also be still open)

5. The part of the page header that can be used as a checksum has been
disputed. Using the 16 bits dedicated to a version number seems like
the least useful consecutive 2 bytes of data in the page header. It
can't be  16 bits because that wouldn't be an effective checksum for
database blocks. We might prefer 32 bits, but that would require use
of some other parts of the page header and possibly split that into
two parts. Splitting the checksum into 2 parts will cause the code to
be more complex and fragile.

6. Performance impacts. Measured to be a small regression.

7. Hint bit setting requires WAL logging. The worst case for that
would be setting hints on newly loaded tables. Work has been done on
other patches to remove that case. If those don't fly, this would be a
cost paid by those that wish to take advantage of this feature.

If there are other points I've missed for whatever reason, please add
them here again for clarity.

My own assessment of the above is that the checksum feature can be
added to 9.2, as long as we agree the changes above and then proceed
to implement them and also that no further serious problems emerge.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Checksums, state of play

2012-03-05 Thread Josh Berkus

 3. Pages with checksums set need to have a version marking to show
 that they are a later version of the page layout. That version number
 needs to be extensible to many later versions. Pages of multiple
 versions need to exist within the server to allow simple upgrades and
 migration.

This is a statement of a problem; do you have a proposed solution for it?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers