Re: [HACKERS] Block-level CRC checks

2009-12-07 Thread Greg Stark
On Fri, Dec 4, 2009 at 10:47 PM, Chuck McDevitt cmcdev...@greenplum.com wrote:
 A curiosity question regarding torn pages:  How does this work on file 
 systems that don't write in-place, but instead always do copy-on-write?

 My example would be Sun's ZFS file system (In Solaris & BSD).  Because of its
 snapshot & rollback functionality, it never writes a page in-place, but
 instead always copies it to another place on disk.  How does this affect the 
 corruption caused by a torn write?

 Can we end up with horrible corruption on this type of filesystem where we 
 wouldn't on normal file systems, where we are writing to a previously zeroed 
 area on disk?

 Sorry if this is a stupid question... Hopefully somebody can reassure me that 
 this isn't an issue.

It's not a stupid question. We're not 100% sure, but we believe ZFS
doesn't need full-page writes because it's immune to torn pages.

I think the idea of ZFS is that the new partially written page isn't
visible because it's not linked into the tree until it's been
completely written. To me this appears to depend on the drive
system ordering writes very strictly, which seems hard to be sure is
happening. Perhaps this is tied to the tricks they do to avoid
contention on the root; if they do a write barrier before every root
update, that seems like it should be sufficient to me, but I don't know
the details at that level.

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Block-level CRC checks

2009-12-05 Thread Greg Stark
It can save space because the line pointers have weaker alignment
requirements. But I don't see any point in the current state.


--
Greg

On 2009-12-04, at 3:48 PM, Tom Lane t...@sss.pgh.pa.us wrote:


Greg Stark gsst...@mit.edu writes:

I'm not sure why I said including ctid. We would have to move
everything transactional to the line pointer, including xmin, xmax,
ctid, all the hint bits, the updated flags, hot flags, etc. The only
things left in the tuple header would be things that have to be there
such as HAS_OIDS, HAS_NULLS, natts, hoff, etc. It would be a pretty
drastic change, though a fairly logical one. I recall someone actually
submitted a patch to separate out the transactional bits anyways a
while back, just to save a few bytes in in-memory tuples. If we could
save on disk-space usage it would be a lot more compelling. But it
doesn't look to me like it really saves enough often enough to be
worth so much code churn.


It would also break things for indexes, which don't need all that stuff
in their line pointers.

More to the point, moving the same bits to someplace else on the page
doesn't save anything at all.

   regards, tom lane




Re: [HACKERS] Block-level CRC checks

2009-12-04 Thread decibel

On Dec 3, 2009, at 1:53 PM, Jonah H. Harris wrote:
On Tue, Dec 1, 2009 at 1:27 PM, Joshua D. Drake j...@commandprompt.com wrote:

On Tue, 2009-12-01 at 13:20 -0500, Robert Haas wrote:
 Does $COMPETITOR offer this feature?


My understanding is that MSSQL does. I am not sure about Oracle. Those
are the only two I run into (I don't run into MySQL at all). I know
others likely compete in the DB2 space.

To my knowledge, MySQL, InnoDB, BerkeleyDB, solidDB, Oracle, SQL Server,
Sybase, DB2, eXtremeDB, RDB, and Teradata all checksum pages.



So... now that the upgrade discussion seems to have died down... was  
any consensus reached on how to do said checksumming?

--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





Re: [HACKERS] Block-level CRC checks

2009-12-04 Thread Simon Riggs
On Fri, 2009-12-04 at 03:32 -0600, decibel wrote:

 So... now that the upgrade discussion seems to have died down... was  
 any consensus reached on how to do said checksumming?

Possibly. Please can you go through the discussion and pull out a
balanced summary of how to proceed? I lost track a while back and I'm
sure many others did also.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Block-level CRC checks

2009-12-04 Thread Massa, Harald Armin
Kevin,

 md5sum of each tuple?  As an optional system column (a la oid)?

I am mainly an application programmer working with PostgreSQL, and I
want to point out an additional use of an md5sum of each
tuple: it makes comparing table contents in replicated / related
databases MUCH more feasible.

I am in the process of adding a user-space myhash column to all my
applications' tables, filled by a trigger on insert / update. It really
speeds up table comparison across databases, and it is very helpful
in debugging replication.
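The scheme described above can be sketched as follows; this is an
illustration only, not Harald's actual trigger (the separator, the NULL
marker, and the function name are assumptions, and a real implementation
would live in a PL/pgSQL trigger rather than in client code):

```python
import hashlib

def row_hash(row):
    # Join column values as text with a separator that cannot appear in
    # the data; NULL needs its own marker so (None, "") != ("", None).
    parts = ["\\N" if v is None else str(v) for v in row]
    return hashlib.md5("\x1f".join(parts).encode("utf-8")).hexdigest()

# Two replicas can compare tables by comparing these per-row digests
# (or a digest of the sorted digests for a whole-table comparison).
assert row_hash((42, "alice", None)) == row_hash((42, "alice", None))
assert row_hash((42, "alice", None)) != row_hash((42, "alice", ""))
```

Column order must be pinned on both sides, or the digests are not
comparable.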

Harald


-- 
GHUM Harald Massa
persuadere et programmare
Harald Armin Massa
Spielberger Straße 49
70435 Stuttgart
0173/9409607
no fx, no carrier pigeon
-
%s is too gigantic of an industry to bend to the whims of reality



Re: [HACKERS] Block-level CRC checks

2009-12-04 Thread Greg Stark
On Fri, Dec 4, 2009 at 9:34 AM, Simon Riggs si...@2ndquadrant.com wrote:

 Possibly. Please can you go through the discussion and pull out a
 balanced summary of how to proceed? I lost track a while back and I'm
 sure many others did also.

I summarized the three feasible plans I think I saw:
407d949e0912011713j63045989j67b7b343ef00c...@mail.gmail.com

-- 
greg



Re: [HACKERS] Block-level CRC checks

2009-12-04 Thread Bruce Momjian
decibel wrote:
 On Dec 3, 2009, at 1:53 PM, Jonah H. Harris wrote:
  On Tue, Dec 1, 2009 at 1:27 PM, Joshua D. Drake j...@commandprompt.com wrote:
  On Tue, 2009-12-01 at 13:20 -0500, Robert Haas wrote:
   Does $COMPETITOR offer this feature?
  
 
  My understanding is that MSSQL does. I am not sure about Oracle. Those
  are the only two I run into (I don't run into MySQL at all). I know
  others likely compete in the DB2 space.
 
  To my knowledge, MySQL, InnoDB, BerkeleyDB, solidDB, Oracle, SQL Server,
  Sybase, DB2, eXtremeDB, RDB, and Teradata all checksum pages.
 
 
 So... now that the upgrade discussion seems to have died down... was  
 any consensus reached on how to do said checksumming?

I think the hint bit has to be added to the item pointer, by using the
offset bits that are already zero, according to Greg Stark.  That
solution leads to easy programming, no expanding hint bit array, and it
is backward compatible so doesn't cause problems for pg_migrator.
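To illustrate why those offset bits are free (a sketch of the bit trick
only, not the actual ItemIdData layout): line-pointer offsets are
MAXALIGN'd, typically to 8 bytes, so the low-order bits of any stored
offset are always zero and can carry hint flags:

```python
ALIGN = 8  # assumed MAXALIGN; page offsets are always multiples of this

def pack(offset, flags):
    # The offset's low bits are guaranteed zero by alignment, so the
    # flags can be OR'd in without losing any information.
    assert offset % ALIGN == 0 and 0 <= flags < ALIGN
    return offset | flags

def unpack(word):
    # Mask the flag bits back out to recover the aligned offset.
    return word & ~(ALIGN - 1), word & (ALIGN - 1)

assert pack(8184, 0b101) == 8189
assert unpack(8189) == (8184, 0b101)
```

Reading the flags back costs only a mask, and since the line-pointer
array sits at the front of the page it is normally already in cache.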

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Block-level CRC checks

2009-12-04 Thread Simon Riggs
On Fri, 2009-12-04 at 07:12 -0500, Bruce Momjian wrote:

 I think the hint bit has to be added to the item pointer, by using the
 offset bits that are already zero, according to Greg Stark.  That
 solution leads to easy programming, no expanding hint bit array, and it
 is backward compatible so doesn't cause problems for pg_migrator.

Seems like a reasonable way forward. 

As I pointed out here
http://archives.postgresql.org/pgsql-hackers/2009-12/msg00056.php
we only need to use 3 bits not 4, but it does limit tuple length to 4096
for all block sizes. (Two different options there for doing that).

An added advantage of this approach is that the cachelines for the item
pointer array will already be in CPU cache, so there is no additional
access time when we set the hint bits when they are moved to their new
position.

I should also point out that removing 4 bits from the tuple header would
allow us to get rid of t_infomask2, reducing tuple length by a further 2
bytes.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Block-level CRC checks

2009-12-04 Thread Alvaro Herrera
BTW with VACUUM FULL removed I assume we're going to get rid of
HEAP_MOVED_IN and HEAP_MOVED_OFF too, right?

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Block-level CRC checks

2009-12-04 Thread Bruce Momjian
Simon Riggs wrote:
 On Fri, 2009-12-04 at 07:12 -0500, Bruce Momjian wrote:
 
  I think the hint bit has to be added to the item pointer, by using the
  offset bits that are already zero, according to Greg Stark.  That
  solution leads to easy programming, no expanding hint bit array, and it
  is backward compatible so doesn't cause problems for pg_migrator.
 
 Seems like a reasonable way forward. 
 
 As I pointed out here
 http://archives.postgresql.org/pgsql-hackers/2009-12/msg00056.php
 we only need to use 3 bits not 4, but it does limit tuple length to 4096
 for all block sizes. (Two different options there for doing that).
 
 An added advantage of this approach is that the cachelines for the item
 pointer array will already be in CPU cache, so there is no additional
 access time when we set the hint bits when they are moved to their new
 position.
 
 I should also point out that removing 4 bits from the tuple header would
 allow us to get rid of t_infomask2, reducing tuple length by a further 2
 bytes.

Wow, that is a nice win.  Does alignment allow us to actually use that
space?

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Block-level CRC checks

2009-12-04 Thread Simon Riggs
On Fri, 2009-12-04 at 07:54 -0500, Bruce Momjian wrote:

  I should also point out that removing 4 bits from the tuple header would
  allow us to get rid of t_infomask2, reducing tuple length by a further 2
  bytes.
 
 Wow, that is a nice win.  Does alignment allow us to actually use that
 space?

It would mean that the header for tables of up to 24 columns would still
fit in 24 bytes, whereas today anything over 8 columns has to pad out to
32 bytes. So in practical terms most tables in your average database
would benefit.
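The arithmetic behind that can be sketched as follows (assumptions: a
23-byte fixed header today and 21 bytes without t_infomask2, a null
bitmap of one bit per column that is present whenever any column is
NULL, and 8-byte MAXALIGN of the data offset):

```python
def header_size(fixed, natts, align=8):
    # fixed header + null bitmap (1 bit per attribute), padded to MAXALIGN
    bitmap = (natts + 7) // 8
    total = fixed + bitmap
    return -(-total // align) * align  # round up to the alignment boundary

assert header_size(23, 8) == 24    # today: up to 8 columns fits in 24 bytes
assert header_size(23, 9) == 32    # ... but a 9th column pads out to 32
assert header_size(21, 24) == 24   # minus t_infomask2: 24 columns still fit
```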

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Block-level CRC checks

2009-12-04 Thread Simon Riggs
On Fri, 2009-12-04 at 09:52 -0300, Alvaro Herrera wrote:

 BTW with VACUUM FULL removed I assume we're going to get rid of
 HEAP_MOVED_IN and HEAP_MOVED_OFF too, right?

Much as I would like to see those go, no. VF code should remain for some
time yet, IMHO. We could remove it, but doing so is not a priority
because it buys us nothing in terms of features, and it's the type of
thing we should do at the start of a release cycle, not the end. I certainly
don't have time to do it, at least.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Block-level CRC checks

2009-12-04 Thread Heikki Linnakangas
Simon Riggs wrote:
 On Fri, 2009-12-04 at 09:52 -0300, Alvaro Herrera wrote:
 
 BTW with VACUUM FULL removed I assume we're going to get rid of
 HEAP_MOVED_IN and HEAP_MOVED_OFF too, right?
 
 Much as I would like to see those go, no. VF code should remain for some
 time yet, IMHO.

I don't think we need to keep VF code otherwise, but I would leave
HEAP_MOVED_IN/OFF support alone for now for in-place upgrade. Otherwise
we need a pre-upgrade script or something to scrub them off.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Block-level CRC checks

2009-12-04 Thread Greg Stark
On Fri, Dec 4, 2009 at 12:57 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Fri, 2009-12-04 at 07:54 -0500, Bruce Momjian wrote:

  I should also point out that removing 4 bits from the tuple header would
  allow us to get rid of t_infomask2, reducing tuple length by a further 2
  bytes.

 Wow, that is a nice win.  Does alignment allow us to actually use that
 space?

 It would mean that tables up to 24 columns wide would still be 24 bytes
 wide, whereas 8 columns now has to fit in 32 bytes. So in practical
 terms most tables would benefit in your average database.

I don't think getting rid of infomask2 wins us 2 bytes so fast. The
rest of those two bytes is natts which of course we still need.

If we lose vacuum full then the table's open for reducing the width of
command id too if we need more bits.  If we do that and we moved
everything we could to the line pointers including ctid we might just
be able to squeeze the tuple overhead down to 16 bytes. That would win
8 bytes per tuple for people with no null columns, or with nulls and a
total of 9-64 columns; but if they have 1-8 columns and any are null it
would actually consume more space. But it looks to me like it would be
very, very tight and require drastic measures -- I think we would be
left with something like 11 bits for commandid and no spare bits in
the tuple header at all.

-- 
greg



Re: [HACKERS] Block-level CRC checks

2009-12-04 Thread Greg Stark
On Fri, Dec 4, 2009 at 1:35 PM, Greg Stark gsst...@mit.edu wrote:
 If we lose vacuum full then the table's open for reducing the width of
 command id too if we need more bits.  If we do that and we moved
 everything we could to the line pointers including ctid we might just
 be able to squeeze the tuple overhead down to 16 bytes.

I'm not sure why I said including ctid. We would have to move
everything transactional to the line pointer, including xmin, xmax,
ctid, all the hint bits, the updated flags, hot flags, etc. The only
things left in the tuple header would be things that have to be there
such as HAS_OIDS, HAS_NULLS, natts, hoff, etc. It would be a pretty
drastic change, though a fairly logical one. I recall someone actually
submitted a patch to separate out the transactional bits anyways a
while back, just to save a few bytes in in-memory tuples. If we could
save on disk-space usage it would be a lot more compelling. But it
doesn't look to me like it really saves enough often enough to be
worth so much code churn.

-- 
greg



Re: [HACKERS] Block-level CRC checks

2009-12-04 Thread Alvaro Herrera
Greg Stark escribió:
 On Fri, Dec 4, 2009 at 1:35 PM, Greg Stark gsst...@mit.edu wrote:
  If we lose vacuum full then the table's open for reducing the width of
  command id too if we need more bits.  If we do that and we moved
  everything we could to the line pointers including ctid we might just
  be able to squeeze the tuple overhead down to 16 bytes.
 
 I'm not sure why I said including ctid. We would have to move
 everything transactional to the line pointer, including xmin, xmax,
 ctid, all the hint bits, the updated flags, hot flags, etc. The only
 things left in the tuple header would be things that have to be there
 such as HAS_OIDS, HAS_NULLS, natts, hoff, etc. It would be a pretty
 drastic change, though a fairly logical one.

Do we need XMAX_EXCL_LOCK and XMAX_SHARED_LOCK to be moved?  It seems to
me that they can stay with the tuple header because they are set by
wal-logged operations.  Same for XMAX_IS_MULTI.  The HASfoo bits are all
set on tuple creation, never touched later, so they can stay in the
header too.  We only need XMIN_COMMITTED, XMIN_INVALID, XMAX_COMMITTED,
XMAX_INVALID, HEAP_COMBOCID on the line pointer AFAICS ... oh, and
HEAP_HOT_UPDATED and HEAP_ONLY_TUPLE, not sure.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Block-level CRC checks

2009-12-04 Thread Alvaro Herrera
Heikki Linnakangas escribió:
 Simon Riggs wrote:
  On Fri, 2009-12-04 at 09:52 -0300, Alvaro Herrera wrote:
  
  BTW with VACUUM FULL removed I assume we're going to get rid of
  HEAP_MOVED_IN and HEAP_MOVED_OFF too, right?
  
  Much as I would like to see those go, no. VF code should remain for some
  time yet, IMHO.
 
 I don't think we need to keep VF code otherwise, but I would leave
 HEAP_MOVED_IN/OFF support alone for now for in-place upgrade. Otherwise
 we need a pre-upgrade script or something to scrub them off.

CRCs are going to need scrubbing anyway, no?  Oh, but you're assuming
that CRCs are optional, so not everybody would need that, right?

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [HACKERS] Block-level CRC checks

2009-12-04 Thread Robert Haas
On Fri, Dec 4, 2009 at 9:48 AM, Alvaro Herrera
alvhe...@commandprompt.com wrote:
 Heikki Linnakangas escribió:
 Simon Riggs wrote:
  On Fri, 2009-12-04 at 09:52 -0300, Alvaro Herrera wrote:
 
  BTW with VACUUM FULL removed I assume we're going to get rid of
  HEAP_MOVED_IN and HEAP_MOVED_OFF too, right?
 
  Much as I would like to see those go, no. VF code should remain for some
  time yet, IMHO.

 I don't think we need to keep VF code otherwise, but I would leave
 HEAP_MOVED_IN/OFF support alone for now for in-place upgrade. Otherwise
 we need a pre-upgrade script or something to scrub them off.

 CRCs are going to need scrubbing anyway, no?  Oh, but you're assuming
 that CRCs are optional, so not everybody would need that, right?

If we can make not only the validity but also the presence of the CRC
field optional, it will simplify things greatly for in-place upgrade,
I think, because the upgrade won't itself require expanding the page.
Turning on the CRC functionality for a particular table may require
expanding the page, but that's a different problem.  :-)

Have we thought about what other things have changed between 8.4 and
8.5 that might cause problems for in-place upgrade?

...Robert



Re: [HACKERS] Block-level CRC checks

2009-12-04 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 As I pointed out here
 http://archives.postgresql.org/pgsql-hackers/2009-12/msg00056.php
 we only need to use 3 bits not 4, but it does limit tuple length to 4096
 for all block sizes. (Two different options there for doing that).

Limiting the tuple length is a deal-breaker.

regards, tom lane



Re: [HACKERS] Block-level CRC checks

2009-12-04 Thread Tom Lane
Greg Stark gsst...@mit.edu writes:
 I'm not sure why I said including ctid. We would have to move
 everything transactional to the line pointer, including xmin, xmax,
 ctid, all the hint bits, the updated flags, hot flags, etc. The only
 things left in the tuple header would be things that have to be there
 such as HAS_OIDS, HAS_NULLS, natts, hoff, etc. It would be a pretty
 drastic change, though a fairly logical one. I recall someone actually
 submitted a patch to separate out the transactional bits anyways a
 while back, just to save a few bytes in in-memory tuples. If we could
 save on disk-space usage it would be a lot more compelling. But it
 doesn't look to me like it really saves enough often enough to be
 worth so much code churn.

It would also break things for indexes, which don't need all that stuff
in their line pointers.

More to the point, moving the same bits to someplace else on the page
doesn't save anything at all.

regards, tom lane



Re: [HACKERS] Block-level CRC checks

2009-12-04 Thread Simon Riggs
On Fri, 2009-12-04 at 10:43 -0500, Tom Lane wrote:
 Simon Riggs si...@2ndquadrant.com writes:
  As I pointed out here
  http://archives.postgresql.org/pgsql-hackers/2009-12/msg00056.php
  we only need to use 3 bits not 4, but it does limit tuple length to 4096
  for all block sizes. (Two different options there for doing that).
 
 Limiting the tuple length is a deal-breaker.

If people who use 32kB block sizes exist in practice, I note that
because tuples are at least 4-byte aligned, the low 2 bits of the
length are always unused. So they're available for those with unusually
long tuples, and could be used to signify high-order bytes, so max
tuple length could be 16384. With tuples that long, it would be better
to assume 8-byte minimum alignment, which would put max tuple length
back up to 32kB again. None of that need affect people with a standard
8192-byte block size.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Block-level CRC checks

2009-12-04 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Have we thought about what other things have changed between 8.4 and
 8.5 that might cause problems for in-place upgrade?

So far, nothing.  We even made Andrew Gierth jump through hoops to
keep hstore's on-disk representation upwards compatible.

regards, tom lane



Re: [HACKERS] Block-level CRC checks

2009-12-04 Thread Simon Riggs
On Fri, 2009-12-04 at 13:35 +, Greg Stark wrote:

 I don't think getting rid of infomask2 wins us 2 bytes so fast. The
 rest of those two bytes is natts which of course we still need.

err, yes, OK.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Block-level CRC checks

2009-12-04 Thread Bruce Momjian
Robert Haas wrote:
 On Fri, Dec 4, 2009 at 9:48 AM, Alvaro Herrera
 alvhe...@commandprompt.com wrote:
  Heikki Linnakangas escribió:
  Simon Riggs wrote:
   On Fri, 2009-12-04 at 09:52 -0300, Alvaro Herrera wrote:
  
   BTW with VACUUM FULL removed I assume we're going to get rid of
   HEAP_MOVED_IN and HEAP_MOVED_OFF too, right?
  
   Much as I would like to see those go, no. VF code should remain for some
   time yet, IMHO.
 
  I don't think we need to keep VF code otherwise, but I would leave
  HEAP_MOVED_IN/OFF support alone for now for in-place upgrade. Otherwise
  we need a pre-upgrade script or something to scrub them off.
 
  CRCs are going to need scrubbing anyway, no?  Oh, but you're assuming
  that CRCs are optional, so not everybody would need that, right?
 
 If we can make not only the validity but also the presence of the CRC
 field optional, it will simplify things greatly for in-place upgrade,
 I think, because the upgrade won't itself require expanding the page.
 Turning on the CRC functionality for a particular table may require
 expanding the page, but that's a different problem.  :-)

Well, I am not sure how we would turn the _space_ used for CRC on and
off because you would have to rewrite the entire table/database to turn
it on, which seems unfortunate.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Block-level CRC checks

2009-12-04 Thread Robert Haas
On Fri, Dec 4, 2009 at 2:04 PM, Bruce Momjian br...@momjian.us wrote:
 Robert Haas wrote:
 On Fri, Dec 4, 2009 at 9:48 AM, Alvaro Herrera
 alvhe...@commandprompt.com wrote:
  Heikki Linnakangas escribió:
  Simon Riggs wrote:
   On Fri, 2009-12-04 at 09:52 -0300, Alvaro Herrera wrote:
  
   BTW with VACUUM FULL removed I assume we're going to get rid of
   HEAP_MOVED_IN and HEAP_MOVED_OFF too, right?
  
   Much as I would like to see those go, no. VF code should remain for some
   time yet, IMHO.
 
  I don't think we need to keep VF code otherwise, but I would leave
  HEAP_MOVED_IN/OFF support alone for now for in-place upgrade. Otherwise
  we need a pre-upgrade script or something to scrub them off.
 
  CRCs are going to need scrubbing anyway, no?  Oh, but you're assuming
  that CRCs are optional, so not everybody would need that, right?

 If we can make not only the validity but also the presence of the CRC
 field optional, it will simplify things greatly for in-place upgrade,
 I think, because the upgrade won't itself require expanding the page.
 Turning on the CRC functionality for a particular table may require
 expanding the page, but that's a different problem.  :-)

 Well, I am not sure how we would turn the _space_ used for CRC on and
 off because you would have to rewrite the entire table/database to turn
 it on, which seems unfortunate.

Well, presumably you're going to have to do some of that work anyway,
because even if the space is set aside you're still going to have to
read the page in, CRC it, and write it back out.  However if the space
is not pre-allocated then you also have to deal with moving tuples to
other pages.  But that problem is going to have to be dealt with
somewhere along the line no matter what we do, because if you're
upgrading an 8.3 or 8.4 system to 8.5, you need to add that space
sometime: either before migration (with a pre-upgrade utility), or
after migration (by some sort of page converter/tuple mover), or only
when/if enabling the CRC feature.

One nice thing about making it the CRC feature's problem to make space
on each page is that people who don't want to use CRCs can still use
those extra 4 bytes/page for data.  That might not be worth the code
complexity if we were starting from scratch, but I'm thinking that
most of the code complexity is a given if we want to also support
in-place upgrade.
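The per-page CRC mechanics under discussion can be sketched like this
(an illustration only: the field offset, CRC polynomial, and page size
here are assumptions, not PostgreSQL's actual layout). The checksum
field is zeroed before computing, so the stored value does not feed back
into itself:

```python
import zlib

PAGE_SIZE = 8192
CRC_OFF = 8  # hypothetical position of the 4-byte checksum in the page

def set_crc(page):
    page[CRC_OFF:CRC_OFF + 4] = b"\x00" * 4      # exclude the old value
    crc = zlib.crc32(bytes(page)) & 0xFFFFFFFF
    page[CRC_OFF:CRC_OFF + 4] = crc.to_bytes(4, "little")

def check_crc(page):
    stored = int.from_bytes(page[CRC_OFF:CRC_OFF + 4], "little")
    scratch = bytearray(page)
    scratch[CRC_OFF:CRC_OFF + 4] = b"\x00" * 4   # recompute the same way
    return stored == (zlib.crc32(bytes(scratch)) & 0xFFFFFFFF)

page = bytearray(PAGE_SIZE)
page[100] = 0x7F                 # some page content
set_crc(page)
assert check_crc(page)
page[200] ^= 0x01                # a single flipped bit (e.g. a torn write)
assert not check_crc(page)
```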

...Robert



Re: [HACKERS] Block-level CRC checks

2009-12-04 Thread Alvaro Herrera
Massa, Harald Armin wrote:

 I am in the process of adding a user-space myhash column to all my
 applications tables, filled by a trigger on insert / update. It really
 speeds up table comparison across databases; and it is very helpfull
 in debugging replications.

Have you seen pg_comparator?

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [HACKERS] Block-level CRC checks

2009-12-04 Thread Bruce Momjian
Robert Haas wrote:
  Well, I am not sure how we would turn the _space_ used for CRC on and
  off because you would have to rewrite the entire table/database to turn
  it on, which seems unfortunate.
 
 Well, presumably you're going to have to do some of that work anyway,
 because even if the space is set aside you're still going to have to
 read the page in, CRC it, and write it back out.  However if the space
 is not pre-allocated then you also have to deal with moving tuples to
 other pages.  But that problem is going to have to be dealt with
 somewhere along the line no matter what we do, because if you're
 upgrading an 8.3 or 8.4 system to 8.5, you need to add that space
 sometime: either before migration (with a pre-upgrade utility), or
 after migration (by some sort of page converter/tuple mover), or only
 when/if enabling the CRC feature.
 
 One nice thing about making it the CRC feature's problem to make space
 on each page is that people who don't want to use CRCs can still use
 those extra 4 bytes/page for data.  That might not be worth the code
 complexity if we were starting from scratch, but I'm thinking that
 most of the code complexity is a given if we want to also support
 in-place upgrade.

My guess is we can find somewhere on an 8.4 heap/index page to add four
bytes.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Block-level CRC checks

2009-12-04 Thread Chuck McDevitt
A curiosity question regarding torn pages:  How does this work on file systems 
that don't write in-place, but instead always do copy-on-write?

My example would be Sun's ZFS file system (In Solaris & BSD).  Because of its
snapshot & rollback functionality, it never writes a page in-place, but
instead always copies it to another place on disk.  How does this affect the 
corruption caused by a torn write?

Can we end up with horrible corruption on this type of filesystem where we 
wouldn't on normal file systems, where we are writing to a previously zeroed 
area on disk?

Sorry if this is a stupid question... Hopefully somebody can reassure me that 
this isn't an issue.



Re: [HACKERS] Block-level CRC checks

2009-12-04 Thread Simon Riggs
On Fri, 2009-12-04 at 14:47 -0800, Chuck McDevitt wrote:
 A curiosity question regarding torn pages:  How does this work on file
 systems that don't write in-place, but instead always do
 copy-on-write?
 
 My example would be Sun's ZFS file system (In Solaris & BSD).  Because
 of its snapshot & rollback functionality, it never writes a page
 in-place, but instead always copies it to another place on disk.  How
 does this affect the corruption caused by a torn write?
 
 Can we end up with horrible corruption on this type of filesystem
 where we wouldn't on normal file systems, where we are writing to a
 previously zeroed area on disk?
 
 Sorry if this is a stupid question... Hopefully somebody can reassure
 me that this isn't an issue.

Think we're still good. Not a stupid question.

Hint bits are set while the block is in shared_buffers and setting a
hint bit dirties the page, but does not write WAL.

Because the page is dirty we re-write the whole block at checkpoint, by
bgwriter cleaning or via dirty page eviction. So ZFS is OK, but we do
more writing than we want to, sometimes.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Block-level CRC checks

2009-12-04 Thread Massa, Harald Armin
 I am in the process of adding a user-space myhash column to all my
 applications tables, filled by a trigger on insert / update. It really
 speeds up table comparison across databases; and it is very helpfull
 in debugging replications.

 Have you seen pg_comparator?

yes, saw the lightning talk at pgday.eu
it also uses md5 hashes, just in its own schema. I guess pg_comparator
would benefit from an integrated MD5 hash.

Harald


-- 
GHUM Harald Massa
persuadere et programmare
Harald Armin Massa
Spielberger Straße 49
70435 Stuttgart
0173/9409607
no fx, no carrier pigeon
-
%s is too gigantic of an industry to bend to the whims of reality



Re: [HACKERS] Block-level CRC checks

2009-12-03 Thread Jonah H. Harris
On Tue, Dec 1, 2009 at 1:27 PM, Joshua D. Drake j...@commandprompt.com wrote:

 On Tue, 2009-12-01 at 13:20 -0500, Robert Haas wrote:
  Does $COMPETITOR offer this feature?
 

 My understanding is that MSSQL does. I am not sure about Oracle. Those
 are the only two I run into (I don't run into MySQL at all). I know
 others likely compete in the DB2 space.


To my knowledge, MySQL, InnoDB, BerkeleyDB, solidDB, Oracle, SQL Server,
Sybase, DB2, eXtremeDB, RDB, and Teradata all checksum pages.

-- 
Jonah H. Harris, Senior DBA
myYearbook.com


Re: [HACKERS] Block-level CRC checks

2009-12-02 Thread Peter Eisentraut
On Tue, 2009-12-01 at 19:41 +, Greg Stark wrote:
  Also, it would
  require reading back each page as it's written to disk, which is OK
 for
  a bunch of single-row writes, but for bulk data loads a significant
 problem.
 
 Not sure what that really means for Postgres. It would just mean
 reading back the same page of memory from the filesystem cache that we
 just wrote.

Surely the file system ought to be the place where to solve this.  After
all, we don't put link-level corruption detection into the libpq
protocol either.




Re: [HACKERS] Block-level CRC checks

2009-12-02 Thread Peter Eisentraut
On Tue, 2009-12-01 at 17:47 -0500, Tom Lane wrote:
 Bruce Momjian br...@momjian.us writes:
  I also like the idea that we don't need to CRC check the line pointers
  because any corruption there is going to appear immediately.  However,
  the bad news is that we wouldn't find the corruption until we try to
  access bad data and might crash.
 
 That sounds exactly like the corruption detection system we have now.
 If you think that behavior is acceptable, we can skip this whole
 discussion.

I think one of the motivations for this CRC business was to detect
corruption in the user data.  As you say, we already handle corruption
in the metadata.




Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Heikki Linnakangas
Simon Riggs wrote:
 There is no creation of corruption events. This scheme detects
 corruption events that *have* occurred. Now I understand that we
 previously would have recovered seamlessly from such events, but they
 were corruption events nonetheless and I think they need to be reported.
 (For why, see Conclusion #2, below).

No, you're still missing the point. The point is *not* random bit errors
affecting hint bits, but the torn page problem. Today, a torn page is a
completely valid and expected behavior from the OS and storage
subsystem. We handle it with full_page_writes, and by relying on the
fact that it's OK for a hint bit set to get lost. With your scheme, a
torn page would become a corrupt page.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Simon Riggs
On Tue, 2009-12-01 at 10:04 +0200, Heikki Linnakangas wrote:
 Simon Riggs wrote:
  There is no creation of corruption events. This scheme detects
  corruption events that *have* occurred. Now I understand that we
  previously would have recovered seamlessly from such events, but they
  were corruption events nonetheless and I think they need to be reported.
  (For why, see Conclusion #2, below).
 
 No, you're still missing the point. The point is *not* random bit errors
 affecting hint bits, but the torn page problem. Today, a torn page is a
 completely valid and expected behavior from the OS and storage
 subsystem. We handle it with full_page_writes, and by relying on the
 fact that it's OK for a hint bit set to get lost. With your scheme, a
 torn page would become a corrupt page.

Well, it's easy to keep going on about how much you think I
misunderstand. But I think that's just misdirection.

The way we handle torn page corruptions *hides* actual corruptions from
us. The frequency of true positives and false positives is important
here. If the false positive ratio is very small, then reporting them is
not a problem because of the benefit we get from having spotted the true
positives. Some convicted murderers didn't do it, but that is not an
argument for letting them all go free (without knowing the details). So
we need to know what the false positive ratio is before we evaluate the
benefit of either reporting or non-reporting possible corruption events.

When do you think torn pages happen? Only at crash, or other times also?
Do they always happen at crash? Are there ways to re-check a block that
has suffered a hint-related torn page issue? Are there ways to isolate
and minimise the reporting of false positives? Those are important
questions and this is not black and white.

If the *only* answer really is we-must-WAL-log everything, then that is
the answer, as an option. I suspect that there is a less strict
possibility, if we question our assumptions and look at the frequencies.

We know that I have no time to work on this; I am just trying to hold
open the door to a few possibilities that we have not fully considered
in a balanced way. And I myself am guilty of having slammed the door
previously. I encourage development of a way forward based upon a
balance of utility.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Bruce Momjian
Simon Riggs wrote:
 The way we handle torn page corruptions *hides* actual corruptions from
 us. The frequency of true positives and false positives is important
 here. If the false positive ratio is very small, then reporting them is
 not a problem because of the benefit we get from having spotted the true
 positives. Some convicted murderers didn't do it, but that is not an
 argument for letting them all go free (without knowing the details). So
 we need to know what the false positive ratio is before we evaluate the
 benefit of either reporting or non-reporting possible corruption events.
 
 When do you think torn pages happen? Only at crash, or other times also?
 Do they always happen at crash? Are there ways to re-check a block that
 has suffered a hint-related torn page issue? Are there ways to isolate
 and minimise the reporting of false positives? Those are important
 questions and this is not black and white.
 
 If the *only* answer really is we-must-WAL-log everything, then that is
 the answer, as an option. I suspect that there is a less strict
 possibility, if we question our assumptions and look at the frequencies.
 
 We know that I have no time to work on this; I am just trying to hold
 open the door to a few possibilities that we have not fully considered
 in a balanced way. And I myself am guilty of having slammed the door
 previously. I encourage development of a way forward based upon a
 balance of utility.

I think the problem boils down to what the user response should be to a
corruption report.  If it is a torn page, it would be corrected and the
user doesn't have to do anything.  If it is something that is not
correctable, then the user has corruption and/or bad hardware. I think
the problem is that the existing proposal can't distinguish between
these two cases so the user has no idea how to respond to the report.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Simon Riggs
On Tue, 2009-12-01 at 06:35 -0500, Bruce Momjian wrote:
 Simon Riggs wrote:
  The way we handle torn page corruptions *hides* actual corruptions from
  us. The frequency of true positives and false positives is important
  here. If the false positive ratio is very small, then reporting them is
  not a problem because of the benefit we get from having spotted the true
  positives. Some convicted murderers didn't do it, but that is not an
  argument for letting them all go free (without knowing the details). So
  we need to know what the false positive ratio is before we evaluate the
  benefit of either reporting or non-reporting possible corruption events.
  
  When do you think torn pages happen? Only at crash, or other times also?
  Do they always happen at crash? Are there ways to re-check a block that
  has suffered a hint-related torn page issue? Are there ways to isolate
  and minimise the reporting of false positives? Those are important
  questions and this is not black and white.
  
  If the *only* answer really is we-must-WAL-log everything, then that is
  the answer, as an option. I suspect that there is a less strict
  possibility, if we question our assumptions and look at the frequencies.
  
  We know that I have no time to work on this; I am just trying to hold
  open the door to a few possibilities that we have not fully considered
  in a balanced way. And I myself am guilty of having slammed the door
  previously. I encourage development of a way forward based upon a
  balance of utility.
 
 I think the problem boils down to what the user response should be to a
 corruption report.  If it is a torn page, it would be corrected and the
 user doesn't have to do anything.  If it is something that is not
 correctable, then the user has corruption and/or bad hardware. 

 I think
 the problem is that the existing proposal can't distinguish between
 these two cases so the user has no idea how to respond to the report.

If 99.5% of cases are real corruption then there is little need to
distinguish between the cases, nor much value in doing so. The
prevalence of the different error types is critical to understanding how
to respond.

If a man pulls a gun on you, your first thought isn't some people
remove guns from their jacket to polish them, so perhaps he intends to
polish it now because the prevalence of shootings is high, when faced
by people with guns, and the risk of dying is also high. You make a
judgement based upon the prevalence and the risk. 

That is all I am asking for us to do here, make a balanced call. These
recent comments are a change in my own position, based upon evaluating
the prevalence and the risk. I ask others to consider the same line of
thought rather than a black/white assessment.

All useful detection mechanisms have non-zero false positives because we
would rather sometimes ring the bell for no reason than to let bad
things through silently, as we do now.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Bruce Momjian
Simon Riggs wrote:
  I think
  the problem is that the existing proposal can't distinguish between
  these two cases so the user has no idea how to respond to the report.
 
 If 99.5% of cases are real corruption then there is little need to
 distinguish between the cases, nor much value in doing so. The
 prevalence of the different error types is critical to understanding how
 to respond.
 
 If a man pulls a gun on you, your first thought isn't some people
 remove guns from their jacket to polish them, so perhaps he intends to
 polish it now because the prevalence of shootings is high, when faced
 by people with guns, and the risk of dying is also high. You make a
 judgement based upon the prevalence and the risk. 
 
 That is all I am asking for us to do here, make a balanced call. These
 recent comments are a change in my own position, based upon evaluating
 the prevalence and the risk. I ask others to consider the same line of
 thought rather than a black/white assessment.
 
 All useful detection mechanisms have non-zero false positives because we
 would rather sometimes ring the bell for no reason than to let bad
 things through silently, as we do now.

OK, but what happens if someone gets the failure report, assumes their
hardware is faulty and replaces it, and then gets a failure report
again?  I assume torn pages are 99% of the reported problem, which are
expected and are fixed, and bad hardware 1%, quite the opposite of your
numbers above.

What might be interesting is to report CRC mismatches if the database
was shut down cleanly previously;  I think in those cases we shouldn't
have torn pages.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Robert Haas
On Mon, Nov 30, 2009 at 3:27 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Simon Riggs wrote:
 Proposal

 * We reserve enough space on a disk block for a CRC check. When a dirty
 block is written to disk we calculate and annotate the CRC value, though
 this is *not* WAL logged.

 Imagine this:
 1. A hint bit is set. It is not WAL-logged, but the page is dirtied.
 2. The buffer is flushed out of the buffer cache to the OS. A new CRC is
 calculated and stored on the page.
 3. Half of the page is flushed to disk (aka torn page problem). The CRC
 made it to disk but the flipped hint bit didn't.

 You now have a page with incorrect CRC on disk.

This is probably a stupid question, but why doesn't the other half of
the page make it to disk?  Somebody pulls the plug first?

...Robert



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Bruce Momjian
Robert Haas wrote:
 On Mon, Nov 30, 2009 at 3:27 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
  Simon Riggs wrote:
  Proposal
 
  * We reserve enough space on a disk block for a CRC check. When a dirty
  block is written to disk we calculate and annotate the CRC value, though
  this is *not* WAL logged.
 
  Imagine this:
  1. A hint bit is set. It is not WAL-logged, but the page is dirtied.
  2. The buffer is flushed out of the buffer cache to the OS. A new CRC is
  calculated and stored on the page.
  3. Half of the page is flushed to disk (aka torn page problem). The CRC
  made it to disk but the flipped hint bit didn't.
 
  You now have a page with incorrect CRC on disk.
 
 This is probably a stupid question, but why doesn't the other half of
 the page make it to disk?  Somebody pulls the plug first?

Yep, disk sectors are 512 bytes, so you might get only some of the 16
512-byte sectors of an 8K page to disk, or a 512-byte sector might be
partially written.  Full page writes fix these on recovery.
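The failure sequence described upthread can be sketched with a toy simulation. The 4-byte CRC slot at the front of the page and the sector arithmetic are illustrative assumptions for this sketch, not PostgreSQL's actual page layout:

```python
import zlib

PAGE_SIZE, SECTOR = 8192, 512        # 16 sectors per 8K page

def checksum(page):
    # CRC guards everything after a hypothetical 4-byte header slot
    return zlib.crc32(bytes(page[4:]))

# Old on-disk image: zeros, stamped with a valid CRC.
old = bytearray(PAGE_SIZE)
old[0:4] = checksum(old).to_bytes(4, "little")

# In memory: set a "hint bit" near the end of the page, re-stamp the CRC.
new = bytearray(old)
new[8000] |= 0x01
new[0:4] = checksum(new).to_bytes(4, "little")

# Torn write: only the first 8 of 16 sectors reach disk before the crash,
# so the new CRC (sector 0) lands but the flipped bit (sector 15) does not.
ondisk = bytearray(old)
ondisk[:8 * SECTOR] = new[:8 * SECTOR]

stored = int.from_bytes(ondisk[0:4], "little")
print(stored == checksum(ondisk))  # False: the page now fails verification
```

Nothing here is hardware corruption; the mismatch arises purely from the write being split across sectors, which is why a naive CRC turns an expected torn page into a reported corrupt page.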

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Simon Riggs
On Tue, 2009-12-01 at 07:05 -0500, Bruce Momjian wrote:

 I assume torn pages are 99% of the reported problem, which are
 expected and are fixed, and bad hardware 1%, quite the opposite of your
 numbers above.

On what basis do you make that assumption?

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Bruce Momjian
Simon Riggs wrote:
 On Tue, 2009-12-01 at 07:05 -0500, Bruce Momjian wrote:
 
  I assume torn pages are 99% of the reported problem, which are
  expected and are fixed, and bad hardware 1%, quite the opposite of your
  numbers above.
 
 On what basis do you make that assumption?

Because we added full page write protection to fix the reported problem
of torn pages, which we had on occasion;  now we don't.  Bad hardware
reports are less frequent.

And we know we can reproduce torn pages by shutting off power to a server
without battery-backed cache.  We don't know how to produce I/O failures
on demand.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Bruce Momjian
bruce wrote:
 What might be interesting is to report CRC mismatches if the database
 was shut down cleanly previously;  I think in those cases we shouldn't
 have torn pages.

Sorry, stupid idea on my part.  We don't WAL log hint bit changes so
there is no guarantee the page is in WAL on recovery.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Simon Riggs
On Tue, 2009-12-01 at 07:58 -0500, Bruce Momjian wrote:
 bruce wrote:
  What might be interesting is to report CRC mismatches if the database
  was shut down cleanly previously;  I think in those cases we shouldn't
  have torn pages.
 
 Sorry, stupid idea on my part.  We don't WAL log hint bit changes so
 there is no guarantee the page is in WAL on recovery.

I thought it was a reasonable idea. We would need to re-check CRCs after
a crash and zero any that mismatched. Then we can start checking them
again as we run.

In any case, it seems strange to do nothing to protect the database in
normal running just because there is one type of problem that occurs
when we crash.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Simon Riggs
On Tue, 2009-12-01 at 07:42 -0500, Bruce Momjian wrote:
 Simon Riggs wrote:
  On Tue, 2009-12-01 at 07:05 -0500, Bruce Momjian wrote:
  
   I assume torn pages are 99% of the reported problem, which are
   expected and are fixed, and bad hardware 1%, quite the opposite of your
   numbers above.
  
  On what basis do you make that assumption?
 
 Because we added full page write protection to fix the reported problem
 of torn pages, which we had on occasion;  now we don't.  Bad hardware
 reports are less frequent.

Bad hardware reports are infrequent because we lack a detection system
for them, which is the topic of this thread. It would be circular to
argue that as a case against.

It's also an argument that only affects crashes.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Heikki Linnakangas
Bruce Momjian wrote:
 What might be interesting is to report CRC mismatches if the database
 was shut down cleanly previously;  I think in those cases we shouldn't
 have torn pages.

Unfortunately that's not true. You can crash, leading to a torn page,
and then start up the database and shut it down cleanly. The torn page
is still there, even though the last shutdown was a clean one.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread marcin mank
On Mon, Nov 30, 2009 at 9:27 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Simon Riggs wrote:
 Proposal

 * We reserve enough space on a disk block for a CRC check. When a dirty
 block is written to disk we calculate and annotate the CRC value, though
 this is *not* WAL logged.

 Imagine this:
 1. A hint bit is set. It is not WAL-logged, but the page is dirtied.
 2. The buffer is flushed out of the buffer cache to the OS. A new CRC is
 calculated and stored on the page.
 3. Half of the page is flushed to disk (aka torn page problem). The CRC
 made it to disk but the flipped hint bit didn't.

 You now have a page with incorrect CRC on disk.


What if we treated the hint bits as all-zeros for the purpose of CRC
calculation? This would exclude them from the checksum.
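The masking idea is easy to sketch. The hint-bit offsets below are made-up placeholders; in reality the hint bits live in tuple headers scattered across the page, which is exactly the modularity and copying cost raised in the replies:

```python
import zlib

# Hypothetical (offset, mask) positions of hint bits -- illustrative only.
HINT_BITS = [(24, 0b0011), (1032, 0b0001)]

def crc_ignoring_hints(page):
    buf = bytearray(page)            # the extra page copy the objections mention
    for off, mask in HINT_BITS:
        buf[off] &= ~mask & 0xFF     # treat hint bits as zero for the checksum
    return zlib.crc32(bytes(buf))

page = bytearray(8192)
before = crc_ignoring_hints(page)
page[24] |= 0b0001                   # a hint-bit update
after = crc_ignoring_hints(page)
print(before == after)  # True: the checksum is insensitive to hint-bit flips
```

A torn write that loses only hint-bit flips would then still verify, at the price of copying the page and knowing where every hint bit lives at write time.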


Greetings
Marcin Mańk



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Andres Freund
On Tuesday 01 December 2009 14:38:26 marcin mank wrote:
 On Mon, Nov 30, 2009 at 9:27 PM, Heikki Linnakangas
 
 heikki.linnakan...@enterprisedb.com wrote:
  Simon Riggs wrote:
  Proposal
 
  * We reserve enough space on a disk block for a CRC check. When a dirty
  block is written to disk we calculate and annotate the CRC value, though
  this is *not* WAL logged.
 
  Imagine this:
  1. A hint bit is set. It is not WAL-logged, but the page is dirtied.
  2. The buffer is flushed out of the buffer cache to the OS. A new CRC is
  calculated and stored on the page.
  3. Half of the page is flushed to disk (aka torn page problem). The CRC
  made it to disk but the flipped hint bit didn't.
 
  You now have a page with incorrect CRC on disk.
 
 What if we treated the hint bits as all-zeros for the purpose of CRC
 calculation? This would exclude them from the checksum.
That sounds like doing a complete copy of the wal page zeroing specific fields 
and then doing wal - rather expensive I would say. Both, during computing the 
checksum and checking it...

Andres



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Aidan Van Dyk
* Andres Freund and...@anarazel.de [091201 08:42]:
 On Tuesday 01 December 2009 14:38:26 marcin mank wrote:
  On Mon, Nov 30, 2009 at 9:27 PM, Heikki Linnakangas
  
  heikki.linnakan...@enterprisedb.com wrote:
   Simon Riggs wrote:
   Proposal
  
   * We reserve enough space on a disk block for a CRC check. When a dirty
   block is written to disk we calculate and annotate the CRC value, though
   this is *not* WAL logged.
  
   Imagine this:
   1. A hint bit is set. It is not WAL-logged, but the page is dirtied.
   2. The buffer is flushed out of the buffer cache to the OS. A new CRC is
   calculated and stored on the page.
   3. Half of the page is flushed to disk (aka torn page problem). The CRC
   made it to disk but the flipped hint bit didn't.
  
   You now have a page with incorrect CRC on disk.
  
  What if we treated the hint bits as all-zeros for the purpose of CRC
  calculation? This would exclude them from the checksum.
 That sounds like doing a complete copy of the wal page zeroing specific 
 fields 
 and then doing wal - rather expensive I would say. Both, during computing the 
 checksum and checking it...

No, it has nothing to do with WAL, it has to do with when writing
pages out... You already double-buffer them (to avoid the page
changing while you checksum it) before calling write, but the code
writing (and then reading) pages doesn't currently have to know all the
internal stuff needed decide what's a hint bit and what's not...

And adding that information into the buffer in/out would be a huge wart
on the modularity of the PG code...

a.

-- 
Aidan Van Dyk Create like a god,
ai...@highrise.ca   command like a king,
http://www.highrise.ca/   work like a slave.




Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Robert Haas
On Tue, Dec 1, 2009 at 8:30 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Bruce Momjian wrote:
 What might be interesting is to report CRC mismatches if the database
 was shut down cleanly previously;  I think in those cases we shouldn't
 have torn pages.

 Unfortunately that's not true. You can crash, leading to a torn page,
 and then start up the database and shut it down cleanly. The torn page
 is still there, even though the last shutdown was a clean one.

Thinking through this, as I understand it, in order to prevent this
problem, you'd need to be able to predict at recovery time which pages
might have been torn by the unclean shutdown.  In order to do that,
you'd need to know which pages were waiting to be written to disk at
the time of the shutdown.  For ordinary page modifications, that's not
a problem, because there will be WAL records for those pages that need
to be replayed, and we could recompute the CRC at the same time.  But
for hint bit changes, there's no persistent state that would tell us
which hint bits were in the midst of being flipped when the system
went down, so the only way to make sure all the CRCs are correct would
be to rescan every page in the entire cluster and recompute every CRC.

Is that right?

...Robert



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Andres Freund
On Tuesday 01 December 2009 15:26:21 Aidan Van Dyk wrote:
 * Andres Freund and...@anarazel.de [091201 08:42]:
  On Tuesday 01 December 2009 14:38:26 marcin mank wrote:
   On Mon, Nov 30, 2009 at 9:27 PM, Heikki Linnakangas
  
   heikki.linnakan...@enterprisedb.com wrote:
Simon Riggs wrote:
Proposal
   
* We reserve enough space on a disk block for a CRC check. When a
dirty block is written to disk we calculate and annotate the CRC
value, though this is *not* WAL logged.
   
Imagine this:
1. A hint bit is set. It is not WAL-logged, but the page is dirtied.
2. The buffer is flushed out of the buffer cache to the OS. A new CRC
is calculated and stored on the page.
3. Half of the page is flushed to disk (aka torn page problem). The
CRC made it to disk but the flipped hint bit didn't.
   
You now have a page with incorrect CRC on disk.
  
   What if we treated the hint bits as all-zeros for the purpose of CRC
   calculation? This would exclude them from the checksum.
 
  That sounds like doing a complete copy of the wal page zeroing specific
  fields and then doing wal - rather expensive I would say. Both, during
  computing the checksum and checking it...

 No, it has nothing to do with WAL, it has to do with when writing
 pages out... You already double-buffer them (to avoid the page
 changing while you checksum it) before calling write, but the code
 writing (and then reading) pages doesn't currently have to know all the
 internal stuff needed decide what's a hint bit and what's not...
err, yes. That WAL slipped in, sorry. But it would still either mean a third 
copy of the page or a rather complex jumping around on the page...

Andres



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Heikki Linnakangas
Robert Haas wrote:
 On Tue, Dec 1, 2009 at 8:30 AM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 Bruce Momjian wrote:
 What might be interesting is to report CRC mismatches if the database
 was shut down cleanly previously;  I think in those cases we shouldn't
 have torn pages.
 Unfortunately that's not true. You can crash, leading to a torn page,
 and then start up the database and shut it down cleanly. The torn page
 is still there, even though the last shutdown was a clean one.
 
 Thinking through this, as I understand it, in order to prevent this
 problem, you'd need to be able to predict at recovery time which pages
 might have been torn by the unclean shutdown.  In order to do that,
 you'd need to know which pages were waiting to be written to disk at
 the time of the shutdown.  For ordinary page modifications, that's not
 a problem, because there will be WAL records for those pages that need
 to be replayed, and we could recompute the CRC at the same time.  But
 for hint bit changes, there's no persistent state that would tell us
 which hint bits were in the midst of being flipped when the system
 went down, so the only way to make sure all the CRCs are correct would
 be to rescan every page in the entire cluster and recompute every CRC.
 
 Is that right?

Yep.

Even if rescanning every page in the cluster was feasible from a
performance point-of-view, it would make the CRC checking a lot less
useful. It's not hard to imagine that when a hardware glitch happens
causing corruption, it also causes the system to crash. Recalculating
the CRCs after crash would mask the corruption.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Simon Riggs
On Tue, 2009-12-01 at 15:30 +0200, Heikki Linnakangas wrote:
 Bruce Momjian wrote:
  What might be interesting is to report CRC mismatches if the database
  was shut down cleanly previously;  I think in those cases we shouldn't
  have torn pages.
 
 Unfortunately that's not true. You can crash, leading to a torn page,
 and then start up the database and shut it down cleanly. The torn page
 is still there, even though the last shutdown was a clean one.

There seems to be two ways forwards: journalling or fsck.

We can either

* WAL-log all changes to a page (journalling) (8-byte overhead)

* After a crash, disable CRC checks until a full database scan has either
re-checked each CRC or found a mismatch; report mismatches in the LOG
and then reset the CRC. (fsck) (8-byte overhead)

Both of which can be optimised in various ways.

Also, we might

* Put all hint bits in the block header to allow them to be excluded
more easily from CRC checking. If we used 3 more bits from
ItemIdData.lp_len (limiting tuple length to 4096) then we could store
some hints in the item pointer. HEAP_XMIN_INVALID can be stored as
LP_DEAD, since that will happen very quickly anyway. 
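As a rough sketch of the exclusion idea: compute the page CRC over a copy with the hint-bit area masked out, so hint-only updates can never invalidate a stored CRC. (The fixed 8-byte region at offset 24 and the use of CRC-32 here are assumptions of the sketch, not PostgreSQL's actual page layout.)

```python
import zlib

PAGE_SIZE = 8192
HINT_START, HINT_END = 24, 32  # hypothetical fixed hint-bit area in the header

def page_crc(page: bytes) -> int:
    # Zero the hint-bit area before hashing, so setting hint bits after
    # the CRC was computed cannot make the page look corrupt.
    buf = bytearray(page)
    buf[HINT_START:HINT_END] = bytes(HINT_END - HINT_START)
    return zlib.crc32(bytes(buf))
```

With this scheme a torn write that lost only hint-bit updates still verifies cleanly, while any other changed byte fails the check.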

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Robert Haas
On Tue, Dec 1, 2009 at 9:40 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Even if rescanning every page in the cluster was feasible from a
 performance point-of-view, it would make the CRC checking a lot less
 useful. It's not hard to imagine that when a hardware glitch happens
 causing corruption, it also causes the system to crash. Recalculating
 the CRCs after crash would mask the corruption.

Yeah.  Thanks for the explanation - I think I understand the problem now.

...Robert



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Florian Weimer
* Simon Riggs:

 * Put all hint bits in the block header to allow them to be excluded
 more easily from CRC checking. If we used 3 more bits from
 ItemIdData.lp_len (limiting tuple length to 4096) then we could store
 some hints in the item pointer. HEAP_XMIN_INVALID can be stored as
 LP_DEAD, since that will happen very quickly anyway. 

What about putting the whole visibility information out-of-line, into
its own B-tree, indexed by page number?

-- 
Florian Weimerfwei...@bfk.de
BFK edv-consulting GmbH   http://www.bfk.de/
Kriegsstraße 100  tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Tom Lane
Florian Weimer fwei...@bfk.de writes:
 What about putting the whole visibility information out-of-line, into
 its own B-tree, indexed by page number?

Hint bits need to be *cheap* to examine.  Otherwise there's little
point in having them at all.

regards, tom lane



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Simon Riggs
On Tue, 2009-12-01 at 16:40 +0200, Heikki Linnakangas wrote:

 It's not hard to imagine that when a hardware glitch happens
 causing corruption, it also causes the system to crash. Recalculating
 the CRCs after crash would mask the corruption.

They are already masked from us, so continuing to mask those errors
would not put us in a worse position.

If we are saying that 99% of page corruptions are caused at crash time
because of torn pages on hint bits, then only WAL logging can help us
find the 1%. I'm not convinced that is an accurate or safe assumption
and I'd at least like to see LOG entries showing what happened.

ISTM we could go for two levels of protection. CRC checks and scanner
for Level 1 protection, then full WAL logging for Level 2 protection.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On Tue, 2009-12-01 at 16:40 +0200, Heikki Linnakangas wrote:
 It's not hard to imagine that when a hardware glitch happens
 causing corruption, it also causes the system to crash. Recalculating
 the CRCs after crash would mask the corruption.

 They are already masked from us, so continuing to mask those errors
 would not put us in a worse position.

No, it would just destroy a large part of the argument for why this
is worth doing.  We detect disk errors ... except for ones that happen
during a database crash.  Say what?

The fundamental problem with this is the same as it's been all along:
the tradeoff between implementation work expended, performance overhead
added, and net number of real problems detected (with a suitably large
demerit for actually *introducing* problems) just doesn't look
attractive.  You can make various compromises that improve one or two of
these factors at the cost of making the others worse, but at the end of
the day I've still not seen a combination that seems worth doing.

regards, tom lane



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Robert Haas
On Tue, Dec 1, 2009 at 10:35 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Tue, 2009-12-01 at 16:40 +0200, Heikki Linnakangas wrote:

 It's not hard to imagine that when a hardware glitch happens
 causing corruption, it also causes the system to crash. Recalculating
 the CRCs after crash would mask the corruption.

 They are already masked from us, so continuing to mask those errors
 would not put us in a worse position.

 If we are saying that 99% of page corruptions are caused at crash time
 because of torn pages on hint bits, then only WAL logging can help us
 find the 1%. I'm not convinced that is an accurate or safe assumption
 and I'd at least like to see LOG entries showing what happened.

It may or may not be true that most page corruptions happen at crash
time, but it's certainly false that they are caused at crash time
*because of torn pages on hint bits*.   If only part of a block is
written to disk and the unwritten parts contain hint-bit changes -
that's not corruption.  That's design behavior.  Any CRC system needs
to avoid complaining about errors when that happens because otherwise
people will think that their database is corrupted and their hardware
is faulty when in reality it is not.

If we could find a way to put the hint bits in the same 512-byte block
as the CRC, that might do it, but I'm not sure whether that is
possible.

Ignoring CRC errors after a crash until we've re-CRC'd the entire
database will certainly eliminate the bogus error reports, but it
seems likely to mask a large percentage of legitimate errors.  For
example, suppose that I write 1MB of data out to disk and then don't
access it for a year.   During that time the data is corrupted.  Then
the system crashes.  Upon recovery, since there's no way of knowing
whether hint bits on those pages were being updated at the time of the
crash, the system re-CRCs the corrupted data and declares it known
good.  Six months later, I try to access the data and find out that
it's bad.  Sucks to be me.

Now consider the following alternative scenario: I write the block to
disk.  Five minutes later, without an intervening crash, I read it
back in and it's bad.  Yeah, the system detects it.

Which is more likely?  I'm not an expert on disk failure modes, but my
intuition is that the first one will happen often enough to make us
look silly.  Is it 10%?  20%?  50%?  I don't know.  But ISTM that a
CRC system that has no ability to determine whether a system is still
ok post-crash is not a compelling proposition, even though it might
still be able to detect some problems.

...Robert



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Joshua D. Drake
On Tue, 2009-12-01 at 07:05 -0500, Bruce Momjian wrote:
  
  All useful detection mechanisms have non-zero false positives because we
  would rather sometimes ring the bell for no reason than to let bad
  things through silently, as we do now.
 
 OK, but what happens if someone gets the failure report, assumes their
 hardware is faulty and replaces it, and then gets a failure report
 again? 

They are stupid? Nobody just replaces hardware. You test it.

We can't fix stupid.

Joshua D. Drake

-- 
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering
If the world pushes look it in the eye and GRR. Then push back harder. - 
Salamander




Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Simon Riggs
On Tue, 2009-12-01 at 10:55 -0500, Tom Lane wrote:
 Simon Riggs si...@2ndquadrant.com writes:
  On Tue, 2009-12-01 at 16:40 +0200, Heikki Linnakangas wrote:
  It's not hard to imagine that when a hardware glitch happens
  causing corruption, it also causes the system to crash. Recalculating
  the CRCs after crash would mask the corruption.
 
  They are already masked from us, so continuing to mask those errors
  would not put us in a worse position.
 
 No, it would just destroy a large part of the argument for why this
 is worth doing.  We detect disk errors ... except for ones that happen
 during a database crash.  Say what?

I know what I said sounds ridiculous, I'm just trying to keep my mind
open about the tradeoffs. The way to detect 100% of corruptions is to
WAL-log 100% of writes to blocks and we know that sucks performance -
twas me that said it in the original discussion. I'm trying to explore
whether we can detect 100% of other errors at some intermediate
percentage of WAL-logging. If we decide that there isn't an intermediate
position worth taking, I'm happy, as long it was a fact-based decision.

 The fundamental problem with this is the same as it's been all along:
 the tradeoff between implementation work expended, performance overhead
 added, and net number of real problems detected (with a suitably large
 demerit for actually *introducing* problems) just doesn't look
 attractive.  You can make various compromises that improve one or two of
 these factors at the cost of making the others worse, but at the end of
 the day I've still not seen a combination that seems worth doing.

I agree. But also I do believe there are people that care enough about
this to absorb a performance hit and the new features in 8.5 will bring
in a new crop of people that care about those things very much.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Joshua D. Drake
On Tue, 2009-12-01 at 10:55 -0500, Tom Lane wrote:
 Simon Riggs si...@2ndquadrant.com writes:
  On Tue, 2009-12-01 at 16:40 +0200, Heikki Linnakangas wrote:
  It's not hard to imagine that when a hardware glitch happens
  causing corruption, it also causes the system to crash. Recalculating
  the CRCs after crash would mask the corruption.
 
  They are already masked from us, so continuing to mask those errors
  would not put us in a worse position.
 
 No, it would just destroy a large part of the argument for why this
 is worth doing.  We detect disk errors ... except for ones that happen
 during a database crash.  Say what?
 
 The fundamental problem with this is the same as it's been all along:
 the tradeoff between implementation work expended, performance overhead
 added, and net number of real problems detected (with a suitably large
 demerit for actually *introducing* problems) just doesn't look
 attractive.  You can make various compromises that improve one or two of
 these factors at the cost of making the others worse, but at the end of
 the day I've still not seen a combination that seems worth doing.

Let me try a different but similar perspective. The problem we are
trying to solve here only matters to a very small subset of the people
actually using PostgreSQL: specifically, those using PostgreSQL in a
situation where they can lose many thousands of dollars per minute or
hour should an outage occur.

On the other hand it is those very people that are *paying* people to
try and implement these features. Kind of a catch-22.

The hard core reality is this. *IF* it is one of the goals of this
project to ensure that the software can be safely, effectively, and
responsibly operated in a manner that is acceptable to C* level people
in a Fortune level company then we *must* solve this problem.

If it is not the goal of the project, leave it to EDB/CMD/2ndQuadrant
to fork it because it will eventually happen. Our customers are
demanding these features.

Sincerely,

Joshua D. Drake


-- 
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering
If the world pushes look it in the eye and GRR. Then push back harder. - 
Salamander




Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Bruce Momjian
Simon Riggs wrote:
 Also, we might
 
 * Put all hint bits in the block header to allow them to be excluded
 more easily from CRC checking. If we used 3 more bits from
 ItemIdData.lp_len (limiting tuple length to 4096) then we could store
 some hints in the item pointer. HEAP_XMIN_INVALID can be stored as
 LP_DEAD, since that will happen very quickly anyway. 

OK, here is another idea, maybe crazy:

When we read in a page that has an invalid CRC, we check the page to see
which hint bits are _not_ set, and we try setting them to see if we can
get a matching CRC.  If there are no missing hint bits and the CRC
doesn't match, we know the page is corrupt.  If two hint bits are
missing, we can try setting one or both of them and see if we can get a
matching CRC.  If we can, the page is OK; if not, it is corrupt.

Now if 32 hint bits are missing but could be set based on transaction
status, then we would need to test 2^32 possible hint bit combinations,
so we can't do the test and we just assume the page is valid.

I have no idea what percentage of corruption this would detect, but it
might have minimal overhead because the overhead only happens when we
detect a non-matching CRC due to a crash of some sort.
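The guessing loop sketched naively (the candidate hint-bit positions would in reality come from inspecting tuple headers; here they are simply passed in, and the cost is 2^n for n candidate bits):

```python
import itertools
import zlib

def try_hint_combinations(page, stored_crc, unset_bits):
    """Try setting every subset of the currently-unset hint bits (given
    as (byte_offset, bitmask) pairs) and report whether any combination
    reproduces the stored CRC.  Cost grows as 2^len(unset_bits)."""
    for k in range(len(unset_bits) + 1):
        for combo in itertools.combinations(unset_bits, k):
            buf = bytearray(page)
            for off, mask in combo:
                buf[off] |= mask  # guess: this hint bit was set before the crash
            if zlib.crc32(bytes(buf)) == stored_crc:
                return True  # some hint-bit assignment matches; assume page OK
    return False  # no assignment matches; page is corrupt
```

The `k = 0` iteration checks the page as read, so an intact page passes without any guessing.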

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Robert Haas
On Tue, Dec 1, 2009 at 1:02 PM, Joshua D. Drake j...@commandprompt.com wrote:
 The hard core reality is this. *IF* it is one of the goals of this
 project to ensure that the software can be safely, effectively, and
 responsibly operated in a manner that is acceptable to C* level people
 in a Fortune level company then we *must* solve this problem.

 If it is not the goal of the project, leave it to EDB/CMD/2ndQuadrant
 to fork it because it will eventually happen. Our customers are
 demanding these features.

OK, and when you fork it, how do you plan to implement it?  The
problem AFAICS is not that anyone hugely dislikes the feature; it's
that nobody is really clear on how to implement it in a way that's
actually useful.

So far the only somewhat reasonable suggestions I've seen seem to be:

1. WAL-log setting the hint bits.  If you don't like the resulting
performance, shut off the feature.

2. Rearrange the page so that all the hint bits are in the first 512
bytes along with the CRC, so that there can be no torn pages.  AFAICS,
no one has rendered judgment on whether this is a feasible solution.

Does $COMPETITOR offer this feature?

...Robert



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Simon Riggs
On Tue, 2009-12-01 at 13:05 -0500, Bruce Momjian wrote:
 Simon Riggs wrote:
  Also, we might
  
  * Put all hint bits in the block header to allow them to be excluded
  more easily from CRC checking. If we used 3 more bits from
  ItemIdData.lp_len (limiting tuple length to 4096) then we could store
  some hints in the item pointer. HEAP_XMIN_INVALID can be stored as
  LP_DEAD, since that will happen very quickly anyway. 
 
 OK, here is another idea, maybe crazy:

When there's nothing else left, crazy wins.

 When we read in a page that has an invalid CRC, we check the page to see
 which hint bits are _not_ set, and we try setting them to see if we can
 get a matching CRC.  If there are no missing hint bits and the CRC
 doesn't match, we know the page is corrupt.  If two hint bits are
 missing, we can try setting one or both of them and see if we can get a
 matching CRC.  If we can, the page is OK; if not, it is corrupt.
 
 Now if 32 hint bits are missing but could be set based on transaction
 status, then we would need to test 2^32 possible hint bit combinations,
 so we can't do the test and we just assume the page is valid.
 
 I have no idea what percentage of corruption this would detect, but it
 might have minimal overhead because the overhead only happens when we
 detect a non-matching CRC due to a crash of some sort.

Perhaps we could store a sector-based parity bit for each 512 bytes in
the block: the bit is set while an even number of hint bits in that
sector are set, and unset when the count is odd. So whenever we set a
hint bit we flip the parity bit for that sector. That way we could
detect which sectors are potentially missing hint-bit updates, in an
effort to minimize the number of combinations we need to test. That
would require only 16 bits for an 8192-byte block; we store them next
to the CRC, so we know they were never altered separately. So the total
overhead is 6 bytes.
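The read-side recomputation of those parity bits might look like this (hint-bit locations within the page are taken as given; one parity bit per 512-byte sector):

```python
SECTOR = 512

def hint_parity_word(page, hint_positions):
    """Compute a 16-bit word with one parity bit per 512-byte sector:
    bit s is the parity of the number of set hint bits whose byte
    offset falls in sector s.  hint_positions is (byte_offset, mask)."""
    word = 0
    for off, mask in hint_positions:
        if page[off] & mask:
            word ^= 1 << (off // SECTOR)  # each set hint bit flips its sector's parity
    return word
```

XORing the stored word with the recomputed one flags, in each set bit, a sector whose hint bits may have been torn. Note a sector that lost an even number of hint-bit updates would still pass, so this only narrows the search rather than proving the page intact.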

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Joshua D. Drake
On Tue, 2009-12-01 at 13:20 -0500, Robert Haas wrote:
 On Tue, Dec 1, 2009 at 1:02 PM, Joshua D. Drake j...@commandprompt.com 
 wrote:
  The hard core reality is this. *IF* it is one of the goals of this
  project to ensure that the software can be safely, effectively, and
  responsibly operated in a manner that is acceptable to C* level people
  in a Fortune level company then we *must* solve this problem.
 
  If it is not the goal of the project, leave it to EDB/CMD/2ndQuadrant
  to fork it because it will eventually happen. Our customers are
  demanding these features.
 
 OK, and when you fork it, how do you plan to implement it? 

Hey man, I am not an engineer :P. You know that. I am just speaking the
pressures that some of us are having in the marketplace about these
types of features.

 no one has rendered judgment on whether this is a feasible solution.
 
 Does $COMPETITOR offer this feature?
 

My understanding is that MSSQL does. I am not sure about Oracle. Those
are the only two I run into (I don't run into MySQL at all). I know
others likely compete in the DB2 space.

Sincerely,

Joshua D. Drake



-- 
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering
If the world pushes look it in the eye and GRR. Then push back harder. - 
Salamander




Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Tom Lane
Bruce Momjian br...@momjian.us writes:
 OK, here is another idea, maybe crazy:

 When we read in a page that has an invalid CRC, we check the page to see
 which hint bits are _not_ set, and we try setting them to see if we can
 get a matching CRC.  If there are no missing hint bits and the CRC
 doesn't match, we know the page is corrupt.  If two hint bits are
 missing, we can try setting one or both of them and see if we can get a
 matching CRC.  If we can, the page is OK; if not, it is corrupt.

 Now if 32 hint bits are missing but could be set based on transaction
 status, then we would need to test 2^32 possible hint bit combinations,
 so we can't do the test and we just assume the page is valid.

A typical page is going to have something like 100 tuples, so
potentially 2^400 combinations to try.  I don't see this being
realistic from that standpoint.  What's much worse is that to even
find the potentially missing hint bits, you need to make very strong
assumptions about the validity of the rest of the page.

The suggestions that were made upthread about moving the hint bits
could resolve the second objection, but once you do that you might
as well just exclude them from the CRC and eliminate the guessing.

regards, tom lane



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On Tue, 2009-12-01 at 13:05 -0500, Bruce Momjian wrote:
 When we read in a page that has an invalid CRC, we check the page to see
 which hint bits are _not_ set, and we try setting them to see if we can
 get a matching CRC.

 Perhaps we could store a sector-based parity bit each 512 bytes in the
 block. If there are an even number of hint bits set, if odd we unset the
 parity bit. So whenever we set a hint bit we flip the parity bit for
 that sector. That way we could detect which sectors are potentially
 missing in an effort to minimize the number of combinations we need to
 test.

Actually, the killer problem with *any* scheme involving guessing
is that each bit you guess translates directly to removing one bit
of confidence from the CRC value.  If you try to guess at as many
as 32 bits, it is practically guaranteed that you will find a
combination that makes a 32-bit CRC appear to match.  Well before
that, you have degraded the reliability of the error detection to
the point that there's no point.
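That degradation is easy to quantify: guessing over k unknown bits means testing up to 2^k candidate pages against one stored 32-bit CRC, so the chance of at least one purely accidental match is roughly 1 - (1 - 2^-32)^(2^k), i.e. about 2^(k-32) for small k (this assumes the CRC behaves like a uniform 32-bit hash, which is an idealization):

```python
def false_match_probability(guessed_bits: int, crc_bits: int = 32) -> float:
    # Probability that at least one of the 2^guessed_bits candidate
    # pages collides with the stored CRC purely by chance.
    trials = 2 ** guessed_bits
    return 1.0 - (1.0 - 2.0 ** -crc_bits) ** trials
```

Guessing 8 bits is nearly harmless (about 6e-8), but at 32 guessed bits a spurious match is already more likely than not (about 63%), which is the point being made here.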

The bottom line here seems to be that the only practical way to do
anything like this is to move the hint bits into their own area of
the page, and then exclude them from the CRC.  Are we prepared to
once again blow off any hope of in-place update for another release
cycle?

regards, tom lane



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Aidan Van Dyk
* Tom Lane t...@sss.pgh.pa.us [091201 13:58]:

 Actually, the killer problem with *any* scheme involving guessing
 is that each bit you guess translates directly to removing one bit
 of confidence from the CRC value.  If you try to guess at as many
 as 32 bits, it is practically guaranteed that you will find a
 combination that makes a 32-bit CRC appear to match.  Well before
 that, you have degraded the reliability of the error detection to
 the point that there's no point.

Exactly.

 The bottom line here seems to be that the only practical way to do
 anything like this is to move the hint bits into their own area of
 the page, and then exclude them from the CRC.  Are we prepared to
 once again blow off any hope of in-place update for another release
 cycle?

Well, *I* think that if we're ever going to have really reliable in-place
upgrades that we can expect to function release after release, we're
going to need to be able to read in old-version pages and convert them
to current-version pages, for some set of old versions (I'd be happy
with $VERSION-1)...  But I don't see that happening any time soon...

But I'm not loading TB of data either, my largest clusters are a couple
of gigs, so I acknowledge my priorities are probably quite different
then some of the companies driving a lot of the heavy development.

a.
-- 
Aidan Van Dyk Create like a god,
ai...@highrise.ca   command like a king,
http://www.highrise.ca/   work like a slave.




Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Greg Stark
On Tue, Dec 1, 2009 at 6:41 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Bruce Momjian br...@momjian.us writes:
 OK, here is another idea, maybe crazy:

 When we read in a page that has an invalid CRC, we check the page to see
 which hint bits are _not_ set, and we try setting them to see if we can
 get a matching CRC.

Unfortunately you would also have to try *unsetting* every hint bit as
well, since the updated hint bits might have made it to disk but not the
CRC, leaving the old CRC for the block with the unset bits.

I actually independently had the same thought today that Simon had of
moving the hint bits to the line pointer. We can obtain more free bits
in the line pointers by dividing the item offsets and sizes by
maxalign if we need it. That should give at least 4 spare bits which
is all we need for the four VALID/INVALID hint bits.
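A toy packing showing how dividing by maxalign frees bits (PostgreSQL's real ItemIdData is a 15-bit offset, 2 flag bits, and 15-bit length; the field order and hint placement below are invented for illustration, assuming MAXALIGN of 8):

```python
MAXALIGN = 8  # heap tuple offsets and lengths are MAXALIGN multiples

def pack_item(offset: int, flags: int, length: int, hints: int) -> int:
    # Storing offset/8 and length/8 needs 3 fewer bits each than the
    # raw 15-bit fields, freeing room for 4 hint bits (plus 2 spare).
    assert offset % MAXALIGN == 0 and length % MAXALIGN == 0
    assert flags < 4 and hints < 16
    return ((offset // MAXALIGN) << 20) | (flags << 18) | \
           ((length // MAXALIGN) << 6) | (hints << 2)

def unpack_item(word: int):
    return ((word >> 20) * MAXALIGN,           # item offset
            (word >> 18) & 0x3,                # LP_* flags
            ((word >> 6) & 0xFFF) * MAXALIGN,  # item length
            (word >> 2) & 0xF)                 # hint bits
```

Because the hint bits occupy a fixed field of every 32-bit line pointer, a CRC routine could mask them out cheaply while scanning the line pointer array.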

It should be relatively cheap to skip the hint bits in the line
pointers since they'll be the same bits of every 16-bit value for a
whole range. Alternatively we could just CRC the tuples and assume a
corrupted line pointer will show itself quickly. That would actually
make it faster than a straight CRC of the whole block -- making
lemonade out of lemons as it were.

There's still the all-tuples-in-page-are-visible hint bit and the hint
bits in btree pages. I'm not sure if those are easier or harder to
solve. We might be able to assume the all-visible flag will not be
torn from the crc as long as they're within the same 512 byte sector.
And iirc the btree hint bits are in the line pointers themselves as
well?

Another thought is that we could use the MSSQL-style torn-page
detection of including a counter (or even a bit?) in every 512-byte
chunk which gets incremented every time the page is written. If they
don't all match when read in, then the page was torn and we can't check
the CRC. That gets us the advantage that we can inform the user that a
torn page was detected, so they know that they must absolutely use
full_page_writes on their system. Currently users are in the dark about
whether their system is susceptible to torn pages and have no idea with
what frequency they occur. Even here there are quite divergent opinions
about their frequency and about which systems are susceptible to them
or immune.
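A minimal model of that torn-page stamp, loosely based on the described MSSQL scheme (the reserved last byte of each 512-byte sector and the 8-bit counter are assumptions of this sketch):

```python
SECTOR = 512

def stamp_page(page: bytearray, write_counter: int) -> None:
    """Before writing, store the low byte of the page's write counter
    in the (reserved) last byte of every 512-byte sector."""
    stamp = write_counter & 0xFF
    for s in range(0, len(page), SECTOR):
        page[s + SECTOR - 1] = stamp

def page_is_torn(page: bytes) -> bool:
    # If the sectors carry different stamps, only part of the last
    # write made it to disk: the page is torn and the CRC is unusable.
    stamps = {page[s + SECTOR - 1] for s in range(0, len(page), SECTOR)}
    return len(stamps) > 1
```

A mismatch distinguishes "torn write, CRC unverifiable" from "CRC verified bad," which is exactly the diagnostic users currently lack.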

-- 
greg



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Tom Lane
Greg Stark gsst...@mit.edu writes:
 Another thought is that would could use the MSSQL-style torn page
 detection of including a counter (or even a bit?) in every 512-byte
 chunk which gets incremented every time the page is written.

I think we can dismiss that idea, or any idea involving a per-page
status value, out of hand.  The implications for tuple layout are
just too messy.  I'm not especially thrilled with making assumptions
about the underlying device block size anyway.

regards, tom lane



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Josh Berkus
All,

I feel strongly that we should be verifying pages on write, or at least
providing the option to do so, because hardware is simply not reliable.
 And a lot of our biggest users are having issues; it seems pretty much
guaranteed that if you have more than 20 postgres servers, at least one
of them will have bad memory, bad RAID and/or a bad driver.

(and yes, InnoDB, DB2 and Oracle do provide tools to detect hardware
corruption when it happens. Oracle even provides correction tools.  We
are *way* behind them in this regard)

There are two primary conditions we are testing for:

(a) bad RAM, which happens as frequently as 8% of the time on commodity
servers, and given a sufficient amount of RAM happens 99% of the time
due to quantum effects, and
(b) bad I/O, in the form of bad drivers, bad RAID, and/or bad disks.

Our users potentially want to take two kinds of action on this:

1. detect the corruption immediately when it happens, so that they can
effectively troubleshoot the cause of the corruption, and potentially
shut down the database before further corruption occurs and while they
still have clean backups.

2. make an attempt to fix the corrupted page before/immediately after it
is written.

Further, based on talking to some of these users who are having chronic
and not-debuggable issues on their sets of 100's of PostgreSQL servers,
there are some other specs:

-- Many users would be willing to sacrifice significant performance (up
to 20%) as a start-time option in order to be corruption-proof.
-- Even more users would only be interested in using the anti-corruption
options after they know they have a problem to troubleshoot it, and then
turn the corruption detection back off.

So, based on my conversations with users, what we really want is a
solution which does (1) for both (a) and (b) as a start-time option, and
having significant performance overhead for this option is OK.

Now, do block-level CRCs qualify?

The problem I have with CRC checks is that they only detect bad I/O and
are completely unable to detect data corruption due to bad memory.  This
means that really we want a different solution which can detect both bad
RAM and bad I/O, and we should only fall back on CRC checks if we're
unable to devise one.

One of the things Simon and I talked about in Japan is that most of the
time, data corruption makes the data page and/or tuple unreadable.  So,
checking data format for readable pages and tuples (and index nodes)
both before and after write to disk (the latter would presumably be
handled by the bgwriter and/or checkpointer) would catch a lot of kinds
of corruption before they had a chance to spread.

However, that solution would not detect subtle corruption, like
single-bit-flipping issues caused by quantum errors.  Also, it would
require reading back each page as it's written to disk, which is OK for
a bunch of single-row writes but a significant problem for bulk data loads.

So, what I'm saying is that I think we really want a better solution,
and am throwing this out there to see if anyone is clever enough.

--Josh Berkus



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Kevin Grittner
Josh Berkus j...@agliodbs.com wrote:
 
 And a lot of our biggest users are having issues; it seems pretty
 much guaranteed that if you have more than 20 postgres servers, at
 least one of them will have bad memory, bad RAID and/or a bad
 driver.
 
Huh?!?  We have about 200 clusters running on about 100 boxes, and
we see that very rarely.  On about 100 older boxes, relegated to
less critical tasks, we see a failure maybe three or four times per
year.  It's usually not subtle, and a sane backup and redundant
server policy has kept us from suffering much pain from these.  I'm
not questioning the value of adding features to detect corruption,
but your numbers are hard to believe.
 
 The problem I have with CRC checks is that they only detect bad
 I/O and are completely unable to detect data corruption due to bad
 memory. This means that really we want a different solution which
 can detect both bad RAM and bad I/O, and should only fall back on
 CRC checks if we're unable to devise one.
 
md5sum of each tuple?  As an optional system column (a la oid)?
 
 checking data format for readable pages and tuples (and index
 nodes) both before and after write to disk
 
Given that PostgreSQL goes through the OS, and many of us are using
RAID controllers with BBU RAM, how do you do a read with any
confidence that it came from the disk?  (I mean, I know how to do
that for a performance test, but as a routine step during production
use?)
 
-Kevin



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Greg Stark
On Tue, Dec 1, 2009 at 7:19 PM, Josh Berkus j...@agliodbs.com wrote:
 However, that solution would not detect subtle corruption, like
 single-bit-flipping issues caused by quantum errors.

Well there is a solution for this, ECC RAM. There's *no* software
solution for it. The corruption can just as easily happen the moment
you write the value before you calculate any checksum or in the
register holding the value before you even write it. Or it could occur
the moment after you finish checking the checksum. Also you're not
going to be able to be sure you're checking the actual dram and not
the L2 cache or the processor's L1/L0 caches.

ECC RAM solves this problem properly and it does work. There's not
much point in paying a much bigger cost for an ineffective solution.

 Also, it would
 require reading back each page as it's written to disk, which is OK for
 a bunch of single-row writes, but for bulk data loads a significant problem.

Not sure what that really means for Postgres. It would just mean
reading back the same page of memory from the filesystem cache that we
just read.

It sounds like you're describing fsyncing every single page to disk
and then waiting 1min/7200 or even 1min/15k to do a direct read for every
single page -- that's not a 20% performance hit though. We would have
to change our mascot from the elephant to a snail I think.

You could imagine a more complex solution where you have a separate
process wait until the next checkpoint then do direct reads for all
the blocks written since the previous checkpoint (which have now been
fsynced) and verify that the block on disk has a verifiable CRC. I'm
not sure even direct reads let you get the block on disk if someone
else has written the block into cache though. If you could then this
sounds like it could be made to work efficiently (with sequential
bitmap-style scans) and could be quite handy. What I like about that
is you could deprioritize this process's i/o so that it didn't impact
the main processing. As things stand this wouldn't detect pages
written because they were dirtied by hint bit updates but that could
be addressed a few different ways.
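A rough sketch of such a verifier pass, in Python for brevity (the 4-byte
CRC field at the start of each page is an invented on-disk layout for
illustration, not the real page header):

```python
import struct
import zlib

BLCKSZ = 8192            # PostgreSQL's default block size
CRC_OFF, CRC_LEN = 0, 4  # invented: CRC stored in the first 4 bytes of the page

def verify_segment(path):
    """Scan a relation segment block by block, recompute each page's CRC,
    and return the block numbers whose stored CRC does not match."""
    bad = []
    with open(path, "rb") as f:
        blkno = 0
        while True:
            page = f.read(BLCKSZ)
            if not page:
                break
            # Skip short reads and all-zero (never-initialized) pages.
            if len(page) == BLCKSZ and any(page):
                stored, = struct.unpack_from("<I", page, CRC_OFF)
                computed = zlib.crc32(page[CRC_OFF + CRC_LEN:]) & 0xFFFFFFFF
                if stored != computed:
                    bad.append(blkno)
            blkno += 1
    return bad
```

Run at low I/O priority after a checkpoint, a pass like this would only
ever read blocks that have already been fsynced, as described above.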

-- 
greg



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Robert Haas
On Tue, Dec 1, 2009 at 2:06 PM, Aidan Van Dyk ai...@highrise.ca wrote:
 Well, *I* think if we're ever going to have really reliable in-place
 upgrades that we can expect to function release after release, we're
 going to need to be able to read in old version pages, and convert
 them to current version pages, for some set of old versions (I'd be
 happy with $VERSION-1)...  But I don't see that happening any time
 soon...

I agree.  I've attempted to make this point before - as has Zdenek -
and been scorned for it, but I don't think it's become any less true
for all of that.  I don't think you have to look much further than the
limitations on upgrading from 8.3 to 8.4 to conclude that the current
strategy is always going to be pretty hit or miss.

http://cvs.pgfoundry.org/cgi-bin/cvsweb.cgi/~checkout~/pg-migrator/pg_migrator/README?rev=1.59content-type=text/plain

...Robert



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Greg Stark
On Tue, Dec 1, 2009 at 7:51 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Dec 1, 2009 at 2:06 PM, Aidan Van Dyk ai...@highrise.ca wrote:
 Well, *I* think if we're ever going to have really reliable in-place
 upgrades that we can expect to function release after release, we're
 going to need to be able to read in old version pages, and convert
 them to current version pages, for some set of old versions (I'd be
 happy with $VERSION-1)...  But I don't see that happening any time
 soon...

 I agree.  I've attempted to make this point before - as has Zdenek -
 and been scorned for it, but I don't think it's become any less true
 for all of that.  I don't think you have to look much further than the
 limitations on upgrading from 8.3 to 8.4 to conclude that the current
 strategy is always going to be pretty hit or miss.

I find that hard to understand. I believe the consensus is that an
on-demand page-level migration strategy like Aidan described is
precisely the plan when it's necessary to handle page format changes.
There were no page format changes for 8.3-8.4, however, so there's no
point writing dead code until it actually has anything to do. And
there was no point writing it for previous releases because there
was pg_migrator anyways. Zdenek's plan was basically the same, but he
wanted the backend to be able to handle any version page directly,
without conversion, at any time.

Pointing at the 8.3 pg_migrator limitations is irrelevant -- none of
those limitations would be addressed by a page-level migration code
path. They are all data-type redefinitions that can't be fixed without
understanding the table structure and definition. These limitations
would all require adding code to the new version to handle the old data
types and their behaviour and to convert them to the new datatypes when
a tuple is rewritten. In some cases this is really not easy at all.

-- 
greg



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Greg Stark
On Tue, Dec 1, 2009 at 8:04 PM, Greg Stark gsst...@mit.edu wrote:
 And there was no point writing it for previous releases because there
 was **no** pg_migrator anyways.

oops

-- 
greg



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Robert Haas
On Tue, Dec 1, 2009 at 3:04 PM, Greg Stark gsst...@mit.edu wrote:
 I find that hard to understand. I believe the consensus is that an
 on-demand page-level migration strategy like Aidan described is
 precisely the plan when it's necessary to handle page format changes.
 There were no page format changes for 8.3-8.4, however, so there's no
 point writing dead code until it actually has anything to do. And
 there was no point writing it for previous releases because there
 was pg_migrator anyways. Zdenek's plan was basically the same, but he
 wanted the backend to be able to handle any version page directly,
 without conversion, at any time.

 Pointing at the 8.3 pg_migrator limitations is irrelevant -- none of
 those limitations would be addressed by a page-level migration code
 path. They are all data-type redefinitions that can't be fixed without
 understanding the table structure and definition. These limitations
 would all require adding code to the new version to handle the old
 data types and their behaviour and to convert them to the new
 datatypes when a tuple is rewritten. In some cases this is really not
 easy at all.

OK, fair enough.  My implication that only page formats were at issue
was off-base.  My underlying point was that I think we have to be
prepared to write code that can understand old binary formats (on the
tuple, page, or relation level) if we want this to work and work
reliably.  I believe that there has been much resistance to that idea.
If I am wrong, great!

...Robert



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 OK, fair enough.  My implication that only page formats were at issue
 was off-base.  My underlying point was that I think we have to be
 prepared to write code that can understand old binary formats (on the
 tuple, page, or relation level) if we want this to work and work
 reliably.  I believe that there has been much resistance to that idea.

We keep looking for cheaper alternatives.  There may not be any...

regards, tom lane



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Andrew Dunstan



Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
 OK, fair enough.  My implication that only page formats were at issue
 was off-base.  My underlying point was that I think we have to be
 prepared to write code that can understand old binary formats (on the
 tuple, page, or relation level) if we want this to work and work
 reliably.  I believe that there has been much resistance to that idea.

 We keep looking for cheaper alternatives.  There may not be any...

Yeah. I think we might need to bite the bullet on that and start 
thinking more about different strategies for handling page versioning to 
satisfy various needs. I've been convinced for a while that some sort of 
versioning scheme is inevitable, but I do understand the reluctance.


cheers

andrew



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Bruce Momjian
Andrew Dunstan wrote:
 Tom Lane wrote:
  Robert Haas robertmh...@gmail.com writes:
  OK, fair enough.  My implication that only page formats were at issue
  was off-base.  My underlying point was that I think we have to be
  prepared to write code that can understand old binary formats (on the
  tuple, page, or relation level) if we want this to work and work
  reliably.  I believe that there has been much resistance to that idea.

  We keep looking for cheaper alternatives.  There may not be any...

 Yeah. I think we might need to bite the bullet on that and start
 thinking more about different strategies for handling page versioning to
 satisfy various needs. I've been convinced for a while that some sort of
 versioning scheme is inevitable, but I do understand the reluctance.

I always felt our final solution would be a combination of pg_migrator
for system catalog changes and page format conversion for page changes.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Bruce Momjian
Tom Lane wrote:
 Bruce Momjian br...@momjian.us writes:
  OK, here is another idea, maybe crazy:
 
  When we read in a page that has an invalid CRC, we check the page to see
  which hint bits are _not_ set, and we try setting them to see if can get
  a matching CRC.  If there no missing hint bits and the CRC doesn't
  match, we know the page is corrupt.  If two hint bits are missing, we
  can try setting one and both of them and see if can get a matching CRC. 
  If we can, the page is OK, if not, it is corrupt.
 
  Now if 32 hint bits are missing, but could be based on transaction
  status, then we would need 2^32 possible hint bit combinations, so we
  can't do the test and we just assume the page is valid.
 
 A typical page is going to have something like 100 tuples, so
 potentially 2^400 combinations to try.  I don't see this being
 realistic from that standpoint.  What's much worse is that to even
 find the potentially missing hint bits, you need to make very strong
 assumptions about the validity of the rest of the page.
 
 The suggestions that were made upthread about moving the hint bits
 could resolve the second objection, but once you do that you might
 as well just exclude them from the CRC and eliminate the guessing.

OK, crazy idea #3.  What if we had a per-page counter of the number of
hint bits set --- that way, we would only consider a CRC check failure
to be corruption if the count matched the hint bit count on the page.
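As a sketch, the guess-the-missing-hint-bits recovery from crazy idea #1
looks like this (Python; the (byte_offset, bitmask) candidate list is
invented for illustration, not the real tuple layout). The nested loops
make the exponential cost explicit: each additional unset candidate bit
doubles the number of CRCs to try.

```python
from itertools import combinations
import zlib

def recover_page(page: bytes, stored_crc: int, candidate_bits):
    """Try setting every subset of the candidate hint bits until the CRC
    matches.  With n candidates this tries up to 2^n combinations."""
    for r in range(len(candidate_bits) + 1):
        for subset in combinations(candidate_bits, r):
            trial = bytearray(page)
            for off, mask in subset:
                trial[off] |= mask
            if zlib.crc32(bytes(trial)) & 0xFFFFFFFF == stored_crc:
                return bytes(trial)  # found a consistent reading
    return None  # genuinely corrupt (or, as Greg notes, bits were *cleared*)
```

Note this only searches set-after-CRC bits; as Greg points out upthread,
bits that made it to disk while the old CRC did not would also have to be
tried in the cleared direction, doubling the search space again.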

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Tom Lane
Bruce Momjian br...@momjian.us writes:
 OK, crazy idea #3.  What if we had a per-page counter of the number of
 hint bits set --- that way, we would only consider a CRC check failure
 to be corruption if the count matched the hint bit count on the page.

Seems like rather a large hole in the ability to detect corruption.
In particular, this again assumes that you can accurately locate all
the hint bits in a page whose condition is questionable.  Pick up the
wrong bits, you'll come to the wrong conclusion --- and the default
behavior you propose here is the wrong result.

regards, tom lane



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Richard Huxton
Bruce Momjian wrote:
 Tom Lane wrote:

 The suggestions that were made upthread about moving the hint bits
 could resolve the second objection, but once you do that you might
 as well just exclude them from the CRC and eliminate the guessing.
 
 OK, crazy idea #3.  What if we had a per-page counter of the number of
 hint bits set --- that way, we would only consider a CRC check failure
 to be corruption if the count matched the hint bit count on the page.

Can I piggy-back on Bruce's crazy idea and ask a stupid question?

Why are we writing out the hint bits to disk anyway? Is it really so
slow to calculate them on read + cache them that it's worth all this
trouble? Are they not also to blame for the "write my import data twice"
feature?

-- 
  Richard Huxton
  Archonet Ltd



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Bruce Momjian
Tom Lane wrote:
 Bruce Momjian br...@momjian.us writes:
  OK, crazy idea #3.  What if we had a per-page counter of the number of
  hint bits set --- that way, we would only consider a CRC check failure
  to be corruption if the count matched the hint bit count on the page.
 
 Seems like rather a large hole in the ability to detect corruption.
 In particular, this again assumes that you can accurately locate all
 the hint bits in a page whose condition is questionable.  Pick up the
 wrong bits, you'll come to the wrong conclusion --- and the default
 behavior you propose here is the wrong result.

I was assuming any update of hint bits would update the per-page counter
so it would always be accurate.  However, I seem to remember we don't
lock the page when updating hint bits, so that wouldn't work.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Greg Stark
On Tue, Dec 1, 2009 at 9:57 PM, Richard Huxton d...@archonet.com wrote:
 Why are we writing out the hint bits to disk anyway? Is it really so
 slow to calculate them on read + cache them that it's worth all this
 trouble? Are they not also to blame for the "write my import data twice"
 feature?

It would be interesting to experiment with different strategies. But
the results would depend a lot on workloads and I doubt one strategy
is best for everyone.

It has often been suggested that we could set the hint bits but not
dirty the page, so they would never be written out unless some other
update hit the page. In most use cases that would probably result in
the right thing happening where we avoid half the writes but still
stop doing transaction status lookups relatively promptly. The scary
thing is that there might be use cases such as static data loaded
where the hint bits never get set and every scan of the page has to
recheck those statuses until the tuples are frozen.

(Not dirtying the page almost gets us out of the CRC problems -- it
doesn't in our current setup because we don't take a lock when setting
the hint bits, so you could set it on a page someone is in the middle
of CRC checking and writing. There were other solutions proposed for
that, including just making hint bits require locking the page or
double buffering the write.)
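The double-buffering variant mentioned in that parenthetical can be
sketched in a few lines (Python; the CRC-in-the-first-4-bytes layout is
invented for illustration):

```python
import struct
import zlib

def write_page_double_buffered(shared_page: bytearray, write) -> int:
    """Snapshot the shared buffer first, then CRC and write the private
    copy.  A hint bit set concurrently on shared_page after the snapshot
    cannot make the written CRC inconsistent with the written bytes."""
    snapshot = bytes(shared_page)           # private copy of the page
    body = snapshot[4:]                     # everything but the CRC field
    crc = zlib.crc32(body) & 0xFFFFFFFF
    write(struct.pack("<I", crc) + body)    # CRC always matches the body written
    return crc
```

The cost is one extra page copy per write, in exchange for not having to
take a lock against hint-bit setters during the CRC computation.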

There does need to be something like the hint bits which does
eventually have to be set because we can't keep transaction
information around forever. Even if you keep the transaction
information all the way back to the last freeze date (up to about 1GB
and change I think) then the data has to be written twice, the second
time is to freeze the transactions. In the worst case then reading a
page requires a random page access (or two) from anywhere in that 1GB+
file for each tuple on the page (whether visible to us or not).
-- 
greg



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread decibel

On Dec 1, 2009, at 1:39 PM, Kevin Grittner wrote:
 Josh Berkus j...@agliodbs.com wrote:
  And a lot of our biggest users are having issues; it seems pretty
  much guaranteed that if you have more than 20 postgres servers, at
  least one of them will have bad memory, bad RAID and/or a bad
  driver.

 Huh?!?  We have about 200 clusters running on about 100 boxes, and
 we see that very rarely.  On about 100 older boxes, relegated to
 less critical tasks, we see a failure maybe three or four times per
 year.  It's usually not subtle, and a sane backup and redundant
 server policy has kept us from suffering much pain from these.  I'm
 not questioning the value of adding features to detect corruption,
 but your numbers are hard to believe.

That's just your experience. Others have had different experiences.

And honestly, bickering about exact numbers misses Josh's point
completely. Postgres is seriously lacking in its ability to detect
hardware problems, and hardware *does fail*. And you can't just
assume that when it fails it blows up completely.

We really do need some capability for detecting errors.

  The problem I have with CRC checks is that they only detect bad
  I/O and are completely unable to detect data corruption due to bad
  memory. This means that really we want a different solution which
  can detect both bad RAM and bad I/O, and should only fall back on
  CRC checks if we're unable to devise one.

 md5sum of each tuple?  As an optional system column (a la oid)?

That's a possibility.

As Josh mentioned, some people will pay a serious performance hit to  
ensure that their data is safe and correct. The CRC proposal was  
intended as a middle of the road approach that would at least tell  
you that your hardware was probably OK. There's certainly more that  
could be done.


Also, I think some means of detecting torn pages would be very  
welcome. If this was done at the storage manager level it would  
probably be fairly transparent to the rest of the code.

  checking data format for readable pages and tuples (and index
  nodes) both before and after write to disk

 Given that PostgreSQL goes through the OS, and many of us are using
 RAID controllers with BBU RAM, how do you do a read with any
 confidence that it came from the disk?  (I mean, I know how to do
 that for a performance test, but as a routine step during production
 use?)

You'd probably need to go to some kind of stand-alone or background  
process that slowly reads and verifies the entire database.  
Unfortunately at that point you could only detect corruption and not  
correct it, but it'd still be better than nothing.

--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Bruce Momjian
Greg Stark wrote:
 On Tue, Dec 1, 2009 at 6:41 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Bruce Momjian br...@momjian.us writes:
  OK, here is another idea, maybe crazy:
 
  When we read in a page that has an invalid CRC, we check the page to see
  which hint bits are _not_ set, and we try setting them to see if can get
  a matching CRC.
 
 Unfortunately you would also have to try *unsetting* every hint bit as
 well since the updated hint bits might have made it to disk but not
 the CRC leaving the old CRC for the block with the unset bits.
 
 I actually independently had the same thought today that Simon had of
 moving the hint bits to the line pointer. We can obtain more free bits
 in the line pointers by dividing the item offsets and sizes by
 maxalign if we need it. That should give at least 4 spare bits which
 is all we need for the four VALID/INVALID hint bits.
 
 It should be relatively cheap to skip the hint bits in the line
 pointers since they'll be the same bits of every 16-bit value for a
 whole range. Alternatively we could just CRC the tuples and assume a
 corrupted line pointer will show itself quickly. That would actually
 make it faster than a straight CRC of the whole block -- making
 lemonade out of lemons as it were.

Yea, I am thinking we would have to have the hint bits in the line
pointers --- if not, we would have to reserve a lot of free space to
hold the maximum number of tuple hint bits --- seems like a waste.

I also like the idea that we don't need to CRC check the line pointers
because any corruption there is going to appear immediately.  However,
the bad news is that we wouldn't find the corruption until we try to
access bad data and might crash.
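Skipping the line-pointer hint bits during the CRC can be sketched like
this (Python; the 16-bit words and the low-4-bit hint mask are assumptions
following Greg's description above, not the real ItemId layout):

```python
import struct
import zlib

HINT_MASK = 0x000F  # assumed: low 4 bits of each line-pointer word carry hints

def crc_excluding_hints(header: bytes, lp_words: bytes, rest: bytes) -> int:
    """CRC a page while zeroing the assumed hint bits in every 16-bit
    line-pointer word, so concurrent hint-bit updates can't change the CRC."""
    masked = bytearray(lp_words)
    for off in range(0, len(lp_words), 2):
        w, = struct.unpack_from("<H", lp_words, off)
        struct.pack_into("<H", masked, off, w & ~HINT_MASK)
    return zlib.crc32(header + bytes(masked) + rest) & 0xFFFFFFFF
```

Since the excluded bits sit at the same position in every word, the
masking pass is a cheap, predictable stride over the pointer array.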

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Richard Huxton
Greg Stark wrote:
 On Tue, Dec 1, 2009 at 9:57 PM, Richard Huxton d...@archonet.com wrote:
 Why are we writing out the hint bits to disk anyway? Is it really so
 slow to calculate them on read + cache them that it's worth all this
 trouble? Are they not also to blame for the "write my import data twice"
 feature?
 
 It would be interesting to experiment with different strategies. But
 the results would depend a lot on workloads and I doubt one strategy
 is best for everyone.
 
 It has often been suggested that we could set the hint bits but not
 dirty the page, so they would never be written out unless some other
 update hit the page. In most use cases that would probably result in
 the right thing happening where we avoid half the writes but still
 stop doing transaction status lookups relatively promptly. The scary
 thing is that there might be use cases such as static data loaded
 where the hint bits never get set and every scan of the page has to
 recheck those statuses until the tuples are frozen.

And how scary is that? Assuming we cache the hints...
1. With the page itself, so same lifespan
2. Separately, perhaps with a different (longer) lifespan.

Separately would then let you trade complexity for compactness - all of
block B is deleted, all of table T is visible.

So what is the cost of calculating the hint-bits for a whole block of
tuples in one go vs reading that block from actual spinning disk?

 There does need to be something like the hint bits which does
 eventually have to be set because we can't keep transaction
 information around forever. Even if you keep the transaction
 information all the way back to the last freeze date (up to about 1GB
 and change I think) then the data has to be written twice, the second
 time is to freeze the transactions. In the worst case then reading a
 page requires a random page access (or two) from anywhere in that 1GB+
 file for each tuple on the page (whether visible to us or not).

While on that topic - I'm assuming freezing requires substantially more
effort than updating hint bits?

-- 
  Richard Huxton
  Archonet Ltd



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Tom Lane
Bruce Momjian br...@momjian.us writes:
 Greg Stark wrote:
 It should be relatively cheap to skip the hint bits in the line
 pointers since they'll be the same bits of every 16-bit value for a
 whole range. Alternatively we could just CRC the tuples and assume a
 corrupted line pointer will show itself quickly. That would actually
 make it faster than a straight CRC of the whole block -- making
 lemonade out of lemons as it were.

I don't think relatively cheap is the right criterion here --- the
question to me is how many assumptions are you making in order to
compute the page's CRC.  Each assumption degrades the reliability
of the check, not to mention creating another maintenance hazard.

 Yea, I am thinking we would have to have the hint bits in the line
 pointers --- if not, we would have to reserve a lot of free space to
 hold the maximum number of tuple hint bits --- seems like a waste.

Not if you're willing to move the line pointers around.  I'd envision
an extra pointer in the page header, with a layout along the lines of

fixed-size page header
hint bits
line pointers
free space
tuples proper
special space

with the CRC covering everything except the hint bits and perhaps the
free space (depending on whether you wanted to depend on two more
pointers to be right).  We would have to move the line pointers anytime
we needed to grow the hint-bit space, and there would be a
straightforward tradeoff between how often to move the pointers versus
how much potentially-wasted space we leave at the end of the hint area.

Or we could put the hint bits after the pointers, which might be better
because the hints would be smaller == cheaper to move.
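A CRC that follows this layout, skipping the hint area and the free
space, could be sketched as follows (Python; the four 16-bit header
offsets are invented for illustration):

```python
import struct
import zlib

# Hypothetical header: (hint_start, hint_end, lower, upper) 16-bit offsets,
# with a movable hint-bit area between the fixed header and the pointers.
HDR_FMT = "<HHHH"
HDR_LEN = struct.calcsize(HDR_FMT)

def page_crc(page: bytes) -> int:
    """CRC everything except the hint-bit area [hint_start, hint_end)
    and the free space [lower, upper), per the layout above."""
    hint_start, hint_end, lower, upper = struct.unpack_from(HDR_FMT, page)
    covered = (page[:hint_start]       # fixed-size page header
               + page[hint_end:lower]  # line pointers
               + page[upper:])         # tuples proper + special space
    return zlib.crc32(covered) & 0xFFFFFFFF
```

The cost of excluding free space is that two more header pointers have
to be trusted, which is exactly the trade-off noted above.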

 I also like the idea that we don't need to CRC check the line pointers
 because any corruption there is going to appear immediately.  However,
 the bad news is that we wouldn't find the corruption until we try to
 access bad data and might crash.

That sounds exactly like the corruption detection system we have now.
If you think that behavior is acceptable, we can skip this whole
discussion.

regards, tom lane



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Bruce Momjian
Tom Lane wrote:
 Bruce Momjian br...@momjian.us writes:
  Greg Stark wrote:
  It should be relatively cheap to skip the hint bits in the line
  pointers since they'll be the same bits of every 16-bit value for a
  whole range. Alternatively we could just CRC the tuples and assume a
  corrupted line pointer will show itself quickly. That would actually
  make it faster than a straight CRC of the whole block -- making
  lemonade out of lemons as it were.
 
 I don't think relatively cheap is the right criterion here --- the
 question to me is how many assumptions are you making in order to
 compute the page's CRC.  Each assumption degrades the reliability
 of the check, not to mention creating another maintenance hazard.
 
  Yea, I am thinking we would have to have the hint bits in the line
  pointers --- if not, we would have to reserve a lot of free space to
  hold the maximum number of tuple hint bits --- seems like a waste.
 
 Not if you're willing to move the line pointers around.  I'd envision
 an extra pointer in the page header, with a layout along the lines of
 
   fixed-size page header
   hint bits
   line pointers
   free space
   tuples proper
   special space
 
 with the CRC covering everything except the hint bits and perhaps the
 free space (depending on whether you wanted to depend on two more
 pointers to be right).  We would have to move the line pointers anytime
 we needed to grow the hint-bit space, and there would be a
 straightforward tradeoff between how often to move the pointers versus
 how much potentially-wasted space we leave at the end of the hint area.

I assume you don't want the hint bits in the line pointers because we
would need to lock the page?

 Or we could put the hint bits after the pointers, which might be better
 because the hints would be smaller == cheaper to move.

I don't see the value there because you would need to move the hint bits
every time you added a new line pointer.  The bigger problem is that you
would need to lock the page to update the hint bits if they move around
on the page.

  I also like the idea that we don't need to CRC check the line pointers
  because any corruption there is going to appear immediately.  However,
  the bad news is that we wouldn't find the corruption until we try to
  access bad data and might crash.
 
 That sounds exactly like the corruption detection system we have now.
 If you think that behavior is acceptable, we can skip this whole
 discussion.

Agreed, hence the bad part.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
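The layout sketched in this exchange can be simulated in a few lines. This is an illustrative sketch, not PostgreSQL code; the page size, header size, and hint-area bounds below are invented. It just shows the property being discussed: a CRC computed over everything except a header-delimited hint-bit region ignores hint-bit churn while still catching corruption elsewhere on the page.

```python
# Illustrative sketch of a page whose hint bits live in their own region
# between the fixed header and the line pointers, with the CRC skipping
# that region. Offsets and sizes are hypothetical.
import zlib

PAGE_SIZE = 8192
HEADER_SIZE = 24          # hypothetical fixed page header
HINT_START = HEADER_SIZE  # hint-bit area begins right after the header

def page_crc(page: bytes, hint_end: int) -> int:
    """CRC everything except the hint-bit region [HINT_START, hint_end)."""
    covered = page[:HINT_START] + page[hint_end:]
    return zlib.crc32(covered)

page = bytearray(PAGE_SIZE)
hint_end = HINT_START + 8          # room for 32 tuples at 2 hint bits each
crc_before = page_crc(page, hint_end)

# Flipping a hint bit does not change the CRC ...
page[HINT_START] |= 0x01
assert page_crc(page, hint_end) == crc_before

# ... but corrupting anything outside the hint area does.
page[hint_end] ^= 0xFF
assert page_crc(page, hint_end) != crc_before
```

The cost of the scheme is visible here too: `hint_end` must be read from the header before checksumming, which is exactly the "trust in the page header's pointers" objection raised later in the thread.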


Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Tom Lane
Richard Huxton d...@archonet.com writes:
 So what is the cost of calculating the hint-bits for a whole block of
 tuples in one go vs reading that block from actual spinning disk?

Potentially a couple of hundred times worse, if you're unlucky and each
XID on the page requires visiting a different block of clog that's also
not in memory.  The average case probably isn't that bad, but I think
we'd likely be talking at least a factor of two penalty --- you'd be
hopelessly optimistic to assume you didn't need at least one clog visit
per page.

Also, if you want to assume that you're lucky and the XIDs mostly fall
within a fairly recent range of clog pages, you're still not out of the
woods.  In that situation what you are talking about is a spectacular
increase in the hit rate for cached clog pages --- which are already a
known contention bottleneck in many scenarios.

 While on that topic - I'm assuming freezing requires substantially more
 effort than updating hint bits?

It's a WAL-logged page change, so at minimum double the cost.

regards, tom lane



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Tom Lane
Bruce Momjian br...@momjian.us writes:
 Tom Lane wrote:
 I don't think relatively cheap is the right criterion here --- the
 question to me is how many assumptions are you making in order to
 compute the page's CRC.  Each assumption degrades the reliability
 of the check, not to mention creating another maintenance hazard.

 I assume you don't want the hint bits in the line pointers because we
 would need to lock the page?

No, I don't want them there because I don't want the CRC check to know
so much about the page layout.

 Or we could put the hint bits after the pointers, which might be better
 because the hints would be smaller == cheaper to move.

 I don't see the value there because you would need to move the hint bits
 every time you added a new line pointer.

No, we could add unused line pointers in multiples, exactly the same as
we would add unused hint bits in multiples if we did it the other way.
I don't know offhand which would be more efficient, but you can't just
dismiss one without analysis.

 The bigger problem is that you
 would need to lock the page to update the hint bits if they move around
 on the page.

We are already assuming that things aren't moving around when we update
a hint bit now.  That's what the requirement of shared buffer lock when
calling tqual.c is for.

regards, tom lane



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Greg Stark
On Tue, Dec 1, 2009 at 10:47 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Bruce Momjian br...@momjian.us writes:
 Greg Stark wrote:
 It should be relatively cheap to skip the hint bits in the line
 pointers since they'll be the same bits of every 16-bit value for a
 whole range. Alternatively we could just CRC the tuples and assume a
 corrupted line pointer will show itself quickly. That would actually
 make it faster than a straight CRC of the whole block -- making
 lemonade out of lemons as it were.

 I don't think relatively cheap is the right criterion here --- the
 question to me is how many assumptions are you making in order to
 compute the page's CRC.  Each assumption degrades the reliability
 of the check, not to mention creating another maintenance hazard.

Well the only assumption here is that we know where the line pointers
start and end. That sounds like the same level of assumption as your
structure with the line pointers moving around. I agree with your
general point though -- trying to skip the hint bits strewn around in
the tuples means that every line pointer had better be correct or
you'll be in trouble before you even get to the CRC check. Skipping
them in the line pointers just means applying a hard-coded mask
against each word in that region.

It seems to me adding a third structure on the page and then requiring
tqual to be able to find that doesn't significantly reduce the
complexity over having tqual be able to find the line pointers. And it
significantly increases the complexity of every other part of the
system which has to deal with a third structure on the page. And
adding and compacting the page becomes a lot more complex. I'm also
a bit leery about adding more line pointers than necessary because
even a small number of line pointers will mean you're likely to often
fit one fewer tuple on the page.

-- 
greg



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread decibel

On Dec 1, 2009, at 4:13 PM, Greg Stark wrote:
 On Tue, Dec 1, 2009 at 9:57 PM, Richard Huxton d...@archonet.com wrote:
  Why are we writing out the hint bits to disk anyway? Is it really so
  slow to calculate them on read + cache them that it's worth all this
  trouble? Are they not also to blame for the write my import data twice
  feature?
 
 It would be interesting to experiment with different strategies. But
 the results would depend a lot on workloads and I doubt one strategy
 is best for everyone.

I agree that we'll always have the issue with freezing. But I also
think it's time to revisit the whole idea of hint bits. AFAIK we only
keep at maximum 2B transactions, and each one takes 2 bits in CLOG.
So worst-case scenario, we're looking at 4G of clog. On modern
hardware, that's not a lot. And that's also assuming that we don't do
any kind of compression on that data (obviously we couldn't use just
any old compression algorithm, but there's certainly tricks that
could be used to reduce the size of this information).

I know this is something that folks at EnterpriseDB have looked at,
perhaps there's data they can share.

 It has often been suggested that we could set the hint bits but not
 dirty the page, so they would never be written out unless some other
 update hit the page. In most use cases that would probably result in
 the right thing happening where we avoid half the writes but still
 stop doing transaction status lookups relatively promptly. The scary
 thing is that there might be use cases such as static data loaded
 where the hint bits never get set and every scan of the page has to
 recheck those statuses until the tuples are frozen.
 
 (Not dirtying the page almost gets us out of the CRC problems -- it
 doesn't in our current setup because we don't take a lock when setting
 the hint bits, so you could set it on a page someone is in the middle
 of CRC checking and writing. There were other solutions proposed for
 that, including just making hint bits require locking the page or
 double buffering the write.)
 
 There does need to be something like the hint bits which does
 eventually have to be set because we can't keep transaction
 information around forever. Even if you keep the transaction
 information all the way back to the last freeze date (up to about 1GB
 and change I think) then the data has to be written twice, the second
 time is to freeze the transactions. In the worst case then reading a
 page requires a random page access (or two) from anywhere in that 1GB+
 file for each tuple on the page (whether visible to us or not).
 --
 greg




--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net
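The clog sizing in this message is easy to check on the back of an envelope, under the thread's stated assumption of 2 commit-status bits per transaction ID (i.e. 4 XIDs per byte). The full 32-bit XID space works out to 1 GiB, matching Greg's later "1GB and change" figure; the "2 billion transactions" case is about half that, so the "4G" above is likely counting bits rather than bytes.

```python
# Back-of-envelope clog sizing: 2 status bits per XID, so 4 XIDs per byte.
XIDS_PER_BYTE = 8 // 2

def clog_bytes(n_xids: int) -> int:
    return n_xids // XIDS_PER_BYTE

# Full 32-bit XID space: exactly 1 GiB of clog.
assert clog_bytes(2**32) == 1 << 30

# The "2 billion transactions" figure: about 477 MiB.
assert clog_bytes(2_000_000_000) == 500_000_000
```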





Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Tom Lane
Greg Stark gsst...@mit.edu writes:
 On Tue, Dec 1, 2009 at 10:47 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I don't think relatively cheap is the right criterion here --- the
 question to me is how many assumptions are you making in order to
 compute the page's CRC.  Each assumption degrades the reliability
 of the check, not to mention creating another maintenance hazard.

 Well the only assumption here is that we know where the line pointers
 start and end.

... and what they contain.  To CRC a subset of the page at all, we have
to put some amount of faith into the page header's pointers.  We can do
weak checks on those, but only weak ones.  If we process different parts
of the page differently, we're increasing our trust in those pointers
and reducing the quality of the CRC check.

 It seems to me adding a third structure on the page and then requiring
 tqual to be able to find that doesn't significantly reduce the
 complexity over having tqual be able to find the line pointers. And it
 significantly increases the complexity of every other part of the
 system which has to deal with a third structure on the page. And
 adding and compacting the page becomes a lot more complex.

The page compaction logic amounts to a grand total of two not-very-long
routines.  The vast majority of the code impact from this would be from
the problem of finding the out-of-line hint bits for a tuple, which as
you say appears about equivalently hard either way.  So I think keeping
the CRC logic as simple as possible is good from both a reliability and
performance standpoint.

regards, tom lane



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Greg Stark
On Wed, Dec 2, 2009 at 12:03 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Greg Stark gsst...@mit.edu writes:
 On Tue, Dec 1, 2009 at 10:47 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I don't think relatively cheap is the right criterion here --- the
 question to me is how many assumptions are you making in order to
 compute the page's CRC.  Each assumption degrades the reliability
 of the check, not to mention creating another maintenance hazard.

 Well the only assumption here is that we know where the line pointers
 start and end.

 ... and what they contain.  To CRC a subset of the page at all, we have
 to put some amount of faith into the page header's pointers.  We can do
 weak checks on those, but only weak ones.  If we process different parts
 of the page differently, we're increasing our trust in those pointers
 and reducing the quality of the CRC check.

I'm not sure we're on the same page.  As I understand it there are
three proposals on the table now:

1) set aside a section of the page to contain only non-checksummed
hint bits. That section has to be relocatable so the crc check would
have to read the start and end address of it from the page header.

2) store the hint bits in the line pointers and skip checking the line
pointers. In that case the crc check would skip any bytes between the
start of the line pointer array and pd_lower (or pd_upper? no point in
crc checking unused bytes is there?)

3) store the hint bits in the line pointers and apply a mask which
masks out the 4 hint bits in each 32-bit word in the region between
the start of the line pointers and pd_lower (or pd_upper again)

These three options all seem to have the same level of interdependence
for the crc check, namely they all depend one or two values in the
page header to specify a range of bytes in the block. None of them
depend on the contents of the line pointers themselves being correct,
only the one or two fields in the header specifying which range of
bytes the hint bits lie in.

For what it's worth I don't think decreasing the quality of the crc
check is actually valid. The bottom line is that in all of the above
options if any pointer is invalid we'll be CRCing a different set of
data from the set that originally went into calculating the stored CRC
so we'll be effectively computing a random value which will have a
1/2^32 chance of being the value stored in the CRC field regardless of
anything else.

-- 
greg
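Option (3) above is the easiest of the three to sketch. The following is an illustrative simulation, not PostgreSQL code: the bit positions, region bounds, and page size are invented. It shows the mechanism Greg describes, applying a fixed mask to every 32-bit word in the line-pointer region before checksumming, so that setting a hint bit leaves the CRC valid while any other single-bit change still invalidates it.

```python
# Sketch of CRC'ing a page while masking out a fixed set of hint bits at
# the same position in every 32-bit line pointer. Positions are invented.
import struct, zlib

HINT_MASK = 0x0000000F  # pretend the low 4 bits of each word are hint bits

def crc_with_masked_hints(page: bytes, lp_start: int, lp_end: int) -> int:
    n = (lp_end - lp_start) // 4
    words = struct.unpack_from(f'<{n}I', page, lp_start)
    masked = struct.pack(f'<{n}I', *(w & ~HINT_MASK for w in words))
    return zlib.crc32(page[:lp_start] + masked + page[lp_end:])

page = bytearray(8192)
lp_start, lp_end = 24, 24 + 4 * 16      # 16 hypothetical line pointers
before = crc_with_masked_hints(page, lp_start, lp_end)

page[lp_start] |= 0x08                  # set a hint bit: CRC unchanged
assert crc_with_masked_hints(page, lp_start, lp_end) == before

page[lp_start + 1] ^= 0x10              # flip a non-hint bit: CRC changes
assert crc_with_masked_hints(page, lp_start, lp_end) != before
```

As Greg notes, the dependence on the header is the same as in the other two options: only `lp_start` and `lp_end` are trusted, not the contents of the line pointers themselves.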



Re: [HACKERS] Block-level CRC checks

2009-12-01 Thread Aidan Van Dyk
* Greg Stark gsst...@mit.edu [091201 20:14]:
 
 I'm not sure we're on the same page.  As I understand it there are
 three proposals on the table now:
 
 1) set aside a section of the page to contain only non-checksummed
 hint bits. That section has to be relocatable so the crc check would
 have to read the start and end address of it from the page header.
 
 2) store the hint bits in the line pointers and skip checking the line
 pointers. In that case the crc check would skip any bytes between the
 start of the line pointer array and pd_lower (or pd_upper? no point in
 crc checking unused bytes is there?)
 
 3) store the hint bits in the line pointers and apply a mask which
 masks out the 4 hint bits in each 32-bit word in the region between
 the start of the line pointers and pd_lower (or pd_upper again)

I'm not intimately familiar with the innards of the pages, but I had
*thought* that the original suggestion of moving the hint bits was
purely to make sure that they are in the same filesystem block/disk
sector as the CRC.  That may not be possible, but *if* that's the case,
you avoid the torn-page problem, with only 1 minimal assumption:
  - the FS-block/disk-sector will write whole blocks at a time, or
likely be corrupt anyways

With my understanding of disks and platters, I'd assume that if you got
a partial sector written, and something prevented it from being
completely written, the missing part would be smeared with
corruption. And that would seem to hold with flash/SSDs too...

a.

-- 
Aidan Van Dyk Create like a god,
ai...@highrise.ca   command like a king,
http://www.highrise.ca/   work like a slave.




Re: [HACKERS] Block-level CRC checks

2009-11-30 Thread Simon Riggs
On Fri, 2008-10-17 at 12:26 -0300, Alvaro Herrera wrote:
 So this discussion died with no solution arising to the
 hint-bit-setting-invalidates-the-CRC problem.
 
 Apparently the only solution in sight is to WAL-log hint bits.  Simon
 opines it would be horrible from a performance standpoint to WAL-log
 every hint bit set, and I think we all agree with that.  So we need to
 find an alternative mechanism to WAL log hint bits.

It occurred to me that maybe we don't need to WAL-log the CRC checks.

Proposal

* We reserve enough space on a disk block for a CRC check. When a dirty
block is written to disk we calculate and annotate the CRC value, though
this is *not* WAL logged.

* In normal running we re-check the CRC when we read the block back into
shared_buffers.

* In recovery we will overwrite the last image of a block from WAL, so
we ignore the block CRC check, since the WAL record was already CRC
checked. If full_page_writes = off, we ignore and zero the block's CRC
for any block touched during recovery. We do those things because the
block CRC in the WAL is likely to be different to that on disk, due to
hints.

* We also re-check the CRC on a block immediately before we dirty the
block (for any reason). This minimises the possibility of in-memory data
corruption for blocks.

So in the typical case all blocks moving from disk -> memory and from
clean -> dirty are CRC checked. So in the case where we have
full_page_writes = on then we have a good CRC every time. In the
full_page_writes = off case we are exposed only on the blocks that
changed during last checkpoint cycle and only if we crash. That seems
good because most databases are up 99% of the time, so any corruptions
are likely to occur in normal running, not as a result of crashes.

This would be a run-time option.

Like it?

-- 
 Simon Riggs   www.2ndQuadrant.com
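The control flow Simon proposes can be modeled in miniature. This is a toy sketch under invented assumptions (CRC placed in the first 4 bytes of the block, a zero CRC meaning "ignore"); it is not PostgreSQL code. It shows the three paths: stamp the CRC on write-out without WAL logging, verify it on read-in, and zero it when recovery overwrites the block from an already-CRC-checked WAL image.

```python
# Toy model: CRC stamped on write, verified on read, zeroed by recovery.
import struct, zlib

CRC_OFF, CRC_LEN = 0, 4   # pretend the CRC lives in the first 4 bytes

def stamp_crc(block: bytearray) -> None:
    """Write path: annotate the block with a CRC of its payload."""
    struct.pack_into('<I', block, CRC_OFF, zlib.crc32(block[CRC_LEN:]))

def check_crc(block: bytes) -> bool:
    """Read path: a stored CRC of 0 means 'ignore' (zeroed by recovery)."""
    stored, = struct.unpack_from('<I', block, CRC_OFF)
    return stored == 0 or stored == zlib.crc32(block[CRC_LEN:])

disk = bytearray(8192)
disk[100] = 42            # some page contents
stamp_crc(disk)           # dirty block written out: CRC set, no WAL record
assert check_crc(disk)    # re-checked on the way back into shared_buffers

disk[100] = 43            # simulated on-disk corruption
assert not check_crc(disk)

# Recovery path: block overwritten from a CRC-checked WAL image, so the
# block-level CRC is zeroed rather than recomputed.
struct.pack_into('<I', disk, CRC_OFF, 0)
assert check_crc(disk)
```

The sketch also makes the exposure window visible: between recovery zeroing the CRC and the next clean -> dirty write, the block carries no usable checksum, which is the full_page_writes = off caveat in the proposal.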




Re: [HACKERS] Block-level CRC checks

2009-11-30 Thread Joshua D. Drake
On Mon, 2009-11-30 at 13:21 +, Simon Riggs wrote:
 On Fri, 2008-10-17 at 12:26 -0300, Alvaro Herrera wrote:
  So this discussion died with no solution arising to the
  hint-bit-setting-invalidates-the-CRC problem.
  
  Apparently the only solution in sight is to WAL-log hint bits.  Simon
  opines it would be horrible from a performance standpoint to WAL-log
  every hint bit set, and I think we all agree with that.  So we need to
  find an alternative mechanism to WAL log hint bits.
 
 It occurred to me that maybe we don't need to WAL-log the CRC checks.
 
 Proposal
 
 * We reserve enough space on a disk block for a CRC check. When a dirty
 block is written to disk we calculate and annotate the CRC value, though
 this is *not* WAL logged.
 
 * In normal running we re-check the CRC when we read the block back into
 shared_buffers.
 
 * In recovery we will overwrite the last image of a block from WAL, so
 we ignore the block CRC check, since the WAL record was already CRC
 checked. If full_page_writes = off, we ignore and zero the block's CRC
 for any block touched during recovery. We do those things because the
 block CRC in the WAL is likely to be different to that on disk, due to
 hints.
 
 * We also re-check the CRC on a block immediately before we dirty the
 block (for any reason). This minimises the possibility of in-memory data
 corruption for blocks.
 
 So in the typical case all blocks moving from disk -> memory and from
 clean -> dirty are CRC checked. So in the case where we have
 full_page_writes = on then we have a good CRC every time. In the
 full_page_writes = off case we are exposed only on the blocks that
 changed during last checkpoint cycle and only if we crash. That seems
 good because most databases are up 99% of the time, so any corruptions
 are likely to occur in normal running, not as a result of crashes.
 
 This would be a run-time option.
 
 Like it?
 

Just FYI, Alvaro is out of town and out of email access (almost
exclusively). It may take him another week or so to get back to this.

Joshua D. Drake



 -- 
  Simon Riggs   www.2ndQuadrant.com
 
 


-- 
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering
If the world pushes look it in the eye and GRR. Then push back harder. - 
Salamander



