Re: [HACKERS] Block-level CRC checks
On Fri, Dec 4, 2009 at 10:47 PM, Chuck McDevitt cmcdev...@greenplum.com wrote:
> A curiosity question regarding torn pages: How does this work on file systems that don't write in-place, but instead always do copy-on-write? My example would be Sun's ZFS file system (In Solaris BSD). Because of its snapshot rollback functionality, it never writes a page in-place, but instead always copies it to another place on disk. How does this affect the corruption caused by a torn write? Can we end up with horrible corruption on this type of filesystem where we wouldn't on normal file systems, where we are writing to a previously zeroed area on disk? Sorry if this is a stupid question... Hopefully somebody can reassure me that this isn't an issue.

It's not a stupid question. We're not 100% sure, but we believe ZFS doesn't need full page writes because it's immune to torn pages. I think the idea of ZFS is that the new, partially written page isn't visible because it's not linked into the tree until it's been completely written.

To me it appears this would depend on the drive system ordering writes very strictly, which seems hard to be sure is happening. Perhaps this is tied to the tricks they do to avoid contention on the root; if they do a write barrier before every root update, that seems like it should be sufficient to me, but I don't know at that level of detail.

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Block-level CRC checks
It can save space because the line pointers have less alignment requirements. But I don't see any point in the current state.

-- 
Greg

On 2009-12-04, at 3:48 PM, Tom Lane t...@sss.pgh.pa.us wrote:
> Greg Stark gsst...@mit.edu writes:
>> I'm not sure why I said including ctid. We would have to move everything transactional to the line pointer, including xmin, xmax, ctid, all the hint bits, the updated flags, hot flags, etc. The only things left in the tuple header would be things that have to be there such as HAS_OIDS, HAS_NULLS, natts, hoff, etc. It would be a pretty drastic change, though a fairly logical one. I recall someone actually submitted a patch to separate out the transactional bits anyways a while back, just to save a few bytes in in-memory tuples. If we could save on disk-space usage it would be a lot more compelling. But it doesn't look to me like it really saves enough often enough to be worth so much code churn.
>
> It would also break things for indexes, which don't need all that stuff in their line pointers. More to the point, moving the same bits to someplace else on the page doesn't save anything at all.
>
> regards, tom lane
Re: [HACKERS] Block-level CRC checks
On Dec 3, 2009, at 1:53 PM, Jonah H. Harris wrote:
> On Tue, Dec 1, 2009 at 1:27 PM, Joshua D. Drake j...@commandprompt.com wrote:
>> On Tue, 2009-12-01 at 13:20 -0500, Robert Haas wrote:
>>> Does $COMPETITOR offer this feature?
>>
>> My understanding is that MSSQL does. I am not sure about Oracle. Those are the only two I run into (I don't run into MySQL at all). I know others likely compete in the DB2 space.
>
> To my knowledge, MySQL, InnoDB, BerkeleyDB, solidDB, Oracle, SQL Server, Sybase, DB2, eXtremeDB, RDB, and Teradata all checksum pages.

So... now that the upgrade discussion seems to have died down... was any consensus reached on how to do said checksumming?

-- 
Jim C. Nasby, Database Architect  j...@nasby.net
512.569.9461 (cell)  http://jim.nasby.net
Re: [HACKERS] Block-level CRC checks
On Fri, 2009-12-04 at 03:32 -0600, decibel wrote:
> So... now that the upgrade discussion seems to have died down... was any consensus reached on how to do said checksumming?

Possibly. Please can you go through the discussion and pull out a balanced summary of how to proceed? I lost track a while back and I'm sure many others did also.

-- 
Simon Riggs  www.2ndQuadrant.com
Re: [HACKERS] Block-level CRC checks
Kevin,

> md5sum of each tuple? As an optional system column (a la oid)?

I am mainly an application programmer working with PostgreSQL, and I want to point out an additional usefulness of an md5sum of each tuple: it makes comparing table contents in replicated / related databases MUCH more feasible.

I am in the process of adding a user-space myhash column to all my applications' tables, filled by a trigger on insert / update. It really speeds up table comparison across databases, and it is very helpful in debugging replications.

Harald

-- 
GHUM Harald Massa
persuadere et programmare
Harald Armin Massa
Spielberger Straße 49
70435 Stuttgart
0173/9409607
no fx, no carrier pigeon
- %s is too gigantic of an industry to bend to the whims of reality
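[The comparison idea can be sketched outside the database. This is an illustrative model only, not Harald's actual trigger: each row gets a hash of its textual contents, and two replicas are diffed by comparing hashes keyed by primary key. The separator and hashing scheme are assumptions of the sketch.]

```python
import hashlib

def row_hash(row):
    """Hash a row's values the way a user-space hash column might:
    concatenate the textual form of every column and take an md5.
    The "|" separator is an arbitrary choice for this sketch."""
    text = "|".join("" if v is None else str(v) for v in row)
    return hashlib.md5(text.encode()).hexdigest()

def diff_tables(a, b):
    """Compare two tables (dicts of pk -> row tuple) by per-row hashes,
    returning the keys whose contents differ between the replicas."""
    return sorted(k for k in a.keys() | b.keys()
                  if k not in a or k not in b
                  or row_hash(a[k]) != row_hash(b[k]))

master  = {1: ("alice", 30), 2: ("bob", 25)}
replica = {1: ("alice", 30), 2: ("bob", 26), 3: ("carol", 41)}
print(diff_tables(master, replica))  # [2, 3]
```

Only the hashes need to cross the wire to find mismatched rows, which is the speedup being described.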
Re: [HACKERS] Block-level CRC checks
On Fri, Dec 4, 2009 at 9:34 AM, Simon Riggs si...@2ndquadrant.com wrote:
> Possibly. Please can you go through the discussion and pull out a balanced summary of how to proceed? I lost track a while back and I'm sure many others did also.

I summarized the three feasible plans I think I saw:
407d949e0912011713j63045989j67b7b343ef00c...@mail.gmail.com

-- 
greg
Re: [HACKERS] Block-level CRC checks
decibel wrote:
> So... now that the upgrade discussion seems to have died down... was any consensus reached on how to do said checksumming?

I think the hint bit has to be added to the item pointer, by using the offset bits that are already zero, according to Greg Stark. That solution leads to easy programming, no expanding hint bit array, and it is backward compatible so doesn't cause problems for pg_migrator.

-- 
Bruce Momjian  br...@momjian.us  http://momjian.us
EnterpriseDB  http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
Re: [HACKERS] Block-level CRC checks
On Fri, 2009-12-04 at 07:12 -0500, Bruce Momjian wrote:
> I think the hint bit has to be added to the item pointer, by using the offset bits that are already zero, according to Greg Stark. That solution leads to easy programming, no expanding hint bit array, and it is backward compatible so doesn't cause problems for pg_migrator.

Seems like a reasonable way forward.

As I pointed out here http://archives.postgresql.org/pgsql-hackers/2009-12/msg00056.php we only need to use 3 bits not 4, but it does limit tuple length to 4096 for all block sizes. (Two different options there for doing that.)

An added advantage of this approach is that the cachelines for the item pointer array will already be in CPU cache, so there is no additional access time when we set the hint bits when they are moved to their new position.

I should also point out that removing 4 bits from the tuple header would allow us to get rid of t_infomask2, reducing tuple length by a further 2 bytes.

-- 
Simon Riggs  www.2ndQuadrant.com
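[The "offset bits that are already zero" trick can be illustrated in a few lines. Because tuples are MAXALIGN'd, the low bits of any tuple offset are always zero and can carry flags; the MAXALIGN value and flag values below are assumptions of this sketch, not the actual ItemIdData layout.]

```python
MAXALIGN = 8  # assumed alignment: the low 3 bits of any tuple offset are 0

def pack(offset, hints):
    """Stash up to 3 hint bits in the always-zero low bits of an offset."""
    assert offset % MAXALIGN == 0 and 0 <= hints < MAXALIGN
    return offset | hints

def unpack(word):
    """Recover (offset, hints) from the packed word."""
    return word & ~(MAXALIGN - 1), word & (MAXALIGN - 1)

# e.g. hypothetical XMIN_COMMITTED|XMAX_COMMITTED flags packed with an offset
word = pack(8144, 0b101)
print(unpack(word))  # (8144, 5)
```

The point of the scheme is that the stored word is still a valid offset once the low bits are masked off, so no extra space is consumed.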
Re: [HACKERS] Block-level CRC checks
BTW with VACUUM FULL removed I assume we're going to get rid of HEAP_MOVED_IN and HEAP_MOVED_OFF too, right?

-- 
Alvaro Herrera  http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] Block-level CRC checks
Simon Riggs wrote:
> Seems like a reasonable way forward.
>
> As I pointed out here http://archives.postgresql.org/pgsql-hackers/2009-12/msg00056.php we only need to use 3 bits not 4, but it does limit tuple length to 4096 for all block sizes. (Two different options there for doing that.)
>
> An added advantage of this approach is that the cachelines for the item pointer array will already be in CPU cache, so there is no additional access time when we set the hint bits when they are moved to their new position.
>
> I should also point out that removing 4 bits from the tuple header would allow us to get rid of t_infomask2, reducing tuple length by a further 2 bytes.

Wow, that is a nice win. Does alignment allow us to actually use that space?

-- 
Bruce Momjian  br...@momjian.us  http://momjian.us
EnterpriseDB  http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
Re: [HACKERS] Block-level CRC checks
On Fri, 2009-12-04 at 07:54 -0500, Bruce Momjian wrote:
>> I should also point out that removing 4 bits from the tuple header would allow us to get rid of t_infomask2, reducing tuple length by a further 2 bytes.
>
> Wow, that is a nice win. Does alignment allow us to actually use that space?

It would mean that tables up to 24 columns wide would still have a 24-byte header, whereas anything over 8 columns now has to fit in 32 bytes. So in practical terms most tables would benefit in your average database.

-- 
Simon Riggs  www.2ndQuadrant.com
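[The rough arithmetic behind that claim, under the assumptions of a 23-byte header today (21 bytes without t_infomask2), a one-bit-per-column null bitmap appended to the header, and 8-byte MAXALIGN:]

```python
def maxalign(n, a=8):
    """Round n up to the next multiple of the platform alignment."""
    return (n + a - 1) // a * a

def header_size(base, ncols):
    """Aligned tuple-header size including a null bitmap of one bit per
    column (i.e. assuming the row contains at least one null)."""
    return maxalign(base + (ncols + 7) // 8)

for ncols in (8, 9, 24, 25):
    print(ncols, header_size(23, ncols), header_size(21, ncols))
```

With a 23-byte base, 9 or more columns push the header to 32 bytes; with a 21-byte base the header stays at 24 bytes all the way up to 24 columns.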
Re: [HACKERS] Block-level CRC checks
On Fri, 2009-12-04 at 09:52 -0300, Alvaro Herrera wrote:
> BTW with VACUUM FULL removed I assume we're going to get rid of HEAP_MOVED_IN and HEAP_MOVED_OFF too, right?

Much as I would like to see those go, no. VF code should remain for some time yet, IMHO. We could remove it, but doing so is not a priority because it buys us nothing in terms of features, and it's the type of thing we should do at the start of a release cycle, not the end. I certainly don't have time to do it, at least.

-- 
Simon Riggs  www.2ndQuadrant.com
Re: [HACKERS] Block-level CRC checks
Simon Riggs wrote:
> On Fri, 2009-12-04 at 09:52 -0300, Alvaro Herrera wrote:
>> BTW with VACUUM FULL removed I assume we're going to get rid of HEAP_MOVED_IN and HEAP_MOVED_OFF too, right?
>
> Much as I would like to see those go, no. VF code should remain for some time yet, IMHO.

I don't think we need to keep VF code otherwise, but I would leave HEAP_MOVED_IN/OFF support alone for now for in-place upgrade. Otherwise we need a pre-upgrade script or something to scrub them off.

-- 
Heikki Linnakangas
EnterpriseDB  http://www.enterprisedb.com
Re: [HACKERS] Block-level CRC checks
On Fri, Dec 4, 2009 at 12:57 PM, Simon Riggs si...@2ndquadrant.com wrote:
>> Wow, that is a nice win. Does alignment allow us to actually use that space?
>
> It would mean that tables up to 24 columns wide would still be 24 bytes wide, whereas 8 columns now has to fit in 32 bytes. So in practical terms most tables would benefit in your average database.

I don't think getting rid of infomask2 wins us 2 bytes so fast. The rest of those two bytes is natts, which of course we still need.

If we lose vacuum full then the table's open for reducing the width of command id too if we need more bits. If we do that, and we moved everything we could to the line pointers including ctid, we might just be able to squeeze the tuple overhead down to 16 bytes. That would win 8 bytes per tuple for people with no null columns, or with nulls and a total of 9-64 columns, but if they have 1-8 columns and any are null it would actually consume more space. But it looks to me like it would be very, very tight and require drastic measures -- I think we would be left with something like 11 bits for commandid and no spare bits in the tuple header at all.

-- 
greg
Re: [HACKERS] Block-level CRC checks
On Fri, Dec 4, 2009 at 1:35 PM, Greg Stark gsst...@mit.edu wrote:
> If we lose vacuum full then the table's open for reducing the width of command id too if we need more bits. If we do that, and we moved everything we could to the line pointers including ctid, we might just be able to squeeze the tuple overhead down to 16 bytes.

I'm not sure why I said including ctid. We would have to move everything transactional to the line pointer, including xmin, xmax, ctid, all the hint bits, the updated flags, hot flags, etc. The only things left in the tuple header would be things that have to be there such as HAS_OIDS, HAS_NULLS, natts, hoff, etc.

It would be a pretty drastic change, though a fairly logical one. I recall someone actually submitted a patch to separate out the transactional bits anyways a while back, just to save a few bytes in in-memory tuples. If we could save on disk-space usage it would be a lot more compelling. But it doesn't look to me like it really saves enough often enough to be worth so much code churn.

-- 
greg
Re: [HACKERS] Block-level CRC checks
Greg Stark escribió:
> I'm not sure why I said including ctid. We would have to move everything transactional to the line pointer, including xmin, xmax, ctid, all the hint bits, the updated flags, hot flags, etc. The only things left in the tuple header would be things that have to be there such as HAS_OIDS, HAS_NULLS, natts, hoff, etc. It would be a pretty drastic change, though a fairly logical one.

Do we need XMAX_EXCL_LOCK and XMAX_SHARED_LOCK to be moved? It seems to me that they can stay with the tuple header because they are set by WAL-logged operations. Same for XMAX_IS_MULTI. The HASfoo bits are all set on tuple creation, never touched later, so they can stay in the header too.

We only need XMIN_COMMITTED, XMIN_INVALID, XMAX_COMMITTED, XMAX_INVALID, HEAP_COMBOCID on the line pointer AFAICS ... oh, and HEAP_HOT_UPDATED and HEAP_ONLY_TUPLE, not sure.

-- 
Alvaro Herrera  http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] Block-level CRC checks
Heikki Linnakangas escribió:
> I don't think we need to keep VF code otherwise, but I would leave HEAP_MOVED_IN/OFF support alone for now for in-place upgrade. Otherwise we need a pre-upgrade script or something to scrub them off.

CRCs are going to need scrubbing anyway, no? Oh, but you're assuming that CRCs are optional, so not everybody would need that, right?

-- 
Alvaro Herrera  http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Re: [HACKERS] Block-level CRC checks
On Fri, Dec 4, 2009 at 9:48 AM, Alvaro Herrera alvhe...@commandprompt.com wrote:
> CRCs are going to need scrubbing anyway, no? Oh, but you're assuming that CRCs are optional, so not everybody would need that, right?

If we can make not only the validity but also the presence of the CRC field optional, it will simplify things greatly for in-place upgrade, I think, because the upgrade won't itself require expanding the page. Turning on the CRC functionality for a particular table may require expanding the page, but that's a different problem. :-)

Have we thought about what other things have changed between 8.4 and 8.5 that might cause problems for in-place upgrade?

...Robert
Re: [HACKERS] Block-level CRC checks
Simon Riggs si...@2ndquadrant.com writes:
> As I pointed out here http://archives.postgresql.org/pgsql-hackers/2009-12/msg00056.php we only need to use 3 bits not 4, but it does limit tuple length to 4096 for all block sizes. (Two different options there for doing that.)

Limiting the tuple length is a deal-breaker.

regards, tom lane
Re: [HACKERS] Block-level CRC checks
Greg Stark gsst...@mit.edu writes:
> I'm not sure why I said including ctid. We would have to move everything transactional to the line pointer, including xmin, xmax, ctid, all the hint bits, the updated flags, hot flags, etc. The only things left in the tuple header would be things that have to be there such as HAS_OIDS, HAS_NULLS, natts, hoff, etc. It would be a pretty drastic change, though a fairly logical one. I recall someone actually submitted a patch to separate out the transactional bits anyways a while back, just to save a few bytes in in-memory tuples. If we could save on disk-space usage it would be a lot more compelling. But it doesn't look to me like it really saves enough often enough to be worth so much code churn.

It would also break things for indexes, which don't need all that stuff in their line pointers. More to the point, moving the same bits to someplace else on the page doesn't save anything at all.

regards, tom lane
Re: [HACKERS] Block-level CRC checks
On Fri, 2009-12-04 at 10:43 -0500, Tom Lane wrote:
> Simon Riggs si...@2ndquadrant.com writes:
>> As I pointed out here http://archives.postgresql.org/pgsql-hackers/2009-12/msg00056.php we only need to use 3 bits not 4, but it does limit tuple length to 4096 for all block sizes. (Two different options there for doing that.)
>
> Limiting the tuple length is a deal-breaker.

If people who use 32kB block sizes exist in practice: I note that because tuples are at least 4-byte aligned, the low 2 bits of the length are always unused. So they're available for those with strangely long tuples, and can be used to signify high-order bytes, so max tuple length could be 16384. With tuples that long, it would be better to assume 8-byte minimum alignment, which would put max tuple length back up to 32kB again. None of that need affect people with a standard 8192-byte blocksize.

-- 
Simon Riggs  www.2ndQuadrant.com
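[The idea of reclaiming always-zero alignment bits to extend a length field can be sketched as follows; the 12-bit field width is an assumption chosen to match the 4096-byte limit under discussion, not the actual lp_len layout.]

```python
def encode_len(length, align=4, field_bits=12):
    """Sketch: with align-byte-aligned tuple lengths, the low bits of the
    length are always zero, so storing length // align in a field of
    field_bits bits represents lengths up to align * (2**field_bits - 1)
    instead of only 2**field_bits - 1."""
    assert length % align == 0
    code = length // align
    assert code < (1 << field_bits)
    return code

def decode_len(code, align=4):
    return code * align

print(decode_len(encode_len(16380)))                      # 16380
print(decode_len(encode_len(32760, align=8), align=8))    # 32760
```

With 4-byte alignment the same 12-bit field reaches ~16kB; assuming 8-byte alignment doubles that to ~32kB, which is the trade-off Simon describes.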
Re: [HACKERS] Block-level CRC checks
Robert Haas robertmh...@gmail.com writes:
> Have we thought about what other things have changed between 8.4 and 8.5 that might cause problems for in-place upgrade?

So far, nothing. We even made Andrew Gierth jump through hoops to keep hstore's on-disk representation upwards compatible.

regards, tom lane
Re: [HACKERS] Block-level CRC checks
On Fri, 2009-12-04 at 13:35 +0000, Greg Stark wrote:
> I don't think getting rid of infomask2 wins us 2 bytes so fast. The rest of those two bytes is natts, which of course we still need.

err, yes, OK.

-- 
Simon Riggs  www.2ndQuadrant.com
Re: [HACKERS] Block-level CRC checks
Robert Haas wrote:
> On Fri, Dec 4, 2009 at 9:48 AM, Alvaro Herrera alvhe...@commandprompt.com wrote:
>> CRCs are going to need scrubbing anyway, no? Oh, but you're assuming that CRCs are optional, so not everybody would need that, right?
>
> If we can make not only the validity but also the presence of the CRC field optional, it will simplify things greatly for in-place upgrade, I think, because the upgrade won't itself require expanding the page. Turning on the CRC functionality for a particular table may require expanding the page, but that's a different problem. :-)

Well, I am not sure how we would turn the _space_ used for CRC on and off, because you would have to rewrite the entire table/database to turn it on, which seems unfortunate.

-- 
Bruce Momjian  br...@momjian.us  http://momjian.us
EnterpriseDB  http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
Re: [HACKERS] Block-level CRC checks
On Fri, Dec 4, 2009 at 2:04 PM, Bruce Momjian br...@momjian.us wrote:
> Well, I am not sure how we would turn the _space_ used for CRC on and off, because you would have to rewrite the entire table/database to turn it on, which seems unfortunate.

Well, presumably you're going to have to do some of that work anyway, because even if the space is set aside you're still going to have to read the page in, CRC it, and write it back out. However, if the space is not pre-allocated then you also have to deal with moving tuples to other pages. But that problem is going to have to be dealt with somewhere along the line no matter what we do, because if you're upgrading an 8.3 or 8.4 system to 8.5, you need to add that space sometime: either before migration (with a pre-upgrade utility), or after migration (by some sort of page converter/tuple mover), or only when/if enabling the CRC feature.

One nice thing about making it the CRC feature's problem to make space on each page is that people who don't want to use CRCs can still use those extra 4 bytes/page for data. That might not be worth the code complexity if we were starting from scratch, but I'm thinking that most of the code complexity is a given if we want to also support in-place upgrade.

...Robert
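[A sketch of how a page checksum can exclude the checksum field itself, so that validity (and, with more work, presence) can be optional: compute the CRC over the page with the field zeroed out. The field's offset and zlib's CRC-32 are stand-ins for whatever layout and algorithm would actually be chosen.]

```python
import zlib

PAGE_SIZE = 8192
CRC_OFF, CRC_LEN = 8, 4   # hypothetical location of the checksum field

def page_crc(page: bytes) -> int:
    """CRC of a page with the checksum field itself zeroed out."""
    assert len(page) == PAGE_SIZE
    masked = page[:CRC_OFF] + b"\x00" * CRC_LEN + page[CRC_OFF + CRC_LEN:]
    return zlib.crc32(masked)

def write_crc(page: bytearray) -> None:
    """Stamp the page's CRC into its checksum field."""
    page[CRC_OFF:CRC_OFF + CRC_LEN] = page_crc(bytes(page)).to_bytes(4, "little")

def verify(page: bytes) -> bool:
    stored = int.from_bytes(page[CRC_OFF:CRC_OFF + CRC_LEN], "little")
    return stored == page_crc(page)

page = bytearray(PAGE_SIZE)
page[100] = 0xAB
write_crc(page)
print(verify(bytes(page)))   # True
page[200] ^= 0x01            # simulated on-disk corruption
print(verify(bytes(page)))   # False
```

Because the field is masked before computing, an all-zero (absent) checksum can be treated as "not checksummed" without perturbing the calculation.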
Re: [HACKERS] Block-level CRC checks
Massa, Harald Armin wrote:
> I am in the process of adding a user-space myhash column to all my applications' tables, filled by a trigger on insert / update. It really speeds up table comparison across databases, and it is very helpful in debugging replications.

Have you seen pg_comparator?

-- 
Alvaro Herrera  http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Re: [HACKERS] Block-level CRC checks
Robert Haas wrote:
>> Well, I am not sure how we would turn the _space_ used for CRC on and off, because you would have to rewrite the entire table/database to turn it on, which seems unfortunate.
>
> Well, presumably you're going to have to do some of that work anyway, because even if the space is set aside you're still going to have to read the page in, CRC it, and write it back out. However, if the space is not pre-allocated then you also have to deal with moving tuples to other pages. But that problem is going to have to be dealt with somewhere along the line no matter what we do, because if you're upgrading an 8.3 or 8.4 system to 8.5, you need to add that space sometime: either before migration (with a pre-upgrade utility), or after migration (by some sort of page converter/tuple mover), or only when/if enabling the CRC feature.
>
> One nice thing about making it the CRC feature's problem to make space on each page is that people who don't want to use CRCs can still use those extra 4 bytes/page for data. That might not be worth the code complexity if we were starting from scratch, but I'm thinking that most of the code complexity is a given if we want to also support in-place upgrade.

My guess is we can find somewhere on a 8.4 heap/index page to add four bytes.

-- 
Bruce Momjian  br...@momjian.us  http://momjian.us
EnterpriseDB  http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
Re: [HACKERS] Block-level CRC checks
A curiosity question regarding torn pages: How does this work on file systems that don't write in-place, but instead always do copy-on-write? My example would be Sun's ZFS file system (In Solaris BSD). Because of its snapshot rollback functionality, it never writes a page in-place, but instead always copies it to another place on disk.

How does this affect the corruption caused by a torn write? Can we end up with horrible corruption on this type of filesystem where we wouldn't on normal file systems, where we are writing to a previously zeroed area on disk?

Sorry if this is a stupid question... Hopefully somebody can reassure me that this isn't an issue.
Re: [HACKERS] Block-level CRC checks
On Fri, 2009-12-04 at 14:47 -0800, Chuck McDevitt wrote:
> Sorry if this is a stupid question... Hopefully somebody can reassure me that this isn't an issue.

Think we're still good. Not a stupid question.

Hint bits are set while the block is in shared_buffers, and setting a hint bit dirties the page but does not write WAL. Because the page is dirty, we re-write the whole block at checkpoint, by bgwriter cleaning, or via dirty page eviction. So ZFS is OK, but we do more writing than we want to, sometimes.

-- 
Simon Riggs  www.2ndQuadrant.com
Re: [HACKERS] Block-level CRC checks
>> I am in the process of adding a user-space myhash column to all my applications' tables, filled by a trigger on insert / update. It really speeds up table comparison across databases, and it is very helpful in debugging replications.
>
> Have you seen pg_comparator?

Yes, I saw the lightning talk at pgday.eu. It also uses md5 hashes, just in its own schema. I guess pg_comparator would profit from an integrated MD5 hash.

Harald

-- 
GHUM Harald Massa
persuadere et programmare
Harald Armin Massa
Spielberger Straße 49
70435 Stuttgart
0173/9409607
no fx, no carrier pigeon
- %s is too gigantic of an industry to bend to the whims of reality
Re: [HACKERS] Block-level CRC checks
On Tue, Dec 1, 2009 at 1:27 PM, Joshua D. Drake j...@commandprompt.com wrote:
> On Tue, 2009-12-01 at 13:20 -0500, Robert Haas wrote:
>> Does $COMPETITOR offer this feature?
>
> My understanding is that MSSQL does. I am not sure about Oracle. Those are the only two I run into (I don't run into MySQL at all). I know others likely compete in the DB2 space.

To my knowledge, MySQL, InnoDB, BerkeleyDB, solidDB, Oracle, SQL Server, Sybase, DB2, eXtremeDB, RDB, and Teradata all checksum pages.

-- 
Jonah H. Harris, Senior DBA
myYearbook.com
Re: [HACKERS] Block-level CRC checks
On tis, 2009-12-01 at 19:41 +, Greg Stark wrote: Also, it would require reading back each page as it's written to disk, which is OK for a bunch of single-row writes, but for bulk data loads a significant problem. Not sure what that really means for Postgres. It would just mean reading back the same page of memory from the filesystem cache that we just wrote. Surely the file system ought to be the place to solve this. After all, we don't put link-level corruption detection into the libpq protocol either.
Re: [HACKERS] Block-level CRC checks
On tis, 2009-12-01 at 17:47 -0500, Tom Lane wrote: Bruce Momjian br...@momjian.us writes: I also like the idea that we don't need to CRC check the line pointers because any corruption there is going to appear immediately. However, the bad news is that we wouldn't find the corruption until we try to access bad data and might crash. That sounds exactly like the corruption detection system we have now. If you think that behavior is acceptable, we can skip this whole discussion. I think one of the motivations for this CRC business was to detect corruption in the user data. As you say, we already handle corruption in the metadata.
Re: [HACKERS] Block-level CRC checks
Simon Riggs wrote: There is no creation of corruption events. This scheme detects corruption events that *have* occurred. Now I understand that we previously would have recovered seamlessly from such events, but they were corruption events nonetheless and I think they need to be reported. (For why, see Conclusion #2, below). No, you're still missing the point. The point is *not* random bit errors affecting hint bits, but the torn page problem. Today, a torn page is a completely valid and expected behavior from the OS and storage subsystem. We handle it with full_page_writes, and by relying on the fact that it's OK for a hint bit set to get lost. With your scheme, a torn page would become a corrupt page. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Block-level CRC checks
On Tue, 2009-12-01 at 10:04 +0200, Heikki Linnakangas wrote: Simon Riggs wrote: There is no creation of corruption events. This scheme detects corruption events that *have* occurred. Now I understand that we previously would have recovered seamlessly from such events, but they were corruption events nonetheless and I think they need to be reported. (For why, see Conclusion #2, below). No, you're still missing the point. The point is *not* random bit errors affecting hint bits, but the torn page problem. Today, a torn page is a completely valid and expected behavior from the OS and storage subsystem. We handle it with full_page_writes, and by relying on the fact that it's OK for a hint bit set to get lost. With your scheme, a torn page would become a corrupt page. Well, it's easy to keep going on about how much you think I misunderstand. But I think that's just misdirection. The way we handle torn page corruptions *hides* actual corruptions from us. The frequency of true positives and false positives is important here. If the false positive ratio is very small, then reporting them is not a problem because of the benefit we get from having spotted the true positives. Some convicted murderers didn't do it, but that is not an argument for letting them all go free (without knowing the details). So we need to know what the false positive ratio is before we evaluate the benefit of either reporting or non-reporting possible corruption events. When do you think torn pages happen? Only at crash, or other times also? Do they always happen at crash? Are there ways to re-check a block that has suffered a hint-related torn page issue? Are there ways to isolate and minimise the reporting of false positives? Those are important questions and this is not black and white. If the *only* answer really is we-must-WAL-log everything, then that is the answer, as an option. I suspect that there is a less strict possibility, if we question our assumptions and look at the frequencies. 
We know that I have no time to work on this; I am just trying to hold open the door to a few possibilities that we have not fully considered in a balanced way. And I myself am guilty of having slammed the door previously. I encourage development of a way forward based upon a balance of utility. -- Simon Riggs www.2ndQuadrant.com
Re: [HACKERS] Block-level CRC checks
Simon Riggs wrote: The way we handle torn page corruptions *hides* actual corruptions from us. The frequency of true positives and false positives is important here. If the false positive ratio is very small, then reporting them is not a problem because of the benefit we get from having spotted the true positives. Some convicted murderers didn't do it, but that is not an argument for letting them all go free (without knowing the details). So we need to know what the false positive ratio is before we evaluate the benefit of either reporting or non-reporting possible corruption events. When do you think torn pages happen? Only at crash, or other times also? Do they always happen at crash? Are there ways to re-check a block that has suffered a hint-related torn page issue? Are there ways to isolate and minimise the reporting of false positives? Those are important questions and this is not black and white. If the *only* answer really is we-must-WAL-log everything, then that is the answer, as an option. I suspect that there is a less strict possibility, if we question our assumptions and look at the frequencies. We know that I have no time to work on this; I am just trying to hold open the door to a few possibilities that we have not fully considered in a balanced way. And I myself am guilty of having slammed the door previously. I encourage development of a way forward based upon a balance of utility. I think the problem boils down to what the user response should be to a corruption report. If it is a torn page, it would be corrected and the user doesn't have to do anything. If it is something that is not correctable, then the user has corruption and/or bad hardware. I think the problem is that the existing proposal can't distinguish between these two cases so the user has no idea how to respond to the report. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. 
+
Re: [HACKERS] Block-level CRC checks
On Tue, 2009-12-01 at 06:35 -0500, Bruce Momjian wrote: Simon Riggs wrote: The way we handle torn page corruptions *hides* actual corruptions from us. The frequency of true positives and false positives is important here. If the false positive ratio is very small, then reporting them is not a problem because of the benefit we get from having spotted the true positives. Some convicted murderers didn't do it, but that is not an argument for letting them all go free (without knowing the details). So we need to know what the false positive ratio is before we evaluate the benefit of either reporting or non-reporting possible corruption events. When do you think torn pages happen? Only at crash, or other times also? Do they always happen at crash? Are there ways to re-check a block that has suffered a hint-related torn page issue? Are there ways to isolate and minimise the reporting of false positives? Those are important questions and this is not black and white. If the *only* answer really is we-must-WAL-log everything, then that is the answer, as an option. I suspect that there is a less strict possibility, if we question our assumptions and look at the frequencies. We know that I have no time to work on this; I am just trying to hold open the door to a few possibilities that we have not fully considered in a balanced way. And I myself am guilty of having slammed the door previously. I encourage development of a way forward based upon a balance of utility. I think the problem boils down to what the user response should be to a corruption report. If it is a torn page, it would be corrected and the user doesn't have to do anything. If it is something that is not correctable, then the user has corruption and/or bad hardware. I think the problem is that the existing proposal can't distinguish between these two cases so the user has no idea how to respond to the report. 
If 99.5% of cases are real corruption then there is little need to distinguish between the cases, nor much value in doing so. The prevalence of the different error types is critical to understanding how to respond. If a man pulls a gun on you, your first thought isn't some people remove guns from their jacket to polish them, so perhaps he intends to polish it now because the prevalence of shootings is high, when faced by people with guns, and the risk of dying is also high. You make a judgement based upon the prevalence and the risk. That is all I am asking for us to do here, make a balanced call. These recent comments are a change in my own position, based upon evaluating the prevalence and the risk. I ask others to consider the same line of thought rather than a black/white assessment. All useful detection mechanisms have non-zero false positives because we would rather sometimes ring the bell for no reason than to let bad things through silently, as we do now. -- Simon Riggs www.2ndQuadrant.com
Re: [HACKERS] Block-level CRC checks
Simon Riggs wrote: I think the problem is that the existing proposal can't distinguish between these two cases so the user has no idea how to respond to the report. If 99.5% of cases are real corruption then there is little need to distinguish between the cases, nor much value in doing so. The prevalence of the different error types is critical to understanding how to respond. If a man pulls a gun on you, your first thought isn't some people remove guns from their jacket to polish them, so perhaps he intends to polish it now because the prevalence of shootings is high, when faced by people with guns, and the risk of dying is also high. You make a judgement based upon the prevalence and the risk. That is all I am asking for us to do here, make a balanced call. These recent comments are a change in my own position, based upon evaluating the prevalence and the risk. I ask others to consider the same line of thought rather than a black/white assessment. All useful detection mechanisms have non-zero false positives because we would rather sometimes ring the bell for no reason than to let bad things through silently, as we do now. OK, but what happens if someone gets the failure report, assumes their hardware is faulty and replaces it, and then gets a failure report again? I assume torn pages are 99% of the reported problem, which are expected and are fixed, and bad hardware 1%, quite the opposite of your numbers above. What might be interesting is to report CRC mismatches if the database was shut down cleanly previously; I think in those cases we shouldn't have torn pages. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Re: [HACKERS] Block-level CRC checks
On Mon, Nov 30, 2009 at 3:27 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Simon Riggs wrote: Proposal * We reserve enough space on a disk block for a CRC check. When a dirty block is written to disk we calculate and annotate the CRC value, though this is *not* WAL logged. Imagine this: 1. A hint bit is set. It is not WAL-logged, but the page is dirtied. 2. The buffer is flushed out of the buffer cache to the OS. A new CRC is calculated and stored on the page. 3. Half of the page is flushed to disk (aka torn page problem). The CRC made it to disk but the flipped hint bit didn't. You now have a page with incorrect CRC on disk. This is probably a stupid question, but why doesn't the other half of the page make it to disk? Somebody pulls the plug first? ...Robert
Re: [HACKERS] Block-level CRC checks
Robert Haas wrote: On Mon, Nov 30, 2009 at 3:27 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Simon Riggs wrote: Proposal * We reserve enough space on a disk block for a CRC check. When a dirty block is written to disk we calculate and annotate the CRC value, though this is *not* WAL logged. Imagine this: 1. A hint bit is set. It is not WAL-logged, but the page is dirtied. 2. The buffer is flushed out of the buffer cache to the OS. A new CRC is calculated and stored on the page. 3. Half of the page is flushed to disk (aka torn page problem). The CRC made it to disk but the flipped hint bit didn't. You now have a page with incorrect CRC on disk. This is probably a stupid question, but why doesn't the other half of the page make it to disk? Somebody pulls the plug first? Yep, disk writes happen in 512-byte sectors, so you might get only some of a page's 16 512-byte sectors to disk, or a 512-byte sector might be partially written. Full page writes fix these on recovery. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
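The sector-granularity failure Bruce describes is easy to model. A minimal sketch (illustrative layout, not PostgreSQL's actual page format): an 8K page carries a CRC of its body in its first bytes, a hint bit is flipped, the CRC is restamped, and only the first 512-byte sector reaches disk before the plug is pulled, leaving a new CRC stored over old data.

```python
import zlib

SECTOR = 512
PAGE = 8192  # 16 sectors per page

def write_page_with_crc(page: bytearray) -> bytearray:
    """Store a CRC of the page body in the first 4 bytes (layout is illustrative)."""
    crc = zlib.crc32(bytes(page[4:]))
    page[0:4] = crc.to_bytes(4, "little")
    return page

def crc_ok(page: bytes) -> bool:
    return int.from_bytes(page[0:4], "little") == zlib.crc32(bytes(page[4:]))

# Clean page, written fully: CRC verifies.
old = write_page_with_crc(bytearray(PAGE))
assert crc_ok(old)

# A "hint bit" in the last sector is set and the CRC recomputed ...
new = bytearray(old)
new[PAGE - 1] |= 0x01
write_page_with_crc(new)

# ... but only the first sector reaches disk before the crash.
torn = bytes(new[:SECTOR]) + bytes(old[SECTOR:])
assert not crc_ok(torn)  # new CRC over old hint bit: a spurious corruption report
```

This is exactly Heikki's scenario upthread: nothing malfunctioned at the storage level, yet the CRC check would report the page as corrupt.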
Re: [HACKERS] Block-level CRC checks
On Tue, 2009-12-01 at 07:05 -0500, Bruce Momjian wrote: I assume torn pages are 99% of the reported problem, which are expected and are fixed, and bad hardware 1%, quite the opposite of your numbers above. On what basis do you make that assumption? -- Simon Riggs www.2ndQuadrant.com
Re: [HACKERS] Block-level CRC checks
Simon Riggs wrote: On Tue, 2009-12-01 at 07:05 -0500, Bruce Momjian wrote: I assume torn pages are 99% of the reported problem, which are expected and are fixed, and bad hardware 1%, quite the opposite of your numbers above. On what basis do you make that assumption? Because we added full page write protection to fix the reported problem of torn pages, which we had on occasion; now we don't. Bad hardware reports are less frequent. And we know we can reproduce torn pages by shutting off power to a server without battery-backed cache. We don't know how to produce I/O failures on demand. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Re: [HACKERS] Block-level CRC checks
bruce wrote: What might be interesting is to report CRC mismatches if the database was shut down cleanly previously; I think in those cases we shouldn't have torn pages. Sorry, stupid idea on my part. We don't WAL log hint bit changes so there is no guarantee the page is in WAL on recovery. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Re: [HACKERS] Block-level CRC checks
On Tue, 2009-12-01 at 07:58 -0500, Bruce Momjian wrote: bruce wrote: What might be interesting is to report CRC mismatches if the database was shut down cleanly previously; I think in those cases we shouldn't have torn pages. Sorry, stupid idea on my part. We don't WAL log hint bit changes so there is no guarantee the page is in WAL on recovery. I thought it was a reasonable idea. We would need to re-check CRCs after a crash and zero any that mismatched. Then we can start checking them again as we run. In any case, it seems strange to do nothing to protect the database in normal running just because there is one type of problem that occurs when we crash. -- Simon Riggs www.2ndQuadrant.com
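Simon's fsck-style pass could look roughly like this. A hypothetical sketch (simplified 64-byte "pages" with a 4-byte CRC header, not PostgreSQL's real layout) that scans pages after a crash, reports mismatches, and re-stamps them so normal checking can resume:

```python
import zlib

def check_and_reset(pages):
    """Post-crash pass: verify each page's CRC, report mismatches,
    then re-stamp them so normal checking can resume (Simon's 'reset')."""
    suspect = []
    for i, page in enumerate(pages):
        stored = int.from_bytes(page[0:4], "little")
        actual = zlib.crc32(bytes(page[4:]))
        if stored != actual:
            suspect.append(i)  # would be LOGged: torn hint-bit write OR real damage
            page[0:4] = actual.to_bytes(4, "little")  # reset: trust the page again
    return suspect

def stamp(page):
    page[0:4] = zlib.crc32(bytes(page[4:])).to_bytes(4, "little")
    return page

good = stamp(bytearray(64))
bad = stamp(bytearray(64))
bad[10] ^= 0xFF  # bits flip after the CRC was stamped (torn write or bad hardware)

assert check_and_reset([good, bad]) == [1]  # page 1 reported and re-stamped
assert check_and_reset([good, bad]) == []   # later runs see everything as clean
```

The weakness raised elsewhere in the thread is visible in the last line: a genuinely corrupt page scanned here is re-stamped and silently declared good.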
Re: [HACKERS] Block-level CRC checks
On Tue, 2009-12-01 at 07:42 -0500, Bruce Momjian wrote: Simon Riggs wrote: On Tue, 2009-12-01 at 07:05 -0500, Bruce Momjian wrote: I assume torn pages are 99% of the reported problem, which are expected and are fixed, and bad hardware 1%, quite the opposite of your numbers above. On what basis do you make that assumption? Because we added full page write protection to fix the reported problem of torn pages, which we had on occasion; now we don't. Bad hardware reports are less frequent. Bad hardware reports are infrequent because we lack a detection system for them, which is the topic of this thread. It would be circular to argue that as a case against. It's also an argument that only affects crashes. -- Simon Riggs www.2ndQuadrant.com
Re: [HACKERS] Block-level CRC checks
Bruce Momjian wrote: What might be interesting is to report CRC mismatches if the database was shut down cleanly previously; I think in those cases we shouldn't have torn pages. Unfortunately that's not true. You can crash, leading to a torn page, and then start up the database and shut it down cleanly. The torn page is still there, even though the last shutdown was a clean one. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Block-level CRC checks
On Mon, Nov 30, 2009 at 9:27 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Simon Riggs wrote: Proposal * We reserve enough space on a disk block for a CRC check. When a dirty block is written to disk we calculate and annotate the CRC value, though this is *not* WAL logged. Imagine this: 1. A hint bit is set. It is not WAL-logged, but the page is dirtied. 2. The buffer is flushed out of the buffer cache to the OS. A new CRC is calculated and stored on the page. 3. Half of the page is flushed to disk (aka torn page problem). The CRC made it to disk but the flipped hint bit didn't. You now have a page with incorrect CRC on disk. What if we treated the hint bits as all-zeros for the purpose of CRC calculation? This would exclude them from the checksum. Greetings Marcin Mańk
Re: [HACKERS] Block-level CRC checks
On Tuesday 01 December 2009 14:38:26 marcin mank wrote: On Mon, Nov 30, 2009 at 9:27 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Simon Riggs wrote: Proposal * We reserve enough space on a disk block for a CRC check. When a dirty block is written to disk we calculate and annotate the CRC value, though this is *not* WAL logged. Imagine this: 1. A hint bit is set. It is not WAL-logged, but the page is dirtied. 2. The buffer is flushed out of the buffer cache to the OS. A new CRC is calculated and stored on the page. 3. Half of the page is flushed to disk (aka torn page problem). The CRC made it to disk but the flipped hint bit didn't. You now have a page with incorrect CRC on disk. What if we treated the hint bits as all-zeros for the purpose of CRC calculation? This would exclude them from the checksum. That sounds like doing a complete copy of the wal page zeroing specific fields and then doing wal - rather expensive I would say. Both, during computing the checksum and checking it... Andres
Re: [HACKERS] Block-level CRC checks
* Andres Freund and...@anarazel.de [091201 08:42]: On Tuesday 01 December 2009 14:38:26 marcin mank wrote: On Mon, Nov 30, 2009 at 9:27 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Simon Riggs wrote: Proposal * We reserve enough space on a disk block for a CRC check. When a dirty block is written to disk we calculate and annotate the CRC value, though this is *not* WAL logged. Imagine this: 1. A hint bit is set. It is not WAL-logged, but the page is dirtied. 2. The buffer is flushed out of the buffer cache to the OS. A new CRC is calculated and stored on the page. 3. Half of the page is flushed to disk (aka torn page problem). The CRC made it to disk but the flipped hint bit didn't. You now have a page with incorrect CRC on disk. What if we treated the hint bits as all-zeros for the purpose of CRC calculation? This would exclude them from the checksum. That sounds like doing a complete copy of the wal page zeroing specific fields and then doing wal - rather expensive I would say. Both, during computing the checksum and checking it... No, it has nothing to do with WAL, it has to do with when writing pages out... You already double-buffer them (to avoid the page changing while you checksum it) before calling write, but the code writing (and then reading) pages doesn't currently have to know all the internal stuff needed to decide what's a hint bit and what's not... And adding that information into the buffer in/out would be a huge wart on the modularity of the PG code... a. -- Aidan Van Dyk ai...@highrise.ca http://www.highrise.ca/ Create like a god, command like a king, work like a slave.
Re: [HACKERS] Block-level CRC checks
On Tue, Dec 1, 2009 at 8:30 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Bruce Momjian wrote: What might be interesting is to report CRC mismatches if the database was shut down cleanly previously; I think in those cases we shouldn't have torn pages. Unfortunately that's not true. You can crash, leading to a torn page, and then start up the database and shut it down cleanly. The torn page is still there, even though the last shutdown was a clean one. Thinking through this, as I understand it, in order to prevent this problem, you'd need to be able to predict at recovery time which pages might have been torn by the unclean shutdown. In order to do that, you'd need to know which pages were waiting to be written to disk at the time of the shutdown. For ordinary page modifications, that's not a problem, because there will be WAL records for those pages that need to be replayed, and we could recompute the CRC at the same time. But for hint bit changes, there's no persistent state that would tell us which hint bits were in the midst of being flipped when the system went down, so the only way to make sure all the CRCs are correct would be to rescan every page in the entire cluster and recompute every CRC. Is that right? ...Robert
Re: [HACKERS] Block-level CRC checks
On Tuesday 01 December 2009 15:26:21 Aidan Van Dyk wrote: * Andres Freund and...@anarazel.de [091201 08:42]: On Tuesday 01 December 2009 14:38:26 marcin mank wrote: On Mon, Nov 30, 2009 at 9:27 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Simon Riggs wrote: Proposal * We reserve enough space on a disk block for a CRC check. When a dirty block is written to disk we calculate and annotate the CRC value, though this is *not* WAL logged. Imagine this: 1. A hint bit is set. It is not WAL-logged, but the page is dirtied. 2. The buffer is flushed out of the buffer cache to the OS. A new CRC is calculated and stored on the page. 3. Half of the page is flushed to disk (aka torn page problem). The CRC made it to disk but the flipped hint bit didn't. You now have a page with incorrect CRC on disk. What if we treated the hint bits as all-zeros for the purpose of CRC calculation? This would exclude them from the checksum. That sounds like doing a complete copy of the wal page zeroing specific fields and then doing wal - rather expensive I would say. Both, during computing the checksum and checking it... No, it has nothing to do with WAL, it has to do with when writing pages out... You already double-buffer them (to avoid the page changing while you checksum it) before calling write, but the code writing (and then reading) pages doesn't currently have to know all the internal stuff needed to decide what's a hint bit and what's not... err, yes. That WAL slipped in, sorry. But it would still either mean a third copy of the page or a rather complex jumping around on the page... Andres
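marcin's masked-CRC idea, and the extra page copy Andres and Aidan are debating, can be sketched like so. This is a hypothetical layout: the hard part in practice is the (offset, mask) list of hint-bit positions, since locating them requires exactly the tuple-level knowledge of the page that Aidan says the buffer I/O code doesn't have.

```python
import zlib

# Hypothetical hint-bit locations as (byte offset, bit mask) pairs.
HINT_BITS = [(63, 0x01)]

def masked_crc(page: bytes) -> int:
    buf = bytearray(page)              # the extra page copy being discussed
    for off, mask in HINT_BITS:
        buf[off] &= ~mask & 0xFF       # treat hint bits as all-zeros
    return zlib.crc32(bytes(buf[4:]))  # skip the 4-byte CRC field itself

page = bytearray(64)
page[0:4] = masked_crc(page).to_bytes(4, "little")

page[63] |= 0x01  # a hint bit flips after the CRC was stamped ...
assert masked_crc(page) == int.from_bytes(page[0:4], "little")  # ... still verifies

page[20] ^= 0x02  # a real data bit flip is still caught
assert masked_crc(page) != int.from_bytes(page[0:4], "little")
```

A torn hint-bit write no longer invalidates the checksum, at the cost of one full page copy per checksum computation and per verification, which is Andres's "rather expensive" objection.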
Re: [HACKERS] Block-level CRC checks
Robert Haas wrote: On Tue, Dec 1, 2009 at 8:30 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Bruce Momjian wrote: What might be interesting is to report CRC mismatches if the database was shut down cleanly previously; I think in those cases we shouldn't have torn pages. Unfortunately that's not true. You can crash, leading to a torn page, and then start up the database and shut it down cleanly. The torn page is still there, even though the last shutdown was a clean one. Thinking through this, as I understand it, in order to prevent this problem, you'd need to be able to predict at recovery time which pages might have been torn by the unclean shutdown. In order to do that, you'd need to know which pages were waiting to be written to disk at the time of the shutdown. For ordinary page modifications, that's not a problem, because there will be WAL records for those pages that need to be replayed, and we could recompute the CRC at the same time. But for hint bit changes, there's no persistent state that would tell us which hint bits were in the midst of being flipped when the system went down, so the only way to make sure all the CRCs are correct would be to rescan every page in the entire cluster and recompute every CRC. Is that right? Yep. Even if rescanning every page in the cluster was feasible from a performance point-of-view, it would make the CRC checking a lot less useful. It's not hard to imagine that when a hardware glitch happens causing corruption, it also causes the system to crash. Recalculating the CRCs after crash would mask the corruption. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Block-level CRC checks
On Tue, 2009-12-01 at 15:30 +0200, Heikki Linnakangas wrote: Bruce Momjian wrote: What might be interesting is to report CRC mismatches if the database was shut down cleanly previously; I think in those cases we shouldn't have torn pages. Unfortunately that's not true. You can crash, leading to a torn page, and then start up the database and shut it down cleanly. The torn page is still there, even though the last shutdown was a clean one. There seem to be two ways forward: journalling or fsck. We can either * WAL-log all changes to a page (journalling) (8-byte overhead) * After a crash disable CRC checks until a full database scan has either re-checked CRC or found CRC mismatch, report it in the LOG and then reset the CRC. (fsck) (8-byte overhead) Both of which can be optimised in various ways. Also, we might * Put all hint bits in the block header to allow them to be excluded more easily from CRC checking. If we used 3 more bits from ItemIdData.lp_len (limiting tuple length to 4096) then we could store some hints in the item pointer. HEAP_XMIN_INVALID can be stored as LP_DEAD, since that will happen very quickly anyway. -- Simon Riggs www.2ndQuadrant.com
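Simon's lp_len idea is plain bit arithmetic. A sketch of the packing he describes, assuming the usual ItemIdData layout (15-bit lp_off, 2-bit lp_flags, 15-bit lp_len; the real C bitfield ordering is compiler-dependent): stealing 3 bits from lp_len leaves 12 bits, capping tuple length at 4096 as he notes.

```python
# ItemIdData packs into 32 bits: lp_off:15 | lp_flags:2 | lp_len:15.
# Stealing 3 bits from lp_len for hint flags leaves 12 bits of length.
LEN_BITS = 15 - 3
assert (1 << LEN_BITS) == 4096  # max representable tuple length drops to 4 KB

def pack(lp_off, lp_flags, hints, lp_len):
    assert lp_len < (1 << LEN_BITS), "tuple too long for a 12-bit lp_len"
    return (lp_off << 17) | (lp_flags << 15) | (hints << LEN_BITS) | lp_len

def unpack(item):
    return (item >> 17,                       # lp_off
            (item >> 15) & 0x3,               # lp_flags (e.g. LP_DEAD)
            (item >> LEN_BITS) & 0x7,         # 3 stolen hint bits
            item & ((1 << LEN_BITS) - 1))     # lp_len, now 12 bits

item = pack(lp_off=128, lp_flags=1, hints=0b101, lp_len=3000)
assert unpack(item) == (128, 1, 0b101, 3000)
```

The trade-off is visible in the first assert: three hint bits per line pointer in exchange for halving the maximum tuple length an item pointer can describe.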
Re: [HACKERS] Block-level CRC checks
On Tue, Dec 1, 2009 at 9:40 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Even if rescanning every page in the cluster was feasible from a performance point-of-view, it would make the CRC checking a lot less useful. It's not hard to imagine that when a hardware glitch happens causing corruption, it also causes the system to crash. Recalculating the CRCs after crash would mask the corruption. Yeah. Thanks for the explanation - I think I understand the problem now. ...Robert
Re: [HACKERS] Block-level CRC checks
* Simon Riggs: * Put all hint bits in the block header to allow them to be excluded more easily from CRC checking. If we used 3 more bits from ItemIdData.lp_len (limiting tuple length to 4096) then we could store some hints in the item pointer. HEAP_XMIN_INVALID can be stored as LP_DEAD, since that will happen very quickly anyway. What about putting the whole visibility information out-of-line, into its own B-tree, indexed by page number? -- Florian Weimer fwei...@bfk.de BFK edv-consulting GmbH http://www.bfk.de/ Kriegsstraße 100 tel: +49-721-96201-1 D-76133 Karlsruhe fax: +49-721-96201-99
Re: [HACKERS] Block-level CRC checks
Florian Weimer fwei...@bfk.de writes: What about putting the whole visibility information out-of-line, into its own B-tree, indexed by page number? Hint bits need to be *cheap* to examine. Otherwise there's little point in having them at all. regards, tom lane
Re: [HACKERS] Block-level CRC checks
On Tue, 2009-12-01 at 16:40 +0200, Heikki Linnakangas wrote: It's not hard to imagine that when a hardware glitch happens causing corruption, it also causes the system to crash. Recalculating the CRCs after crash would mask the corruption. They are already masked from us, so continuing to mask those errors would not put us in a worse position. If we are saying that 99% of page corruptions are caused at crash time because of torn pages on hint bits, then only WAL logging can help us find the 1%. I'm not convinced that is an accurate or safe assumption and I'd at least like to see LOG entries showing what happened. ISTM we could go for two levels of protection. CRC checks and scanner for Level 1 protection, then full WAL logging for Level 2 protection. -- Simon Riggs www.2ndQuadrant.com
Re: [HACKERS] Block-level CRC checks
Simon Riggs si...@2ndquadrant.com writes: On Tue, 2009-12-01 at 16:40 +0200, Heikki Linnakangas wrote: It's not hard to imagine that when a hardware glitch happens causing corruption, it also causes the system to crash. Recalculating the CRCs after crash would mask the corruption. They are already masked from us, so continuing to mask those errors would not put us in a worse position. No, it would just destroy a large part of the argument for why this is worth doing. We detect disk errors ... except for ones that happen during a database crash. Say what? The fundamental problem with this is the same as it's been all along: the tradeoff between implementation work expended, performance overhead added, and net number of real problems detected (with a suitably large demerit for actually *introducing* problems) just doesn't look attractive. You can make various compromises that improve one or two of these factors at the cost of making the others worse, but at the end of the day I've still not seen a combination that seems worth doing. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Block-level CRC checks
On Tue, Dec 1, 2009 at 10:35 AM, Simon Riggs si...@2ndquadrant.com wrote: On Tue, 2009-12-01 at 16:40 +0200, Heikki Linnakangas wrote: It's not hard to imagine that when a hardware glitch happens causing corruption, it also causes the system to crash. Recalculating the CRCs after crash would mask the corruption. They are already masked from us, so continuing to mask those errors would not put us in a worse position. If we are saying that 99% of page corruptions are caused at crash time because of torn pages on hint bits, then only WAL logging can help us find the 1%. I'm not convinced that is an accurate or safe assumption and I'd at least like to see LOG entries showing what happened. It may or may not be true that most page corruptions happen at crash time, but it's certainly false that they are caused at crash time *because of torn pages on hint bits*. If only part of a block is written to disk and the unwritten parts contain hint-bit changes - that's not corruption. That's design behavior. Any CRC system needs to avoid complaining about errors when that happens because otherwise people will think that their database is corrupted and their hardware is faulty when in reality it is not. If we could find a way to put the hint bits in the same 512-byte block as the CRC, that might do it, but I'm not sure whether that is possible. Ignoring CRC errors after a crash until we've re-CRC'd the entire database will certainly eliminate the bogus error reports, but it seems likely to mask a large percentage of legitimate errors. For example, suppose that I write 1MB of data out to disk and then don't access it for a year. During that time the data is corrupted. Then the system crashes. Upon recovery, since there's no way of knowing whether hint bits on those pages were being updated at the time of the crash, the system re-CRC's the corrupted data and declares it known good. Six months later, I try to access the data and find out that it's bad. Sucks to be me. 
Now consider the following alternative scenario: I write the block to disk. Five minutes later, without an intervening crash, I read it back in and it's bad. Yeah, the system detects it. Which is more likely? I'm not an expert on disk failure modes, but my intuition is that the first one will happen often enough to make us look silly. Is it 10%? 20%? 50%? I don't know. But ISTM that a CRC system that has no ability to determine whether a system is still ok post-crash is not a compelling proposition, even though it might still be able to detect some problems. ...Robert -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
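Robert's torn-page false alarm can be simulated directly: a hint-bit-only change followed by a partial write leaves a page whose stored CRC no longer matches, even though nothing was really corrupted. A minimal sketch, assuming an 8192-byte page with a 4-byte CRC in the first 512-byte sector:

```python
import zlib

PAGE, SECTOR = 8192, 512

def write_crc(page):
    """Store the CRC of everything after the 4-byte CRC field."""
    page[0:4] = zlib.crc32(bytes(page[4:])).to_bytes(4, "little")

def crc_ok(page):
    return int.from_bytes(page[0:4], "little") == zlib.crc32(bytes(page[4:]))

page = bytearray(PAGE)
write_crc(page)

# In-memory copy: set a "hint bit" in a later sector, recompute the CRC.
mem = bytearray(page)
mem[4000] |= 0x01
write_crc(mem)

# Torn write: only the first sector (carrying the new CRC) reaches disk
# before a crash; the sector holding the hint-bit change does not.
disk = bytearray(page)
disk[0:SECTOR] = mem[0:SECTOR]

assert not crc_ok(disk)  # false alarm: the page is not really corrupt
```

The same mechanism run in reverse (data sector written, CRC sector not) is why Greg notes later in the thread that unsetting bits would also have to be tried.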
Re: [HACKERS] Block-level CRC checks
On Tue, 2009-12-01 at 07:05 -0500, Bruce Momjian wrote: All useful detection mechanisms have non-zero false positives because we would rather sometimes ring the bell for no reason than to let bad things through silently, as we do now. OK, but what happens if someone gets the failure report, assumes their hardware is faulty and replaces it, and then gets a failure report again? They are stupid? Nobody just replaces hardware. You test it. We can't fix stupid. Joshua D. Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564 Consulting, Training, Support, Custom Development, Engineering If the world pushes look it in the eye and GRR. Then push back harder. - Salamander -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Block-level CRC checks
On Tue, 2009-12-01 at 10:55 -0500, Tom Lane wrote: Simon Riggs si...@2ndquadrant.com writes: On Tue, 2009-12-01 at 16:40 +0200, Heikki Linnakangas wrote: It's not hard to imagine that when a hardware glitch happens causing corruption, it also causes the system to crash. Recalculating the CRCs after crash would mask the corruption. They are already masked from us, so continuing to mask those errors would not put us in a worse position. No, it would just destroy a large part of the argument for why this is worth doing. We detect disk errors ... except for ones that happen during a database crash. Say what? I know what I said sounds ridiculous, I'm just trying to keep my mind open about the tradeoffs. The way to detect 100% of corruptions is to WAL-log 100% of writes to blocks and we know that sucks performance - twas me that said it in the original discussion. I'm trying to explore whether we can detect 100% of other errors at some intermediate percentage of WAL-logging. If we decide that there isn't an intermediate position worth taking, I'm happy, as long as it was a fact-based decision. The fundamental problem with this is the same as it's been all along: the tradeoff between implementation work expended, performance overhead added, and net number of real problems detected (with a suitably large demerit for actually *introducing* problems) just doesn't look attractive. You can make various compromises that improve one or two of these factors at the cost of making the others worse, but at the end of the day I've still not seen a combination that seems worth doing. I agree. But also I do believe there are people that care enough about this to absorb a performance hit, and the new features in 8.5 will bring in a new crop of people that care about those things very much. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Block-level CRC checks
On Tue, 2009-12-01 at 10:55 -0500, Tom Lane wrote: Simon Riggs si...@2ndquadrant.com writes: On Tue, 2009-12-01 at 16:40 +0200, Heikki Linnakangas wrote: It's not hard to imagine that when a hardware glitch happens causing corruption, it also causes the system to crash. Recalculating the CRCs after crash would mask the corruption. They are already masked from us, so continuing to mask those errors would not put us in a worse position. No, it would just destroy a large part of the argument for why this is worth doing. We detect disk errors ... except for ones that happen during a database crash. Say what? The fundamental problem with this is the same as it's been all along: the tradeoff between implementation work expended, performance overhead added, and net number of real problems detected (with a suitably large demerit for actually *introducing* problems) just doesn't look attractive. You can make various compromises that improve one or two of these factors at the cost of making the others worse, but at the end of the day I've still not seen a combination that seems worth doing. Let me try a different but similar perspective. The problem we are trying to solve here only matters to a very small subset of the people actually using PostgreSQL. Specifically, a percentage that is using PostgreSQL in a situation where they can lose many thousands of dollars per minute or hour should an outage occur. On the other hand it is those very people that are *paying* people to try and implement these features. Kind of a catch-22. The hard core reality is this. *IF* it is one of the goals of this project to ensure that the software can be safely, effectively, and responsibly operated in a manner that is acceptable to C* level people in a Fortune level company then we *must* solve this problem. If it is not the goal of the project, leave it to EDB/CMD/2ndQuadrant to fork it because it will eventually happen. Our customers are demanding these features. Sincerely, Joshua D. 
Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564 Consulting, Training, Support, Custom Development, Engineering If the world pushes look it in the eye and GRR. Then push back harder. - Salamander -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Block-level CRC checks
Simon Riggs wrote: Also, we might * Put all hint bits in the block header to allow them to be excluded more easily from CRC checking. If we used 3 more bits from ItemIdData.lp_len (limiting tuple length to 4096) then we could store some hints in the item pointer. HEAP_XMIN_INVALID can be stored as LP_DEAD, since that will happen very quickly anyway. OK, here is another idea, maybe crazy: When we read in a page that has an invalid CRC, we check the page to see which hint bits are _not_ set, and we try setting them to see if we can get a matching CRC. If there are no missing hint bits and the CRC doesn't match, we know the page is corrupt. If two hint bits are missing, we can try setting one and both of them and see if we can get a matching CRC. If we can, the page is OK, if not, it is corrupt. Now if 32 hint bits are missing, but could be based on transaction status, then we would need 2^32 possible hint bit combinations, so we can't do the test and we just assume the page is valid. I have no idea what percentage of corruption this would detect, but it might have minimal overhead because the overhead only happens when we detect a non-matching CRC due to a crash of some sort. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Block-level CRC checks
On Tue, Dec 1, 2009 at 1:02 PM, Joshua D. Drake j...@commandprompt.com wrote: The hard core reality is this. *IF* it is one of the goals of this project to ensure that the software can be safely, effectively, and responsibly operated in a manner that is acceptable to C* level people in a Fortune level company then we *must* solve this problem. If it is not the goal of the project, leave it to EDB/CMD/2ndQuadrant to fork it because it will eventually happen. Our customers are demanding these features. OK, and when you fork it, how do you plan to implement it? The problem AFAICS is not that anyone hugely dislikes the feature; it's that nobody is really clear on how to implement it in a way that's actually useful. So far the only somewhat reasonable suggestions I've seen seem to be: 1. WAL-log setting the hint bits. If you don't like the resulting performance, shut off the feature. 2. Rearrange the page so that all the hint bits are in the first 512 bytes along with the CRC, so that there can be no torn pages. AFAICS, no one has rendered judgment on whether this is a feasible solution. Does $COMPETITOR offer this feature? ...Robert -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Block-level CRC checks
On Tue, 2009-12-01 at 13:05 -0500, Bruce Momjian wrote: Simon Riggs wrote: Also, we might * Put all hint bits in the block header to allow them to be excluded more easily from CRC checking. If we used 3 more bits from ItemIdData.lp_len (limiting tuple length to 4096) then we could store some hints in the item pointer. HEAP_XMIN_INVALID can be stored as LP_DEAD, since that will happen very quickly anyway. OK, here is another idea, maybe crazy: When there's nothing else left, crazy wins. When we read in a page that has an invalid CRC, we check the page to see which hint bits are _not_ set, and we try setting them to see if we can get a matching CRC. If there are no missing hint bits and the CRC doesn't match, we know the page is corrupt. If two hint bits are missing, we can try setting one and both of them and see if we can get a matching CRC. If we can, the page is OK, if not, it is corrupt. Now if 32 hint bits are missing, but could be based on transaction status, then we would need 2^32 possible hint bit combinations, so we can't do the test and we just assume the page is valid. I have no idea what percentage of corruption this would detect, but it might have minimal overhead because the overhead only happens when we detect a non-matching CRC due to a crash of some sort. Perhaps we could store a sector-based parity bit for each 512 bytes in the block: set the parity bit if an even number of hint bits is set, unset it if odd. So whenever we set a hint bit we flip the parity bit for that sector. That way we could detect which sectors are potentially missing in an effort to minimize the number of combinations we need to test. That would require only 16 bits for an 8192-byte block; we store it next to the CRC, so we know that was never altered separately. So total 6 byte overhead. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
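Simon's per-sector parity scheme, as a sketch (hypothetical layout): one parity bit per 512-byte sector tracks whether an even or odd number of hint bits has been flipped there since the CRC was last computed, so after a crash only the sectors whose stored parity disagrees need their hint bits re-guessed.

```python
PAGE, SECTOR = 8192, 512
NSECTORS = PAGE // SECTOR   # 16 sectors -> 16 parity bits (2 bytes)

parity = [0] * NSECTORS     # imagined to live next to the CRC in the header

def set_hint_bit(page, byte_off, bit):
    """Set a hint bit and flip the parity bit for its sector."""
    page[byte_off] |= bit
    parity[byte_off // SECTOR] ^= 1

page = bytearray(PAGE)
set_hint_bit(page, 600, 0x01)   # sector 1
set_hint_bit(page, 700, 0x02)   # sector 1 again: parity flips back to 0
set_hint_bit(page, 5000, 0x04)  # sector 9

# Only sector 9 would need its hint bits re-guessed after a torn write.
assert parity == [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
```

Note the limitation Tom raises next: narrowing the search space doesn't restore the confidence lost by guessing at all.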
Re: [HACKERS] Block-level CRC checks
On Tue, 2009-12-01 at 13:20 -0500, Robert Haas wrote: On Tue, Dec 1, 2009 at 1:02 PM, Joshua D. Drake j...@commandprompt.com wrote: The hard core reality is this. *IF* it is one of the goals of this project to ensure that the software can be safely, effectively, and responsibly operated in a manner that is acceptable to C* level people in a Fortune level company then we *must* solve this problem. If it is not the goal of the project, leave it to EDB/CMD/2ndQuadrant to fork it because it will eventually happen. Our customers are demanding these features. OK, and when you fork it, how do you plan to implement it? Hey man, I am not an engineer :P. You know that. I am just speaking to the pressures that some of us are having in the marketplace about these types of features. AFAICS, no one has rendered judgment on whether this is a feasible solution. Does $COMPETITOR offer this feature? My understanding is that MSSQL does. I am not sure about Oracle. Those are the only two I run into (I don't run into MySQL at all). I know others likely compete in the DB2 space. Sincerely, Joshua D. Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564 Consulting, Training, Support, Custom Development, Engineering If the world pushes look it in the eye and GRR. Then push back harder. - Salamander -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Block-level CRC checks
Bruce Momjian br...@momjian.us writes: OK, here is another idea, maybe crazy: When we read in a page that has an invalid CRC, we check the page to see which hint bits are _not_ set, and we try setting them to see if we can get a matching CRC. If there are no missing hint bits and the CRC doesn't match, we know the page is corrupt. If two hint bits are missing, we can try setting one and both of them and see if we can get a matching CRC. If we can, the page is OK, if not, it is corrupt. Now if 32 hint bits are missing, but could be based on transaction status, then we would need 2^32 possible hint bit combinations, so we can't do the test and we just assume the page is valid. A typical page is going to have something like 100 tuples, so potentially 2^400 combinations to try. I don't see this being realistic from that standpoint. What's much worse is that to even find the potentially missing hint bits, you need to make very strong assumptions about the validity of the rest of the page. The suggestions that were made upthread about moving the hint bits could resolve the second objection, but once you do that you might as well just exclude them from the CRC and eliminate the guessing. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Block-level CRC checks
Simon Riggs si...@2ndquadrant.com writes: On Tue, 2009-12-01 at 13:05 -0500, Bruce Momjian wrote: When we read in a page that has an invalid CRC, we check the page to see which hint bits are _not_ set, and we try setting them to see if we can get a matching CRC. Perhaps we could store a sector-based parity bit for each 512 bytes in the block: set the parity bit if an even number of hint bits is set, unset it if odd. So whenever we set a hint bit we flip the parity bit for that sector. That way we could detect which sectors are potentially missing in an effort to minimize the number of combinations we need to test. Actually, the killer problem with *any* scheme involving guessing is that each bit you guess translates directly to removing one bit of confidence from the CRC value. If you try to guess at as many as 32 bits, it is practically guaranteed that you will find a combination that makes a 32-bit CRC appear to match. Well before that, you have degraded the reliability of the error detection to the point that there's no point. The bottom line here seems to be that the only practical way to do anything like this is to move the hint bits into their own area of the page, and then exclude them from the CRC. Are we prepared to once again blow off any hope of in-place update for another release cycle? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
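Tom's confidence argument can be quantified: each guessed bit doubles the number of candidate pages tested against the CRC, so on a genuinely corrupt page the chance that some combination matches by pure luck is roughly 1 - (1 - 2^-32)^(2^k) for k guessed bits. A back-of-envelope sketch:

```python
# Probability that at least one of 2**k guessed hint-bit combinations
# matches a 32-bit CRC purely by chance on a genuinely corrupt page,
# modelling each trial as an independent 1-in-2^32 event.
def false_match_prob(k, crc_bits=32):
    trials = 2 ** k
    return 1.0 - (1.0 - 2.0 ** -crc_bits) ** trials

assert false_match_prob(0) < 1e-9   # no guessing: CRC fully trusted
assert false_match_prob(16) < 1e-4  # 16 guessed bits: confidence eroding
assert false_match_prob(32) > 0.6   # 32 guessed bits: a bogus match is expected
```

At k = 32 the expression approaches 1 - 1/e, i.e. a corrupt page would be declared clean more often than not, which is Tom's "practically guaranteed" point.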
Re: [HACKERS] Block-level CRC checks
* Tom Lane t...@sss.pgh.pa.us [091201 13:58]: Actually, the killer problem with *any* scheme involving guessing is that each bit you guess translates directly to removing one bit of confidence from the CRC value. If you try to guess at as many as 32 bits, it is practically guaranteed that you will find a combination that makes a 32-bit CRC appear to match. Well before that, you have degraded the reliability of the error detection to the point that there's no point. Exactly. The bottom line here seems to be that the only practical way to do anything like this is to move the hint bits into their own area of the page, and then exclude them from the CRC. Are we prepared to once again blow off any hope of in-place update for another release cycle? Well, *I* think if we're ever going to have really reliable in-place upgrades that we can expect to function release after release, we're going to need to be able to read in old version pages, and convert them to current version pages, for some set of old versions (I'd be happy with $VERSION-1)... But I don't see that happening any time soon... But I'm not loading TB of data either, my largest clusters are a couple of gigs, so I acknowledge my priorities are probably quite different than some of the companies driving a lot of the heavy development. a. -- Aidan Van Dyk Create like a god, ai...@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
Re: [HACKERS] Block-level CRC checks
On Tue, Dec 1, 2009 at 6:41 PM, Tom Lane t...@sss.pgh.pa.us wrote: Bruce Momjian br...@momjian.us writes: OK, here is another idea, maybe crazy: When we read in a page that has an invalid CRC, we check the page to see which hint bits are _not_ set, and we try setting them to see if we can get a matching CRC. Unfortunately you would also have to try *unsetting* every hint bit as well since the updated hint bits might have made it to disk but not the CRC, leaving the old CRC for the block with the unset bits. I actually independently had the same thought today that Simon had of moving the hint bits to the line pointer. We can obtain more free bits in the line pointers by dividing the item offsets and sizes by maxalign if we need it. That should give at least 4 spare bits which is all we need for the four VALID/INVALID hint bits. It should be relatively cheap to skip the hint bits in the line pointers since they'll be the same bits of every 16-bit value for a whole range. Alternatively we could just CRC the tuples and assume a corrupted line pointer will show itself quickly. That would actually make it faster than a straight CRC of the whole block -- making lemonade out of lemons as it were. There's still the all-tuples-in-page-are-visible hint bit and the hint bits in btree pages. I'm not sure if those are easier or harder to solve. We might be able to assume the all-visible flag will not be torn from the crc as long as they're within the same 512-byte sector. And iirc the btree hint bits are in the line pointers themselves as well? Another thought is that we could use the MSSQL-style torn page detection of including a counter (or even a bit?) in every 512-byte chunk which gets incremented every time the page is written. If they don't all match when read in then the page was torn and we can't check the CRC. That gets us the advantage that we can inform the user that a torn page was detected so they know that they must absolutely use full_page_writes on their system. 
Currently users are in the dark about whether their system is susceptible to them or not, and have no idea with what frequency they occur. Even here there are quite divergent opinions about their frequency and which systems are susceptible to them or immune. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
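Greg's maxalign observation, sketched below under the assumption that stored sizes are rounded up to MAXALIGN units (the exact length would then be recoverable only to alignment granularity): with 8-byte alignment the low 3 bits of every offset are always zero, so storing `off // 8` and the rounded size in maxalign units frees more than the 4 bits needed for the XMIN/XMAX hint bits.

```python
MAXALIGN = 8  # typical; aligned offsets always have their low 3 bits zero

def pack(off, size, hints):
    """Hypothetical line-pointer packing: store the offset and the
    maxalign-rounded size in 8-byte units, freeing 3+3 bits, 4 of
    which hold the four VALID/INVALID hint bits."""
    assert off % MAXALIGN == 0
    assert hints < 16
    size_units = (size + MAXALIGN - 1) // MAXALIGN
    return (off // MAXALIGN, size_units, hints)

off_units, size_units, hints = pack(off=7992, size=61, hints=0b1010)
assert off_units * MAXALIGN == 7992  # offset fully recoverable
assert size_units * MAXALIGN == 64   # size recoverable to alignment granularity
assert hints == 0b1010
```

Because the hint bits then occupy the same positions in every packed value, a CRC loop could mask them with a constant, which is the "same bits of every 16-bit value" cheapness Greg mentions.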
Re: [HACKERS] Block-level CRC checks
Greg Stark gsst...@mit.edu writes: Another thought is that would could use the MSSQL-style torn page detection of including a counter (or even a bit?) in every 512-byte chunk which gets incremented every time the page is written. I think we can dismiss that idea, or any idea involving a per-page status value, out of hand. The implications for tuple layout are just too messy. I'm not especially thrilled with making assumptions about the underlying device block size anyway. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Block-level CRC checks
All, I feel strongly that we should be verifying pages on write, or at least providing the option to do so, because hardware is simply not reliable. And a lot of our biggest users are having issues; it seems pretty much guaranteed that if you have more than 20 postgres servers, at least one of them will have bad memory, bad RAID and/or a bad driver. (and yes, InnoDB, DB2 and Oracle do provide tools to detect hardware corruption when it happens. Oracle even provides correction tools. We are *way* behind them in this regard) There are two primary conditions we are testing for: (a) bad RAM, which happens as frequently as 8% of the time on commodity servers, and given a sufficient amount of RAM happens 99% of the time due to quantum effects, and (b) bad I/O, in the form of bad drivers, bad RAID, and/or bad disks. Our users want to potentially take two degrees of action on this: 1. detect the corruption immediately when it happens, so that they can effectively troubleshoot the cause of the corruption, and potentially shut down the database before further corruption occurs and while they still have clean backups. 2. make an attempt to fix the corrupted page before/immediately after it is written. Further, based on talking to some of these users who are having chronic and not-debuggable issues on their sets of 100's of PostgreSQL servers, there are some other specs: -- Many users would be willing to sacrifice significant performance (up to 20%) as a start-time option in order to be corruption-proof. -- Even more users would only be interested in using the anti-corruption options after they know they have a problem, to troubleshoot it, and then turn the corruption detection back off. So, based on my conversations with users, what we really want is a solution which does (1) for both (a) and (b) as a start-time option, and having significant performance overhead for this option is OK. Now, do block-level CRCs qualify? 
The problem I have with CRC checks is that it only detects bad I/O, and is completely unable to detect data corruption due to bad memory. This means that really we want a different solution which can detect both bad RAM and bad I/O, and should only fall back on CRC checks if we're unable to devise one. One of the things Simon and I talked about in Japan is that most of the time, data corruption makes the data page and/or tuple unreadable. So, checking data format for readable pages and tuples (and index nodes) both before and after write to disk (the latter would presumably be handled by the bgwriter and/or checkpointer) would catch a lot of kinds of corruption before they had a chance to spread. However, that solution would not detect subtle corruption, like single-bit-flipping issues caused by quantum errors. Also, it would require reading back each page as it's written to disk, which is OK for a bunch of single-row writes, but for bulk data loads a significant problem. So, what I'm saying is that I think we really want a better solution, and am throwing this out there to see if anyone is clever enough. --Josh Berkus -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Block-level CRC checks
Josh Berkus j...@agliodbs.com wrote: And a lot of our biggest users are having issues; it seems pretty much guarenteed that if you have more than 20 postgres servers, at least one of them will have bad memory, bad RAID and/or a bad driver. Huh?!? We have about 200 clusters running on about 100 boxes, and we see that very rarely. On about 100 older boxes, relegated to less critical tasks, we see a failure maybe three or four times per year. It's usually not subtle, and a sane backup and redundant server policy has kept us from suffering much pain from these. I'm not questioning the value of adding features to detect corruption, but your numbers are hard to believe. The problem I have with CRC checks is that it only detects bad I/O, and is completely unable to detect data corruption due to bad memory. This means that really we want a different solution which can detect both bad RAM and bad I/O, and should only fall back on CRC checks if we're unable to devise one. md5sum of each tuple? As an optional system column (a la oid)? checking data format for readable pages and tuples (and index nodes) both before and after write to disk Given that PostgreSQL goes through the OS, and many of us are using RAID controllers with BBU RAM, how do you do a read with any confidence that it came from the disk? (I mean, I know how to do that for a performance test, but as a routine step during production use?) -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Block-level CRC checks
On Tue, Dec 1, 2009 at 7:19 PM, Josh Berkus j...@agliodbs.com wrote: However, that solution would not detect subtle corruption, like single-bit-flipping issues caused by quantum errors. Well there is a solution for this, ECC RAM. There's *no* software solution for it. The corruption can just as easily happen the moment you write the value before you calculate any checksum or in the register holding the value before you even write it. Or it could occur the moment after you finish checking the checksum. Also you're not going to be able to be sure you're checking the actual dram and not the L2 cache or the processor's L1/L0 caches. ECC RAM solves this problem properly and it does work. There's not much point in paying a much bigger cost for an ineffective solution. Also, it would require reading back each page as it's written to disk, which is OK for a bunch of single-row writes, but for bulk data loads a significant problem. Not sure what that really means for Postgres. It would just mean reading back the same page of memory from the filesystem cache that we just read. It sounds like you're describing fsyncing every single page to disk and then wait 1min/7200 or even 1min/15k to do a direct read for every single page -- that's not a 20% performance hit though. We would have to change our mascot from the elephant to a snail I think. You could imagine a more complex solution where you have a separate process wait until the next checkpoint then do direct reads for all the blocks written since the previous checkpoint (which have now been fsynced) and verify that the block on disk has a verifiable CRC. I'm not sure even direct reads let you get the block on disk if someone else has written the block into cache though. If you could then this sounds like it could be made to work efficiently (with sequential bitmap-style scans) and could be quite handy. What I like about that is you could deprioritize this process's i/o so that it didn't impact the main processing. 
As things stand this wouldn't detect pages written because they were dirtied by hint bit updates but that could be addressed a few different ways. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
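The background verifier Greg describes might be structured roughly as below. This is a sketch under heavy assumptions he himself flags: that the set of blocks dirtied since the last checkpoint is tracked somewhere, and that a direct read actually bypasses the cache and returns the on-disk bytes. The block layout (4-byte little-endian CRC at the front) and all names here are hypothetical.

```python
import zlib

def verify_after_checkpoint(read_block_direct, dirty_since_last_ckpt):
    """After a checkpoint has fsynced everything, re-read each block that
    was written and confirm its stored CRC.  Returns the block ids that
    fail, for reporting rather than immediate error."""
    bad = []
    for blkno in sorted(dirty_since_last_ckpt):  # sequential-ish I/O order
        page = read_block_direct(blkno)
        stored = int.from_bytes(page[0:4], "little")
        if stored != zlib.crc32(bytes(page[4:])):
            bad.append(blkno)
    return bad

# Toy storage standing in for direct reads from disk:
def make_page(payload):
    page = bytearray(8192)
    page[4:4 + len(payload)] = payload
    page[0:4] = zlib.crc32(bytes(page[4:])).to_bytes(4, "little")
    return page

disk = {1: make_page(b"ok"), 2: make_page(b"ok")}
disk[2][100] ^= 0xFF  # simulate on-disk corruption after the write

assert verify_after_checkpoint(lambda b: disk[b], {1, 2}) == [2]
```

Running this in a separate, I/O-deprioritized process is what would keep it off the critical path, per Greg's suggestion.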
Re: [HACKERS] Block-level CRC checks
On Tue, Dec 1, 2009 at 2:06 PM, Aidan Van Dyk ai...@highrise.ca wrote: Well, *I* think if we're ever going to have really reliable in-place upgrades that we can expect to function release after release, we're going to need to be able to read in old version pages, and convert them to current version pages, for some set of old version (I'ld be happy with $VERSION-1)... But I don't see that happening any time soon... I agree. I've attempted to make this point before - as has Zdenek - and been scorned for it, but I don't think it's become any less true for all of that. I don't think you have to look much further than the limitations on upgrading from 8.3 to 8.4 to conclude that the current strategy is always going to be pretty hit or miss. http://cvs.pgfoundry.org/cgi-bin/cvsweb.cgi/~checkout~/pg-migrator/pg_migrator/README?rev=1.59content-type=text/plain ...Robert -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Block-level CRC checks
On Tue, Dec 1, 2009 at 7:51 PM, Robert Haas robertmh...@gmail.com wrote: On Tue, Dec 1, 2009 at 2:06 PM, Aidan Van Dyk ai...@highrise.ca wrote: Well, *I* think if we're ever going to have really reliable in-place upgrades that we can expect to function release after release, we're going to need to be able to read in old version pages, and convert them to current version pages, for some set of old versions (I'd be happy with $VERSION-1)... But I don't see that happening any time soon... I agree. I've attempted to make this point before - as has Zdenek - and been scorned for it, but I don't think it's become any less true for all of that. I don't think you have to look much further than the limitations on upgrading from 8.3 to 8.4 to conclude that the current strategy is always going to be pretty hit or miss. I find that hard to understand. I believe the consensus is that an on-demand page-level migration strategy like Aidan described is precisely the plan when it's necessary to handle page format changes. There were no page format changes for 8.3-8.4 however so there's no point writing dead code until it actually has anything to do. And there was no point writing it for previous releases because there was pg_migrator anyways. Zdenek's plan was basically the same but he wanted the backend to be able to handle any version page directly without conversion any time. Pointing at the 8.3 pg_migrator limitations is irrelevant -- every single one of those limitations would not be addressed by a page-level migration code path. They are all data-type redefinitions that can't be fixed without understanding the table structure and definition. These limitations would all require adding code to the new version to handle the old data types and their behaviour and to convert them to the new datatypes when a tuple is rewritten. In some cases this is really not easy at all. 
-- greg
Re: [HACKERS] Block-level CRC checks
On Tue, Dec 1, 2009 at 8:04 PM, Greg Stark gsst...@mit.edu wrote: And there was no point writing it for previous releases because there was **no** pg_migrator anyways. oops -- greg
Re: [HACKERS] Block-level CRC checks
On Tue, Dec 1, 2009 at 3:04 PM, Greg Stark gsst...@mit.edu wrote: I find that hard to understand. I believe the consensus is that an on-demand page-level migration strategy like Aidan described is precisely the plan when it's necessary to handle page format changes. There were no page format changes for 8.3-8.4, however, so there's no point writing dead code until it actually has anything to do. And there was no point writing it for previous releases because there was pg_migrator anyways. Zdenek's plan was basically the same but he wanted the backend to be able to handle any version page directly without conversion any time. Pointing at the 8.3 pg_migrator limitations is irrelevant -- not one of those limitations would be addressed by a page-level migration code path. They are all data-type redefinitions that can't be fixed without understanding the table structure and definition. These limitations would all require adding code to the new version to handle the old data types and their behaviour and to convert them to the new datatypes when a tuple is rewritten. In some cases this is really not easy at all. OK, fair enough. My implication that only page formats were at issue was off-base. My underlying point was that I think we have to be prepared to write code that can understand old binary formats (on the tuple, page, or relation level) if we want this to work and work reliably. I believe that there has been much resistance to that idea. If I am wrong, great! ...Robert
Re: [HACKERS] Block-level CRC checks
Robert Haas robertmh...@gmail.com writes: OK, fair enough. My implication that only page formats were at issue was off-base. My underlying point was that I think we have to be prepared to write code that can understand old binary formats (on the tuple, page, or relation level) if we want this to work and work reliably. I believe that there has been much resistance to that idea. We keep looking for cheaper alternatives. There may not be any... regards, tom lane
Re: [HACKERS] Block-level CRC checks
Tom Lane wrote: Robert Haas robertmh...@gmail.com writes: OK, fair enough. My implication that only page formats were at issue was off-base. My underlying point was that I think we have to be prepared to write code that can understand old binary formats (on the tuple, page, or relation level) if we want this to work and work reliably. I believe that there has been much resistance to that idea. We keep looking for cheaper alternatives. There may not be any... Yeah. I think we might need to bite the bullet on that and start thinking more about different strategies for handling page versioning to satisfy various needs. I've been convinced for a while that some sort of versioning scheme is inevitable, but I do understand the reluctance. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Block-level CRC checks
Andrew Dunstan wrote: Tom Lane wrote: Robert Haas robertmh...@gmail.com writes: OK, fair enough. My implication that only page formats were at issue was off-base. My underlying point was that I think we have to be prepared to write code that can understand old binary formats (on the tuple, page, or relation level) if we want this to work and work reliably. I believe that there has been much resistance to that idea. We keep looking for cheaper alternatives. There may not be any... Yeah. I think we might need to bite the bullet on that and start thinking more about different strategies for handling page versioning to satisfy various needs. I've been convinced for a while that some sort of versioning scheme is inevitable, but I do understand the reluctance. I always felt our final solution would be a combination of pg_migrator for system catalog changes and page format conversion for page changes. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Re: [HACKERS] Block-level CRC checks
Tom Lane wrote: Bruce Momjian br...@momjian.us writes: OK, here is another idea, maybe crazy: When we read in a page that has an invalid CRC, we check the page to see which hint bits are _not_ set, and we try setting them to see if we can get a matching CRC. If there are no missing hint bits and the CRC doesn't match, we know the page is corrupt. If two hint bits are missing, we can try setting one and both of them and see if we can get a matching CRC. If we can, the page is OK; if not, it is corrupt. Now if 32 hint bits are missing but could be set based on transaction status, then we would need to try 2^32 possible hint bit combinations, so we can't do the test and we just assume the page is valid. A typical page is going to have something like 100 tuples, so potentially 2^400 combinations to try. I don't see this being realistic from that standpoint. What's much worse is that to even find the potentially missing hint bits, you need to make very strong assumptions about the validity of the rest of the page. The suggestions that were made upthread about moving the hint bits could resolve the second objection, but once you do that you might as well just exclude them from the CRC and eliminate the guessing. OK, crazy idea #3. What if we had a per-page counter of the number of hint bits set --- that way, we would only consider a CRC check failure to be corruption if the count matched the hint bit count on the page. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
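[Editor's note] The combinatorial explosion Tom describes is easy to see in a toy model. The sketch below is illustrative Python, not PostgreSQL code: the page is a plain byte buffer, the hint-bit positions are hypothetical, and zlib's CRC-32 stands in for whatever checksum the page would carry. Each missing hint bit doubles the number of candidate pages to test, so ~100 tuples with four hint bits apiece gives on the order of 2^400 candidates.

```python
import itertools
import zlib


def crc_matches_with_hints(page, stored_crc, hint_positions):
    """Brute-force Bruce's idea: try every combination of the currently
    unset hint bits (given as hypothetical (byte, bit) positions) and see
    whether any combination reproduces the stored CRC.  The search space
    is 2**len(hint_positions) candidate pages."""
    for k in range(len(hint_positions) + 1):
        for combo in itertools.combinations(hint_positions, k):
            candidate = bytearray(page)
            for byte, bit in combo:
                candidate[byte] |= 1 << bit
            if zlib.crc32(bytes(candidate)) == stored_crc:
                return True
    return False


# Simulate a torn write that lost one hint bit after the CRC was computed:
page = bytearray(32)
page[3] |= 1                        # hint bit was set when CRC was taken
stored = zlib.crc32(bytes(page))
page[3] &= 0xFE                     # ...but never made it to disk
recovered = crc_matches_with_hints(page, stored, [(3, 0)])
```

With one candidate bit the loop tries 2 pages; with 400 candidate bits it would try 2^400, which is Tom's point.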
Re: [HACKERS] Block-level CRC checks
Bruce Momjian br...@momjian.us writes: OK, crazy idea #3. What if we had a per-page counter of the number of hint bits set --- that way, we would only consider a CRC check failure to be corruption if the count matched the hint bit count on the page. Seems like rather a large hole in the ability to detect corruption. In particular, this again assumes that you can accurately locate all the hint bits in a page whose condition is questionable. Pick up the wrong bits, you'll come to the wrong conclusion --- and the default behavior you propose here is the wrong result. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Block-level CRC checks
Bruce Momjian wrote: Tom Lane wrote: The suggestions that were made upthread about moving the hint bits could resolve the second objection, but once you do that you might as well just exclude them from the CRC and eliminate the guessing. OK, crazy idea #3. What if we had a per-page counter of the number of hint bits set --- that way, we would only consider a CRC check failure to be corruption if the count matched the hint bit count on the page. Can I piggy-back on Bruce's crazy idea and ask a stupid question? Why are we writing out the hint bits to disk anyway? Is it really so slow to calculate them on read + cache them that it's worth all this trouble? Are they not also to blame for the "write my import data twice" feature? -- Richard Huxton Archonet Ltd
Page-level version upgrade (was: [HACKERS] Block-level CRC checks)
On Dec 1, 2009, at 12:58 PM, Tom Lane wrote: The bottom line here seems to be that the only practical way to do anything like this is to move the hint bits into their own area of the page, and then exclude them from the CRC. Are we prepared to once again blow off any hope of in-place update for another release cycle? What happened to the work that was being done to allow a page to be upgraded on the fly when it was read in from disk? -- Jim C. Nasby, Database Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Block-level CRC checks
Tom Lane wrote: Bruce Momjian br...@momjian.us writes: OK, crazy idea #3. What if we had a per-page counter of the number of hint bits set --- that way, we would only consider a CRC check failure to be corruption if the count matched the hint bit count on the page. Seems like rather a large hole in the ability to detect corruption. In particular, this again assumes that you can accurately locate all the hint bits in a page whose condition is questionable. Pick up the wrong bits, you'll come to the wrong conclusion --- and the default behavior you propose here is the wrong result. I was assuming any update of hint bits would update the per-page counter so it would always be accurate. However, I seem to remember we don't lock the page when updating hint bits, so that wouldn't work. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Re: [HACKERS] Block-level CRC checks
On Tue, Dec 1, 2009 at 9:57 PM, Richard Huxton d...@archonet.com wrote: Why are we writing out the hint bits to disk anyway? Is it really so slow to calculate them on read + cache them that it's worth all this trouble? Are they not also to blame for the "write my import data twice" feature? It would be interesting to experiment with different strategies. But the results would depend a lot on workloads and I doubt one strategy is best for everyone. It has often been suggested that we could set the hint bits but not dirty the page, so they would never be written out unless some other update hit the page. In most use cases that would probably result in the right thing happening where we avoid half the writes but still stop doing transaction status lookups relatively promptly. The scary thing is that there might be use cases such as statically loaded data, where the hint bits never get set and every scan of the page has to recheck those statuses until the tuples are frozen. (Not dirtying the page almost gets us out of the CRC problems -- it doesn't in our current setup because we don't take a lock when setting the hint bits, so you could set it on a page someone is in the middle of CRC checking and writing. There were other solutions proposed for that, including just making hint bits require locking the page or double buffering the write.) There does need to be something like the hint bits which does eventually have to be set because we can't keep transaction information around forever. Even if you keep the transaction information all the way back to the last freeze date (up to about 1GB and change I think) then the data has to be written twice, the second time to freeze the transactions. In the worst case then reading a page requires a random page access (or two) from anywhere in that 1GB+ file for each tuple on the page (whether visible to us or not). 
-- greg
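[Editor's note] Greg's set-the-hint-but-don't-dirty idea can be modelled in a few lines. This is a toy Python buffer manager, purely illustrative -- none of these names exist in PostgreSQL -- but it shows the intended behaviour: a hint update alone never forces a write, while any real change carries the accumulated hints to disk with it.

```python
class Buffer:
    """Toy buffer: tracks a dirty flag and counts simulated disk writes."""

    def __init__(self, data):
        self.data = bytearray(data)
        self.dirty = False
        self.writes = 0

    def set_hint(self, byte, bit):
        self.data[byte] |= 1 << bit   # update in memory only...
        # ...and deliberately do NOT set self.dirty

    def real_update(self, byte, value):
        self.data[byte] = value
        self.dirty = True             # real changes must reach disk

    def checkpoint(self):
        if self.dirty:
            self.writes += 1          # pretend to write the page out
            self.dirty = False


buf = Buffer(b"\x00" * 16)
buf.set_hint(0, 2)
buf.checkpoint()        # hint alone: nothing written
buf.real_update(1, 7)
buf.checkpoint()        # real update: one write, hint piggy-backs on it
```

Greg's "scary" case is the read-only page: in this model its hints would never reach disk, so every fresh read of the page repeats the status lookups until freezing.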
Re: [HACKERS] Block-level CRC checks
On Dec 1, 2009, at 1:39 PM, Kevin Grittner wrote: Josh Berkus j...@agliodbs.com wrote: And a lot of our biggest users are having issues; it seems pretty much guaranteed that if you have more than 20 postgres servers, at least one of them will have bad memory, bad RAID and/or a bad driver. Huh?!? We have about 200 clusters running on about 100 boxes, and we see that very rarely. On about 100 older boxes, relegated to less critical tasks, we see a failure maybe three or four times per year. It's usually not subtle, and a sane backup and redundant server policy has kept us from suffering much pain from these. I'm not questioning the value of adding features to detect corruption, but your numbers are hard to believe. That's just your experience. Others have had different experiences. And honestly, bickering about exact numbers misses Josh's point completely. Postgres is seriously lacking in its ability to detect hardware problems, and hardware *does fail*. And you can't just assume that when it fails it blows up completely. We really do need some capability for detecting errors. The problem I have with CRC checks is that they only detect bad I/O, and are completely unable to detect data corruption due to bad memory. This means that really we want a different solution which can detect both bad RAM and bad I/O, and should only fall back on CRC checks if we're unable to devise one. "md5sum of each tuple? As an optional system column (a la oid)" That's a possibility. As Josh mentioned, some people will pay a serious performance hit to ensure that their data is safe and correct. The CRC proposal was intended as a middle-of-the-road approach that would at least tell you that your hardware was probably OK. There's certainly more that could be done. Also, I think some means of detecting torn pages would be very welcome. If this was done at the storage manager level it would probably be fairly transparent to the rest of the code. 
checking data format for readable pages and tuples (and index nodes) both before and after write to disk Given that PostgreSQL goes through the OS, and many of us are using RAID controllers with BBU RAM, how do you do a read with any confidence that it came from the disk? (I mean, I know how to do that for a performance test, but as a routine step during production use?) You'd probably need to go to some kind of stand-alone or background process that slowly reads and verifies the entire database. Unfortunately at that point you could only detect corruption and not correct it, but it'd still be better than nothing. -- Jim C. Nasby, Database Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Block-level CRC checks
Greg Stark wrote: On Tue, Dec 1, 2009 at 6:41 PM, Tom Lane t...@sss.pgh.pa.us wrote: Bruce Momjian br...@momjian.us writes: OK, here is another idea, maybe crazy: When we read in a page that has an invalid CRC, we check the page to see which hint bits are _not_ set, and we try setting them to see if we can get a matching CRC. Unfortunately you would also have to try *unsetting* every hint bit as well, since the updated hint bits might have made it to disk but not the CRC, leaving the old CRC for the block with the unset bits. I actually independently had the same thought today that Simon had of moving the hint bits to the line pointer. We can obtain more free bits in the line pointers by dividing the item offsets and sizes by maxalign if we need it. That should give at least 4 spare bits, which is all we need for the four VALID/INVALID hint bits. It should be relatively cheap to skip the hint bits in the line pointers since they'll be the same bits of every 16-bit value for a whole range. Alternatively we could just CRC the tuples and assume a corrupted line pointer will show itself quickly. That would actually make it faster than a straight CRC of the whole block -- making lemonade out of lemons as it were. Yea, I am thinking we would have to have the hint bits in the line pointers --- if not, we would have to reserve a lot of free space to hold the maximum number of tuple hint bits --- seems like a waste. I also like the idea that we don't need to CRC check the line pointers because any corruption there is going to appear immediately. However, the bad news is that we wouldn't find the corruption until we try to access bad data and might crash. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
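[Editor's note] Greg's spare-bit arithmetic checks out: ItemIdData packs a 15-bit offset, 2 flag bits, and a 15-bit length into 32 bits, and since tuple offsets and lengths are MAXALIGNed anyway (typically 8 bytes), storing them divided by MAXALIGN frees three bits on each side. A hypothetical repacking, sketched in Python rather than C -- this is not the real ItemIdData layout, just one way to use four of the recovered bits for hints:

```python
MAXALIGN = 8  # assumption: 8-byte alignment, as on most 64-bit platforms


def pack_lp(off, length, flags, hints):
    """Pack a 32-bit line pointer whose offset and length are stored
    divided by MAXALIGN, freeing the low 4 bits for hint flags.
    Field layout (invented): off[17:], flags[15:17], len[4:15], hints[0:4]."""
    assert off % MAXALIGN == 0 and length % MAXALIGN == 0
    assert 0 <= hints < 16
    return ((off // MAXALIGN) << 17) | (flags << 15) \
        | ((length // MAXALIGN) << 4) | hints


def unpack_lp(lp):
    """Invert pack_lp: recover (off, length, flags, hints)."""
    return ((lp >> 17) * MAXALIGN,
            ((lp >> 4) & 0x7FF) * MAXALIGN,
            (lp >> 15) & 0x3,
            lp & 0xF)
```

With an 8 kB page, off/MAXALIGN and len/MAXALIGN each fit in 10 bits, so the packing above even leaves slack beyond the four hint bits Greg asks for.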
Re: [HACKERS] Block-level CRC checks
Greg Stark wrote: On Tue, Dec 1, 2009 at 9:57 PM, Richard Huxton d...@archonet.com wrote: Why are we writing out the hint bits to disk anyway? Is it really so slow to calculate them on read + cache them that it's worth all this trouble? Are they not also to blame for the write my import data twice feature? It would be interesting to experiment with different strategies. But the results would depend a lot on workloads and I doubt one strategy is best for everyone. It has often been suggested that we could set the hint bits but not dirty the page, so they would never be written out unless some other update hit the page. In most use cases that would probably result in the right thing happening where we avoid half the writes but still stop doing transaction status lookups relatively promptly. The scary thing is that there might be use cases such as static data loaded where the hint bits never get set and every scan of the page has to recheck those statuses until the tuples are frozen. And how scary is that? Assuming we cache the hints... 1. With the page itself, so same lifespan 2. Separately, perhaps with a different (longer) lifespan. Separately would then let you trade complexity for compactness - all of block B is deleted, all of table T is visible. So what is the cost of calculating the hint-bits for a whole block of tuples in one go vs reading that block from actual spinning disk? There does need to be something like the hint bits which does eventually have to be set because we can't keep transaction information around forever. Even if you keep the transaction information all the way back to the last freeze date (up to about 1GB and change I think) then the data has to be written twice, the second time is to freeze the transactions. In the worst case then reading a page requires a random page access (or two) from anywhere in that 1GB+ file for each tuple on the page (whether visible to us or not). 
While on that topic - I'm assuming freezing requires substantially more effort than updating hint bits? -- Richard Huxton Archonet Ltd
Re: [HACKERS] Block-level CRC checks
Bruce Momjian br...@momjian.us writes: Greg Stark wrote: It should be relatively cheap to skip the hint bits in the line pointers since they'll be the same bits of every 16-bit value for a whole range. Alternatively we could just CRC the tuples and assume a corrupted line pointer will show itself quickly. That would actually make it faster than a straight CRC of the whole block -- making lemonade out of lemons as it were. I don't think "relatively cheap" is the right criterion here --- the question to me is how many assumptions are you making in order to compute the page's CRC. Each assumption degrades the reliability of the check, not to mention creating another maintenance hazard. Yea, I am thinking we would have to have the hint bits in the line pointers --- if not, we would have to reserve a lot of free space to hold the maximum number of tuple hint bits --- seems like a waste. Not if you're willing to move the line pointers around. I'd envision an extra pointer in the page header, with a layout along the lines of: [fixed-size page header][hint bits][line pointers][free space][tuples proper][special space], with the CRC covering everything except the hint bits and perhaps the free space (depending on whether you wanted to depend on two more pointers to be right). We would have to move the line pointers anytime we needed to grow the hint-bit space, and there would be a straightforward tradeoff between how often to move the pointers versus how much potentially-wasted space we leave at the end of the hint area. Or we could put the hint bits after the pointers, which might be better because the hints would be smaller == cheaper to move. 
If you think that behavior is acceptable, we can skip this whole discussion. regards, tom lane
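[Editor's note] Under a layout like Tom's, the checksum code stays trivially simple: it skips one contiguous region whose bounds come from the header. A minimal Python sketch -- the field sizes are invented and zlib's CRC-32 stands in for whatever checksum the page would actually carry:

```python
import zlib


def page_crc(page, hint_start, hint_end):
    """CRC the whole page except the hint-bit area delimited by two
    (hypothetical) header fields."""
    return zlib.crc32(page[:hint_start] + page[hint_end:])


# Layout per Tom's sketch:
#   [header][hint bits][line pointers][free space][tuples][special]
HDR = 16                               # pretend fixed-size header
page = bytearray(8192)
crc = page_crc(bytes(page), HDR, HDR + 64)

page[HDR + 3] |= 0x40                  # set a hint bit: CRC unaffected
hint_ok = page_crc(bytes(page), HDR, HDR + 64) == crc

page[4000] ^= 0xFF                     # corrupt a tuple byte: CRC catches it
corruption_caught = page_crc(bytes(page), HDR, HDR + 64) != crc
```

The only trust placed in the page is the pair of hint-area bounds, which is exactly the "one or two header fields" level of dependence discussed later in the thread.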
Re: [HACKERS] Block-level CRC checks
Tom Lane wrote: Bruce Momjian br...@momjian.us writes: Greg Stark wrote: It should be relatively cheap to skip the hint bits in the line pointers since they'll be the same bits of every 16-bit value for a whole range. Alternatively we could just CRC the tuples and assume a corrupted line pointer will show itself quickly. That would actually make it faster than a straight CRC of the whole block -- making lemonade out of lemons as it were. I don't think relatively cheap is the right criterion here --- the question to me is how many assumptions are you making in order to compute the page's CRC. Each assumption degrades the reliability of the check, not to mention creating another maintenance hazard. Yea, I am thinking we would have to have the hint bits in the line pointers --- if not, we would have to reserve a lot of free space to hold the maximum number of tuple hint bits --- seems like a waste. Not if you're willing to move the line pointers around. I'd envision an extra pointer in the page header, with a layout along the lines of fixed-size page header hint bits line pointers free space tuples proper special space with the CRC covering everything except the hint bits and perhaps the free space (depending on whether you wanted to depend on two more pointers to be right). We would have to move the line pointers anytime we needed to grow the hint-bit space, and there would be a straightforward tradeoff between how often to move the pointers versus how much potentially-wasted space we leave at the end of the hint area. I assume you don't want the hint bits in the line pointers because we would need to lock the page? Or we could put the hint bits after the pointers, which might be better because the hints would be smaller == cheaper to move. I don't see the value there because you would need to move the hint bits every time you added a new line pointer. The bigger problem is that you would need to lock the page to update the hint bits if they move around on the page. 
I also like the idea that we don't need to CRC check the line pointers because any corruption there is going to appear immediately. However, the bad news is that we wouldn't find the corruption until we try to access bad data and might crash. That sounds exactly like the corruption detection system we have now. If you think that behavior is acceptable, we can skip this whole discussion. Agreed, hence the bad part. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Re: [HACKERS] Block-level CRC checks
Richard Huxton d...@archonet.com writes: So what is the cost of calculating the hint-bits for a whole block of tuples in one go vs reading that block from actual spinning disk? Potentially a couple of hundred times worse, if you're unlucky and each XID on the page requires visiting a different block of clog that's also not in memory. The average case probably isn't that bad, but I think we'd likely be talking at least a factor of two penalty --- you'd be hopelessly optimistic to assume you didn't need at least one clog visit per page. Also, if you want to assume that you're lucky and the XIDs mostly fall within a fairly recent range of clog pages, you're still not out of the woods. In that situation what you are talking about is a spectacular increase in the hit rate for cached clog pages --- which are already a known contention bottleneck in many scenarios. While on that topic - I'm assuming freezing requires substantially more effort than updating hint bits? It's a WAL-logged page change, so at minimum double the cost. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
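[Editor's note] To put rough numbers on Tom's clog-traffic point: transaction status really is 2 bits per XID, so one 8 kB clog page covers 32768 XIDs, and the XIDs on a single heap page can easily map to many distinct clog pages if they span a busy history. A quick sketch -- the constants mirror pg_clog's bookkeeping, but the function itself is illustrative Python, not PostgreSQL code:

```python
CLOG_BITS_PER_XACT = 2
CLOG_PAGE_SIZE = 8192
XACTS_PER_PAGE = CLOG_PAGE_SIZE * 8 // CLOG_BITS_PER_XACT   # 32768


def clog_location(xid):
    """Return (clog_page, byte_in_page, bit_shift) for a transaction ID."""
    page = xid // XACTS_PER_PAGE
    byte = (xid % XACTS_PER_PAGE) * CLOG_BITS_PER_XACT // 8
    shift = (xid % 4) * CLOG_BITS_PER_XACT
    return page, byte, shift


# ~100 tuples whose XIDs are spread over a busy period touch many
# distinct clog pages -- each potentially a separate cache miss:
xids = range(1_000_000, 100_000_000, 1_000_000)
pages = {clog_location(x)[0] for x in xids}
```

Here every one of the 99 sample XIDs lands on a different clog page, which is the worst case Tom describes: one clog visit (or more) per tuple.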
Re: [HACKERS] Block-level CRC checks
Bruce Momjian br...@momjian.us writes: Tom Lane wrote: I don't think relatively cheap is the right criterion here --- the question to me is how many assumptions are you making in order to compute the page's CRC. Each assumption degrades the reliability of the check, not to mention creating another maintenance hazard. I assume you don't want the hint bits in the line pointers because we would need to lock the page? No, I don't want them there because I don't want the CRC check to know so much about the page layout. Or we could put the hint bits after the pointers, which might be better because the hints would be smaller == cheaper to move. I don't see the value there because you would need to move the hint bits every time you added a new line pointer. No, we could add unused line pointers in multiples, exactly the same as we would add unused hint bits in multiples if we did it the other way. I don't know offhand which would be more efficient, but you can't just dismiss one without analysis. The bigger problem is that you would need to lock the page to update the hint bits if they move around on the page. We are already assuming that things aren't moving around when we update a hint bit now. That's what the requirement of shared buffer lock when calling tqual.c is for. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Block-level CRC checks
On Tue, Dec 1, 2009 at 10:47 PM, Tom Lane t...@sss.pgh.pa.us wrote: Bruce Momjian br...@momjian.us writes: Greg Stark wrote: It should be relatively cheap to skip the hint bits in the line pointers since they'll be the same bits of every 16-bit value for a whole range. Alternatively we could just CRC the tuples and assume a corrupted line pointer will show itself quickly. That would actually make it faster than a straight CRC of the whole block -- making lemonade out of lemons as it were. I don't think relatively cheap is the right criterion here --- the question to me is how many assumptions are you making in order to compute the page's CRC. Each assumption degrades the reliability of the check, not to mention creating another maintenance hazard. Well the only assumption here is that we know where the line pointers start and end. That sounds like the same level of assumption as your structure with the line pointers moving around. I agree with your general point though -- trying to skip the hint bits strewn around in the tuples means that every line pointer had better be correct or you'll be in trouble before you even get to the CRC check. Skipping them in the line pointers just means applying a hard-coded mask against each word in that region. It seems to me adding a third structure on the page and then requiring tqual to be able to find that doesn't significantly reduce the complexity over having tqual be able to find the line pointers. And it significantly increases the complexity of every other part of the system which has to deal with a third structure on the page. And adding and compacting the page becomes a lot more complex. I'm also a bit leery about adding more line pointers than necessary because even a small number of extra line pointers will mean you're likely to often fit one fewer tuple on the page. 
-- greg
Re: [HACKERS] Block-level CRC checks
On Dec 1, 2009, at 4:13 PM, Greg Stark wrote: On Tue, Dec 1, 2009 at 9:57 PM, Richard Huxton d...@archonet.com wrote: Why are we writing out the hint bits to disk anyway? Is it really so slow to calculate them on read + cache them that it's worth all this trouble? Are they not also to blame for the write my import data twice feature? It would be interesting to experiment with different strategies. But the results would depend a lot on workloads and I doubt one strategy is best for everyone. I agree that we'll always have the issue with freezing. But I also think it's time to revisit the whole idea of hint bits. AFAIK we only keep at maximum 2B transactions, and each one takes 2 bits in CLOG. So worst-case scenario, we're looking at 4G of clog. On modern hardware, that's not a lot. And that's also assuming that we don't do any kind of compression on that data (obviously we couldn't use just any old compression algorithm, but there's certainly tricks that could be used to reduce the size of this information). I know this is something that folks at EnterpriseDB have looked at, perhaps there's data they can share. It has often been suggested that we could set the hint bits but not dirty the page, so they would never be written out unless some other update hit the page. In most use cases that would probably result in the right thing happening where we avoid half the writes but still stop doing transaction status lookups relatively promptly. The scary thing is that there might be use cases such as static data loaded where the hint bits never get set and every scan of the page has to recheck those statuses until the tuples are frozen. (Not dirtying the page almost gets us out of the CRC problems -- it doesn't in our current setup because we don't take a lock when setting the hint bits, so you could set it on a page someone is in the middle of CRC checking and writing. 
There were other solutions proposed for that, including just making hint bits require locking the page or double buffering the write.) There does need to be something like the hint bits which does eventually have to be set because we can't keep transaction information around forever. Even if you keep the transaction information all the way back to the last freeze date (up to about 1GB and change I think) then the data has to be written twice, the second time is to freeze the transactions. In the worst case then reading a page requires a random page access (or two) from anywhere in that 1GB+ file for each tuple on the page (whether visible to us or not). -- greg -- Jim C. Nasby, Database Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
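[Editor's note] The sizes being quoted follow directly from 2 status bits per XID -- and note the units are bits, not bytes: 2^31 XIDs ("2B transactions") works out to 512 MB of clog, and the full 2^32 XID space to 1 GB, which matches Greg's "about 1GB and change". As arithmetic:

```python
BITS_PER_XACT = 2

clog_bytes_2B = (2**31 * BITS_PER_XACT) // 8   # "2 billion transactions"
clog_bytes_4B = (2**32 * BITS_PER_XACT) // 8   # full 32-bit XID space

# 536870912 bytes == 512 MB; 1073741824 bytes == 1 GB
```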
Re: [HACKERS] Block-level CRC checks
Greg Stark gsst...@mit.edu writes:

> On Tue, Dec 1, 2009 at 10:47 PM, Tom Lane t...@sss.pgh.pa.us wrote:
>> I don't think "relatively cheap" is the right criterion here --- the question to me is how many assumptions are you making in order to compute the page's CRC. Each assumption degrades the reliability of the check, not to mention creating another maintenance hazard.
>
> Well the only assumption here is that we know where the line pointers start and end.

... and what they contain. To CRC a subset of the page at all, we have to put some amount of faith into the page header's pointers. We can do weak checks on those, but only weak ones. If we process different parts of the page differently, we're increasing our trust in those pointers and reducing the quality of the CRC check.

> It seems to me that adding a third structure on the page, and then requiring tqual to be able to find it, doesn't significantly reduce the complexity over having tqual find the line pointers. And it significantly increases the complexity of every other part of the system which has to deal with a third structure on the page. And adding to and compacting the page becomes a lot more complex.

The page compaction logic amounts to a grand total of two not-very-long routines. The vast majority of the code impact from this would come from the problem of finding the out-of-line hint bits for a tuple, which as you say appears about equally hard either way. So I think keeping the CRC logic as simple as possible is good from both a reliability and a performance standpoint.

			regards, tom lane
Re: [HACKERS] Block-level CRC checks
On Wed, Dec 2, 2009 at 12:03 AM, Tom Lane t...@sss.pgh.pa.us wrote:

> ... and what they contain. To CRC a subset of the page at all, we have to put some amount of faith into the page header's pointers. We can do weak checks on those, but only weak ones. If we process different parts of the page differently, we're increasing our trust in those pointers and reducing the quality of the CRC check.

I'm not sure we're on the same page. As I understand it there are three proposals on the table now:

1) Set aside a section of the page to contain only non-checksummed hint bits. That section has to be relocatable, so the CRC check would have to read its start and end address from the page header.

2) Store the hint bits in the line pointers and skip checksumming the line pointers. In that case the CRC check would skip any bytes between the start of the line pointer array and pd_lower (or pd_upper? no point in CRC-checking unused bytes, is there?).

3) Store the hint bits in the line pointers and apply a mask which masks out the 4 hint bits in each 32-bit word in the region between the start of the line pointers and pd_lower (or pd_upper again).

These three options all have the same level of interdependence for the CRC check: each depends on one or two values in the page header to specify a range of bytes in the block. None of them depend on the contents of the line pointers themselves being correct, only on those one or two header fields saying which range of bytes the hint bits lie in.

For what it's worth, I don't think "decreasing the quality of the CRC check" is actually a meaningful distinction here. The bottom line is that in all of the above options, if any pointer is invalid we'll be CRCing a different set of data from the set that originally went into calculating the stored CRC, so we'll effectively compute a random value which has a 1/2^32 chance of matching the stored CRC, regardless of anything else.

-- 
greg
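[Editor's note: proposal 3 above — masking the hint bits out of each line-pointer word before computing the CRC — can be sketched as follows. The bit positions chosen for the hint bits within a 32-bit word are hypothetical, as are the offsets in the usage; only the masking technique itself is taken from the proposal.]

```python
import zlib

# Hypothetical position of the 4 per-line-pointer hint bits within each
# 32-bit word; the real positions would depend on the line pointer layout.
HINT_BITS = 0x0000F000

def page_crc(page: bytes, lp_start: int, pd_lower: int) -> int:
    """CRC the whole page, but mask the hint bits out of every 32-bit word
    in [lp_start, pd_lower) so hint-only changes don't alter the CRC."""
    buf = bytearray(page)
    for off in range(lp_start, pd_lower, 4):
        word = int.from_bytes(buf[off:off + 4], "little") & ~HINT_BITS
        buf[off:off + 4] = word.to_bytes(4, "little")
    return zlib.crc32(bytes(buf))
```

Note the dependence Greg describes: the function trusts only `lp_start` and `pd_lower` (the "one or two values in the page header"), never the contents of the line pointers themselves. Flipping a hint bit leaves the CRC unchanged; flipping any unmasked bit changes it.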
Re: [HACKERS] Block-level CRC checks
* Greg Stark gsst...@mit.edu [091201 20:14]:

> I'm not sure we're on the same page. As I understand it there are three proposals on the table now:
> 1) set aside a section of the page to contain only non-checksummed hint bits [...]
> 2) store the hint bits in the line pointers and skip checking the line pointers [...]
> 3) store the hint bits in the line pointers and apply a mask which masks out the 4 hint bits in each 32-bit word [...]

I'm not intimately familiar with the innards of the pages, but I had *thought* that the original suggestion of moving the hint bits was purely to make sure that they are in the same filesystem block/disk sector as the CRC. That may not be possible, but *if* it is, you avoid the torn-page problem with only one minimal assumption:

- the FS block/disk sector is written whole at a time, or is likely corrupt anyway

With my understanding of disks and platters, I'd assume that if you got a partially written sector, with something preventing it from being completely written, the missing part would be smeared with corruption. And that would seem to hold for flash/SSDs too...

a.

-- 
Aidan Van Dyk                                  ai...@highrise.ca
http://www.highrise.ca/
Create like a god, command like a king, work like a slave.
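[Editor's note: the torn-page reasoning above can be made concrete with a small simulation. The 512-byte sector size and the page image are illustrative assumptions; the point is only that a torn write leaves whole old sectors behind, so co-locating the CRC and the hint bits in one sector makes hint updates atomic with respect to the CRC.]

```python
SECTOR = 512  # assumed atomic write unit of the drive

def same_sector(offset_a: int, offset_b: int) -> bool:
    """True if two byte offsets in a page fall in the same sector, i.e.
    (under the assumption above) the drive writes them atomically together."""
    return offset_a // SECTOR == offset_b // SECTOR

def torn_write(old: bytes, new: bytes, sectors_written: int) -> bytes:
    """Simulate a torn page write: the first n sectors carry the new image,
    the remaining sectors still hold the old image."""
    cut = sectors_written * SECTOR
    return new[:cut] + old[cut:]
```

If the CRC field and the hint bits share a sector, then whichever side of the tear that sector lands on, the reader sees hint bits and CRC from the *same* image, so a hint-only rewrite can never produce a spurious CRC failure.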
Re: [HACKERS] Block-level CRC checks
On Fri, 2008-10-17 at 12:26 -0300, Alvaro Herrera wrote:

> So this discussion died with no solution arising to the hint-bit-setting-invalidates-the-CRC problem. Apparently the only solution in sight is to WAL-log hint bits. Simon opines it would be horrible from a performance standpoint to WAL-log every hint bit set, and I think we all agree with that. So we need to find an alternative mechanism to WAL-log hint bits.

It occurred to me that maybe we don't need to WAL-log the CRC checks.

Proposal

* We reserve enough space on a disk block for a CRC check. When a dirty block is written to disk we calculate and stamp the CRC value, though this is *not* WAL-logged.

* In normal running we re-check the CRC when we read the block back into shared_buffers.

* In recovery we will overwrite the last image of a block from WAL, so we ignore the block CRC check, since the WAL record was already CRC-checked. If full_page_writes = off, we ignore and zero the block's CRC for any block touched during recovery. We do these things because the block CRC in the WAL is likely to differ from the one on disk, due to hints.

* We also re-check the CRC on a block immediately before we dirty it (for any reason). This minimises the chance of in-memory corruption of a block going undetected.

So in the typical case, all blocks moving from disk -> memory and from clean -> dirty are CRC-checked. With full_page_writes = on we then have a good CRC every time. With full_page_writes = off we are exposed only on the blocks that changed during the last checkpoint cycle, and only if we crash. That seems acceptable because most databases are up 99% of the time, so any corruption is likely to occur in normal running, not as a result of a crash.

This would be a run-time option.

Like it?

-- 
Simon Riggs
www.2ndQuadrant.com
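[Editor's note: the write-out and read-in paths of the proposal above can be sketched as follows. The 4-byte CRC slot at the front of the block and zero as the "no valid CRC" sentinel are assumptions made for the sketch, not details from the proposal.]

```python
import zlib

CRC_OFF = 0  # assumed: CRC stored in the first 4 bytes of the block
CRC_LEN = 4

def write_block(page: bytearray) -> bytes:
    """Write-out of a dirty block: compute and stamp the CRC (not WAL-logged)."""
    crc = zlib.crc32(bytes(page[CRC_LEN:]))
    page[CRC_OFF:CRC_OFF + CRC_LEN] = crc.to_bytes(CRC_LEN, "little")
    return bytes(page)

def read_block(raw: bytes, in_recovery: bool) -> bytes:
    """Read-in to shared buffers: verify the CRC, except in recovery, where
    the WAL full-page image was already CRC-checked and the on-disk CRC may
    be stale because of un-logged hint-bit writes."""
    if in_recovery:
        return raw
    stored = int.from_bytes(raw[CRC_OFF:CRC_OFF + CRC_LEN], "little")
    if stored == 0:
        return raw  # zeroed sentinel: block touched during a non-FPW recovery
    if stored != zlib.crc32(raw[CRC_LEN:]):
        raise IOError("block CRC mismatch: possible on-disk corruption")
    return raw
```

The same verification would run once more just before dirtying a cached block, per the last bullet; the recovery bypass is what lets the scheme avoid WAL-logging hint-bit-only changes.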