Re: [HACKERS] Checksums, state of play
On 7 March 2012 20:56, Bruce Momjian br...@momjian.us wrote: Yep, good summary. Giving ourselves a few months to think about it and consider other failure cases will make this a great 9.3 addition. Recent Intel processors that support SSE 4.2, including those based on the core microarchitecture, can calculate a CRC-32C in hardware using a higher level instruction, similar to the AES related instructions that Intel chips have had for some time now. Perhaps we should consider using hardware acceleration where available. Some interesting details are available from here: http://lwn.net/Articles/292984/ -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checksums, state of play
On 8 March 2012 20:55, Peter Geoghegan pe...@2ndquadrant.com wrote: On 7 March 2012 20:56, Bruce Momjian br...@momjian.us wrote: Yep, good summary. Giving ourselves a few months to think about it and consider other failure cases will make this a great 9.3 addition. Recent Intel processors that support SSE 4.2, including those based on the core microarchitecture, can calculate a CRC-32C in hardware using a higher level instruction, similar to the AES related instructions that Intel chips have had for some time now. Perhaps we should consider using hardware acceleration where available. Some interesting details are available from here: http://lwn.net/Articles/292984/ That's what Facebook did to speed up MySQL. -- Thom -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checksums, state of play
On Tue, Mar 6, 2012 at 2:27 PM, Bruce Momjian br...@momjian.us wrote: The feature is no where near complete, and we should not be designing features at this stage. I agree, on both counts. Although Simon did a good job pulling together something that basically works in a short amount of time, the edge cases still need a lot more thought, and work. Yesterday's discussion was mostly about turning the feature on and off, which certainly seems to be the most significant problem with the patch as it stands. But there are also a number of other things that have been discussed and not fully resolved, such as the performance impact of WAL-logging hint bit changes, the exact way we're going to sandwhich this into the page header, and the right way to handle the necessary buffer locking. I think all of those issues can be resolved but it's not going to happen in a day, and even once it does there will still be other, smaller things that need to be cleaned up here and there. Really measuring and fixing all of these issues will be a matter of months, not weeks. Simon seems to be proposing that, in lieu of spending too much more time fixing this, we just commit it and document the known limitations. I don't agree with that. In particular, I think the idea of committing a checksum patch that can produce false positives in the event of a torn page situation is a really bad idea. The whole point of the patch is to distinguish between hardware failure and software failure; if we can't reliably do that, I don't see this as being much of an advance over the status quo. I think we're going to find that the cost of WAL-logging hints is bad enough that people are only going to do it when they already suspect a problem and want confirmation. If they can't rely on that confirmation being real, as opposed to an outgrowth of a known limitation of the feature, I don't see the point. I'd much rather see this feature wait for 9.3 than ship something that's unreliable in this regard. So I think it's time to push this one out to 9.3. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checksums, state of play
On Wed, Mar 7, 2012 at 5:28 PM, Robert Haas robertmh...@gmail.com wrote: On Tue, Mar 6, 2012 at 2:27 PM, Bruce Momjian br...@momjian.us wrote: The feature is no where near complete, and we should not be designing features at this stage. I agree, on both counts. Although Simon did a good job pulling together something that basically works in a short amount of time, the edge cases still need a lot more thought, and work. Yesterday's discussion was mostly about turning the feature on and off, which certainly seems to be the most significant problem with the patch as it stands. But there are also a number of other things that have been discussed and not fully resolved, such as the performance impact of WAL-logging hint bit changes, I think saying there are some performance doubts doesn't actually make it so. We know it will impact performance, and how. I don't see any big mystery. It's not pretty, but thats what it is. Prepared transactions are slow, but they're still there. the exact way we're going to sand which this into the page header, I'm OK with either pd_tli or pg_pagesize_version. We just need 16 contiguous bytes. and the right way to handle the necessary buffer locking. That is resolved, AFAIK Simon seems to be proposing that, in lieu of spending too much more time fixing this, we just commit it and document the known limitations. I don't agree with that. Neither do I. It's pretty clear from our last discussion that the fix proposed doesn't actually work fully so I don't think its going to be either more robust or more certain to give low false positives. So I don't think more time fixing this will actually improve the situation. I'm not suggesting I skip any work, I think the extra work is pointless. I do understand the issue of risk that exists, I just think there's other ways of ameliorating that other than heaping more software onto the problem. In particular, I think the idea of committing a checksum patch that can produce false positives in the event of a torn page situation is a really bad idea. Again, that isn't a correct description of the issue, so yes, if that were the case I would agree it would be a really bad idea. If we keep misdescribing risk situations of course people will sense high risk and want to hold back any such patch. So we need to be balanced and accurate about the risks. Since people have been pretty negative about this patch for some time I'm not really surprised they now feel its a high risk decision to accept it. If I knew nothing, I would think that also based upon what has been said. The whole point of the patch is to distinguish between hardware failure and software failure; if we can't reliably do that, I don't see this as being much of an advance over the status quo. I think we're going to find that the cost of WAL-logging hints is bad enough that people are only going to do it when they already suspect a problem and want confirmation. If they can't rely on that confirmation being real, as opposed to an outgrowth of a known limitation of the feature, I don't see the point. I'd much rather see this feature wait for 9.3 than ship something that's unreliable in this regard. So I think it's time to push this one out to 9.3. I accept this and was not expecting to commit anything now. I think this decision was actually made quite some time previously and reviewing these points again is just a further waste of time at this point. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checksums, state of play
Robert Haas robertmh...@gmail.com writes: So I think it's time to push this one out to 9.3. Frankly, I thought it was probably too late for 9.2 when it was first proposed, and the number of issues that have been identified since then just confirm that feeling. There is another point beyond the ones in your summary: if we squeeze checksums into the page header without increasing its size, however we do it there will be no going back, because we will certainly have used up all the remaining wiggle room there. That means we'll be absolutely committed to this implementation, permanently (or at least until we invent the page format upgrade code that no one seems to want to build). I think we need to be a lot more certain that we're happy with the design and can't improve on it before we make that commitment. What I'd like to see is for Simon to keep hacking on the identified issues, with the hope of having a solid patch ready to commit early in the 9.3 cycle. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checksums, state of play
On Wed, Mar 7, 2012 at 3:09 PM, Simon Riggs si...@2ndquadrant.com wrote: Neither do I. It's pretty clear from our last discussion that the fix proposed doesn't actually work fully so I don't think its going to be either more robust or more certain to give low false positives. So I don't think more time fixing this will actually improve the situation. I hope that's not true, and I certainly don't think it's true. Like Tom, I'd like to see you keep working on this (or maybe someone else will pick it up) for 9.3. I agree that our most recent discussing left off with a somewhat depressing conclusion, but I don't think that means we should give up; I think it just means that we need a better idea than the ones we've had so far. I guess it's possible that there is no better idea out there, but I think it's more likely that we just haven't thought of it yet. I feel like we are close to unraveling it, and just not quite there yet. I might be wrong. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checksums, state of play
On Wed, Mar 07, 2012 at 03:33:34PM -0500, Robert Haas wrote: On Wed, Mar 7, 2012 at 3:09 PM, Simon Riggs si...@2ndquadrant.com wrote: Neither do I. It's pretty clear from our last discussion that the fix proposed doesn't actually work fully so I don't think its going to be either more robust or more certain to give low false positives. So I don't think more time fixing this will actually improve the situation. I hope that's not true, and I certainly don't think it's true. Like Tom, I'd like to see you keep working on this (or maybe someone else will pick it up) for 9.3. I agree that our most recent discussing left off with a somewhat depressing conclusion, but I don't think that means we should give up; I think it just means that we need a better idea than the ones we've had so far. I guess it's possible that there is no better idea out there, but I think it's more likely that we just haven't thought of it yet. I feel like we are close to unraveling it, and just not quite there yet. I might be wrong. Yep, good summary. Giving ourselves a few months to think about it and consider other failure cases will make this a great 9.3 addition. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checksums, state of play
On Mon, Mar 5, 2012 at 11:29 PM, Josh Berkus j...@agliodbs.com wrote: 3. Pages with checksums set need to have a version marking to show that they are a later version of the page layout. That version number needs to be extensible to many later versions. Pages of multiple versions need to exist within the server to allow simple upgrades and migration. This is a statement of a problem; do you have a proposed solution for it? Yes. (3) and (4) are in some ways related, so the solution was presented further down the page. (3) and (4) are in conflict with each other, but there is a solution. We mark the block with a version number, but we don't make the checking dependant upon the version number. We simply avoid making any checks until the command to scan all blocks is complete, per point (2). That way we need to use 1 flag bit to mark the new version and zero flag bits to indicate checks should happen. The proposed solution is extensible and currently would allow many new versions. Thank you for your comments. Are all the other aspects of the proposal acceptable, or are there other options anybody would like to explore? We've discussed a range of possible options, so the proposal can be varied as needed. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checksums, state of play
On Mon, Mar 5, 2012 at 10:03 AM, Simon Riggs si...@2ndquadrant.com wrote: 1. We don't need them because there will be something better in a later release. I don't think anybody disagrees that a better solution is possible in the future; doubts have been expressed as to what will be required and when that is likely to happen. Opinions differ. We can and should do something now unless there is reason not to. I don't think there's really a serious problem here. Everyone seems to agree that checksums are useful, and nobody even seems terribly unhappy with the fact that this is only a 16-bit checksum. On the flip side I think it's quite likely that we will never wish to support multiple checksum implementations, so for that reason I do believe that whatever we commit first in this area needs to be something we can live with for a very long time. It can have limitations of course, but we should be comfortable with the on-disk format and basic design of the feature. 2. Turning checksums on/off/on/off in rapid succession can cause false positive reports of checksum failure if crashes occur and are ignored. That may lead to the feature and PostgreSQL being held in disrepute. This I do think is a problem, although not for precisely the reason stated here. In my experience, in data corruption situations, the first thing customers do is blame PostgreSQL: they don't believe it's the hardware; they accuse us of having bugs in our code. Having a checksum feature would be valuable, because, first, we'd perhaps detect problems sooner and, second, people understand what checksums are and that checksum failures really shouldn't happen unless the hardware is bad. More generally, one of the purposes of checksums is to distinguish hardware failure from other possible causes of data corruption problems. If there are code paths where checksum failures can happy despite the hardware being good, I think that the patch will fail to accomplish its goal of giving us confidence that the hardware is bad. This can be resolved, if desired, by having having a two-stage enabling process where we must issue a command that scans every block in the database before checksum checking begins. VACUUM is easily modified to the task, we just need to agree that is suitable and agree syntax. A suggestion is VACUUM ENABLE CHECKSUMS; others are possible. If we're going to extend the syntax of VACUUM, we should use the extensible-options syntax there - i.e. something like VACUUM (CHECKSUMS ON). But overall I think this proposal lacks in detail. It seems to me that there are four states that a database (or table, if we made it more fine-grained) could be in: 1. No blocks have checksums. 2. Checksums are being added, but some blocks may not yet have them. Thus, it's not an error for a block to have no checksum, but if a block appears to have a checksum, it should be correct. All blocks are written with checksums. 3. Checksums are fully enabled. Checksum failures are still an error, and apparent lack of a checksum is now an error as well, since it must indicate something's been corrupted. All blocks are written with checksums. 4. Checksums are being removed, but some blocks may still have them. Thus, it's not an error for a block to have no checksum, but any still-remaining checksums should be correct (though possibly we ought not to complain if they aren't, to provide a recovery path for users who are turning checksums off because they're getting errors they don't want). Any block that's written is written without checksums. I think we need to be clear about how the system transitions between these states. In the current patch, AIUI, you can effectively go from 1-2 or 4-2 by setting page_checksums=on and from 2-4 by setting page_checksums=off, but there's no easy way to ensure that you've reached state 3 or that you've gotten back to state 1. Some variant of VACUUM seems like a good way of doing that, but it doesn't make sense for example to have page_checksums=off and do VACUUM (CHECKSUMS ON), or to have page_checksums=on and do VACUUM (CHECKSUMS OFF). I guess we could just reject those as error cases, but it would be nicer, I think, to have an interface with a higher degree of orthogonality. There's probably more than one way to do that, but my personal preference, as previously noted, is to make this a table-level option, rather than a GUC. Then, VACUUM (CHECKSUMS ON) can first change the pg_class entry to indicate that checksums are enabling-in-progress (i.e. 1-2), then scan the table, adding checksums, and then mark checksums as fully enabled (i.e. 2-3). VACUUM (CHECKSUMS OFF) can proceed in symmetric fashion, marking checksums as disabling-in-progress (3-4), then scanning the table and getting rid of them, and then marking them fully disabled (4-1). If a crash happens in the middle somewhere, the state of the table can get left as enabling-in-progress or disabling-in-progress, but a new VACUUM (CHECKSUMS
Re: [HACKERS] Checksums, state of play
On Tue, Mar 6, 2012 at 2:25 PM, Robert Haas robertmh...@gmail.com wrote: 4. Checksums are being removed, but some blocks may still have them. Thus, it's not an error for a block to have no checksum, but any still-remaining checksums should be correct (though possibly we ought not to complain if they aren't, to provide a recovery path for users who are turning checksums off because they're getting errors they don't want). Any block that's written is written without checksums. I agree its advantageous to have a means of removing pagesums from data blocks as a 4th state. I think we need a 5th state - pagesums disabled. Which allows an emergency disabling of the feature without rewriting the blocks. Obviously if the database is damaged and I/O devices going bad, trying to rescan database is likely to cause further problems. I think we need to be clear about how the system transitions between these states. In the current patch, AIUI, you can effectively go from 1-2 or 4-2 by setting page_checksums=on and from 2-4 by setting page_checksums=off, but there's no easy way to ensure that you've reached state 3 or that you've gotten back to state 1. Some variant of VACUUM seems like a good way of doing that, but it doesn't make sense for example to have page_checksums=off and do VACUUM (CHECKSUMS ON), or to have page_checksums=on and do VACUUM (CHECKSUMS OFF). I guess we could just reject those as error cases, but it would be nicer, I think, to have an interface with a higher degree of orthogonality. Right, a misunderstanding I think. If we have states set at database level then we'd not have a GUC as well. There's probably more than one way to do that, but my personal preference, as previously noted, is to make this a table-level option, rather than a GUC. Then, VACUUM (CHECKSUMS ON) can first change the pg_class entry to indicate that checksums are enabling-in-progress (i.e. 1-2), then scan the table, adding checksums, and then mark checksums as fully enabled (i.e. 2-3). VACUUM (CHECKSUMS OFF) can proceed in symmetric fashion, marking checksums as disabling-in-progress (3-4), then scanning the table and getting rid of them, and then marking them fully disabled (4-1). If a crash happens in the middle somewhere, the state of the table can get left as enabling-in-progress or disabling-in-progress, but a new VACUUM (CHECKSUMS X) can be used to finish the process, and we always know exactly where we're at. Any in-progress state needs to have checksums removed first, then re-added. I'll keep an open mind for now about database/table level. I'm not sure how possible/desirable each is. I generally agree with this outline, though I think that in lieu of a version number we could simply set a new pd_flags bit indicating that checksums are enabled. If we haven't fully enabled checksums yet, then the fact that this bit isn't set is not an error; but if checksums are fully enabled, then every page must have that bit set, and any page that doesn't is ipso facto corrupt. Whether to have it or not, if a corruption occurs during checksum-enabling then we could get a false reading. If we have a bit then the bit can be set wrong, so we could either make a check when it wasn't due, or skip a check we should have made. If we don't have a bit and so skip checksum checking during enabling process then we can get an error that isn't spotted by the checksum process. Given it can happen both ways, we should have a bit/ or not depending upon which is the least likely to be wrong. I would say having the bit provides the least likely way to get false readings. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checksums, state of play
On Tue, Mar 6, 2012 at 2:25 PM, Robert Haas robertmh...@gmail.com wrote: For the reasons stated above, I believe pd_tli is less useful than pd_pagesize_version. I fear that if we repurpose pd_pagesize_version, we're going to make things very difficult for people who want to write block-inspection tools, like pg_filedump or pageinspect. Right now, it's possible to look at that offset within the block and know for certain what page version you're dealing with. If we repurpose it to hold checksum data, that will no longer be possible. Unlike pd_tli, pd_pagesize_version is validated every time we read a block. We've not changed the page format in 5 years. I really can't see what the value of having a constant stored on every data block, especially since you're now saying that we *shouldn't* bump the constant for this change. Surely if we are keeping the pd_pagesize_version field its obvious that we should increment it? If not, why the insistence on keeping the field if we aren't using it for its stated purpose? Do you know of any PostgreSQL variant that can set this byte range to different values? If so, I'd suggest we just declare the field user defined or some such so that others can use it for different things as well and then use pd_tli. IMHO if we keep use pd_tli but pd_pagesize_version then we should increment it. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checksums, state of play
On Tue, Mar 6, 2012 at 3:31 PM, Simon Riggs si...@2ndquadrant.com wrote: I'll keep an open mind for now about database/table level. I'm not sure how possible/desirable each is. Table level sounds great, but how will it work with recovery? We don't have a relcache in Startup process. So either database or tablespace level seems doable. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checksums, state of play
On Tue, Mar 6, 2012 at 10:31 AM, Simon Riggs si...@2ndquadrant.com wrote: On Tue, Mar 6, 2012 at 2:25 PM, Robert Haas robertmh...@gmail.com wrote: 4. Checksums are being removed, but some blocks may still have them. Thus, it's not an error for a block to have no checksum, but any still-remaining checksums should be correct (though possibly we ought not to complain if they aren't, to provide a recovery path for users who are turning checksums off because they're getting errors they don't want). Any block that's written is written without checksums. I agree its advantageous to have a means of removing pagesums from data blocks as a 4th state. I think we need a 5th state - pagesums disabled. Which allows an emergency disabling of the feature without rewriting the blocks. Obviously if the database is damaged and I/O devices going bad, trying to rescan database is likely to cause further problems. Maybe we should consider making this a separate flag that controls what we do when we detect a checksum failure -- ERROR, LOG, or just count it in the stats -- rather than an extra state. I think we need to be clear about how the system transitions between these states. In the current patch, AIUI, you can effectively go from 1-2 or 4-2 by setting page_checksums=on and from 2-4 by setting page_checksums=off, but there's no easy way to ensure that you've reached state 3 or that you've gotten back to state 1. Some variant of VACUUM seems like a good way of doing that, but it doesn't make sense for example to have page_checksums=off and do VACUUM (CHECKSUMS ON), or to have page_checksums=on and do VACUUM (CHECKSUMS OFF). I guess we could just reject those as error cases, but it would be nicer, I think, to have an interface with a higher degree of orthogonality. Right, a misunderstanding I think. If we have states set at database level then we'd not have a GUC as well. OK. If the flag is set at database level rather than table or cluster level, then shared catalogs become a bit of a wart. Maybe managably so, but it needs thought, at the least. There's probably more than one way to do that, but my personal preference, as previously noted, is to make this a table-level option, rather than a GUC. Then, VACUUM (CHECKSUMS ON) can first change the pg_class entry to indicate that checksums are enabling-in-progress (i.e. 1-2), then scan the table, adding checksums, and then mark checksums as fully enabled (i.e. 2-3). VACUUM (CHECKSUMS OFF) can proceed in symmetric fashion, marking checksums as disabling-in-progress (3-4), then scanning the table and getting rid of them, and then marking them fully disabled (4-1). If a crash happens in the middle somewhere, the state of the table can get left as enabling-in-progress or disabling-in-progress, but a new VACUUM (CHECKSUMS X) can be used to finish the process, and we always know exactly where we're at. Any in-progress state needs to have checksums removed first, then re-added. I don't think that's necessarily true. If the user begins disabling checksums (i.e. they are in state #4) and they change their mind, they should be able to switch back to enabling mode. I'll keep an open mind for now about database/table level. I'm not sure how possible/desirable each is. I think that per-table is clearly more desirable, but I haven't studied how invasive the code changes are. I suspect it is manageable but I might be building castles in the air. I generally agree with this outline, though I think that in lieu of a version number we could simply set a new pd_flags bit indicating that checksums are enabled. If we haven't fully enabled checksums yet, then the fact that this bit isn't set is not an error; but if checksums are fully enabled, then every page must have that bit set, and any page that doesn't is ipso facto corrupt. Whether to have it or not, if a corruption occurs during checksum-enabling then we could get a false reading. That is true, and I think unavoidable. While checksums are being enabled, we have to assume that a not-checksummed page just hasn't been updated yet. There's no way to distinguish those cases. If we have a bit then the bit can be set wrong, so we could either make a check when it wasn't due, or skip a check we should have made. If we don't have a bit and so skip checksum checking during enabling process then we can get an error that isn't spotted by the checksum process. Given it can happen both ways, we should have a bit/ or not depending upon which is the least likely to be wrong. I would say having the bit provides the least likely way to get false readings. I agree that having a bit is good. It's not really necessary, but it makes it easier to understand the logic, and allows us to catch and distinguish error cases better. Once checksums are fully enabled, all checksums must match, bit or no bit. But it's easier to understand what the intermediate states look like with
Re: [HACKERS] Checksums, state of play
On Tue, Mar 6, 2012 at 10:40 AM, Simon Riggs si...@2ndquadrant.com wrote: On Tue, Mar 6, 2012 at 2:25 PM, Robert Haas robertmh...@gmail.com wrote: For the reasons stated above, I believe pd_tli is less useful than pd_pagesize_version. I fear that if we repurpose pd_pagesize_version, we're going to make things very difficult for people who want to write block-inspection tools, like pg_filedump or pageinspect. Right now, it's possible to look at that offset within the block and know for certain what page version you're dealing with. If we repurpose it to hold checksum data, that will no longer be possible. Unlike pd_tli, pd_pagesize_version is validated every time we read a block. We've not changed the page format in 5 years. I really can't see what the value of having a constant stored on every data block, especially since you're now saying that we *shouldn't* bump the constant for this change. Surely if we are keeping the pd_pagesize_version field its obvious that we should increment it? If not, why the insistence on keeping the field if we aren't using it for its stated purpose? Do you know of any PostgreSQL variant that can set this byte range to different values? If so, I'd suggest we just declare the field user defined or some such so that others can use it for different things as well and then use pd_tli. IMHO if we keep use pd_tli but pd_pagesize_version then we should increment it. The fact that we haven't changed the page format in 5 years is a good thing, and I hope that we won't change it very often because it will require whole-cluster rewrites to take full advantage of whatever features are made available by the version bump, which is darn painful. However, I'm pretty sure that eventually we're going to want to bump it. Aside from checksums, the most imminent thing I can think of that might cause us to do that is this idea regarding XID wraparound: http://archives.postgresql.org/message-id/4f2fa541.8040...@enterprisedb.com However, even if we don't do that or we find some way to do it without bumping the page version, it seems likely to me that something else will come up eventually. The size of our page header doesn't thrill me, but the one byte we've allocated to storing the version is only a minor contributor and pretty cheap insurance against future needs. As to whether we should increment pd_pagesize_version, I'm not sure quite what you were saying about that (I think you may have an extra or missing word there), but I don't think it's necessary here. I believe we feel free to assign new flag bits without bumping the page size version, so we could define PD_HAS_CHECKSUM without doing so. Maybe your point is that we're changing the meaning of pd_tli and it seems ugly to do that without the bumping the page version, but I guess my point is that we're not changing it incompatibly. We really only need to bump the page version for changes where a newer version of the server would otherwise misinterpret an older page, which isn't a problem in this case because pd_tli is basically dead already. And, on a more practical level, Tom argued on the other thread that if we have a page upgrade facility, then we ought to store the minimum page version for every relation in a pg_class column, so we can keep track of when all pages of the older format are gone. That's infrastructure that this patch doesn't really need, and we can avoid having to build it by steering clear of the page versioning issue altogether, viewing this instead as an enhancement of the existing page format that doesn't break compatibility with older releases. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checksums, state of play
On Tue, Mar 6, 2012 at 4:06 PM, Simon Riggs si...@2ndquadrant.com wrote: On Tue, Mar 6, 2012 at 3:31 PM, Simon Riggs si...@2ndquadrant.com wrote: I'll keep an open mind for now about database/table level. I'm not sure how possible/desirable each is. Table level sounds great, but how will it work with recovery? We don't have a relcache in Startup process. So either database or tablespace level seems doable. Even db or ts level is problematic. Options (1) Recovery ignores checksums until db in consistent state (2) Recovery ignores checksums until all databases are enabled, when we set flag in pg_control (3) Recovery checks blocks marked as having a checksum, no matter the overall state -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checksums, state of play
On Tue, Mar 6, 2012 at 4:42 PM, Robert Haas robertmh...@gmail.com wrote: As to whether we should increment pd_pagesize_version, I'm not sure quite what you were saying about that (I think you may have an extra or missing word there), but I don't think it's necessary here. I said this... On Tue, Mar 6, 2012 at 3:40 PM, Simon Riggs si...@2ndquadrant.com wrote: Do you know of any PostgreSQL variant that can set this byte range to different values? Not sure what the missing word is there, so I'll ask again. Has EDB or anybody else you know of has used the pd_pagesize_version field for something else, so you'd rather I didn't touch that? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checksums, state of play
On Tue, Mar 6, 2012 at 12:00 PM, Simon Riggs si...@2ndquadrant.com wrote: On Tue, Mar 6, 2012 at 4:42 PM, Robert Haas robertmh...@gmail.com wrote: As to whether we should increment pd_pagesize_version, I'm not sure quite what you were saying about that (I think you may have an extra or missing word there), but I don't think it's necessary here. I said this... On Tue, Mar 6, 2012 at 3:40 PM, Simon Riggs si...@2ndquadrant.com wrote: Do you know of any PostgreSQL variant that can set this byte range to different values? Not sure what the missing word is there, so I'll ask again. Has EDB or anybody else you know of has used the pd_pagesize_version field for something else, so you'd rather I didn't touch that? To my knowledge, EDB has not ever stamped pages with a different pd_pagesize_version, or reused that field for any other purpose, but I'm not sure about Greenplum or others. I have seen at least one pg_filedump output where every page was stamped with the same very odd value there; I don't remember what the value was, and I don't know where the output I was sent came from, but it was enough to make me wonder if somebody's done that. That's not why I want to leave that field alone, though: I want to leave that field alone for backward and forward compatibility, so that any version of community PostgreSQL ever released - and any page inspection tools, current or future - can look at the low-order byte of that field and know what page version they've got. If we didn't have some other bytes in the page header that seem relatively useless (like pd_tli and the high bits of pd_pagesize_version), I'd be arguing for extending the page header rather than clobbering the version number. It just seems to me that the page version number is absolutely the most basic piece of information on the page, which you must have before you can interpret the rest of the page contents. For it to have any value in identifying past page format changes, future page format changes, or format changes by forks of our main code base, it's got to be present in every release and have the same meaning in each one. If we do a release where 2 can mean the checksum is 2 rather than the page was written by PostgreSQL 8.0 then we've forever lost the ability to decide, without the use of heuristics, what kind of page we've got. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checksums, state of play
On 06.03.2012 19:00, Simon Riggs wrote: On Tue, Mar 6, 2012 at 4:42 PM, Robert Haasrobertmh...@gmail.com wrote: As to whether we should increment pd_pagesize_version, I'm not sure quite what you were saying about that (I think you may have an extra or missing word there), but I don't think it's necessary here. I said this... On Tue, Mar 6, 2012 at 3:40 PM, Simon Riggssi...@2ndquadrant.com wrote: Do you know of any PostgreSQL variant that can set this byte range to different values? Not sure what the missing word is there, so I'll ask again. Has EDB or anybody else you know of has used the pd_pagesize_version field for something else, so you'd rather I didn't touch that? The EDB page format is exactly the same as the community one. Thanks for asking. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checksums, state of play
On Tue, Mar 6, 2012 at 11:50 AM, Simon Riggs si...@2ndquadrant.com wrote: On Tue, Mar 6, 2012 at 4:06 PM, Simon Riggs si...@2ndquadrant.com wrote: On Tue, Mar 6, 2012 at 3:31 PM, Simon Riggs si...@2ndquadrant.com wrote: I'll keep an open mind for now about database/table level. I'm not sure how possible/desirable each is. Table level sounds great, but how will it work with recovery? We don't have a relcache in Startup process. So either database or tablespace level seems doable. Even db or ts level is problematic. Options (1) Recovery ignores checksums until db in consistent state (2) Recovery ignores checksums until all databases are enabled, when we set flag in pg_control (3) Recovery checks blocks marked as having a checksum, no matter the overall state How about combining #1 and #3? If the database isn't consistent yet (and thus we can't look at pg_database) then we rely on the blocks themselves to tell us whether they have checksums. Once we reach consistency we can do better. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checksums, state of play
On Tue, Mar 6, 2012 at 5:14 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Has EDB or anybody else you know of has used the pd_pagesize_version field for something else, so you'd rather I didn't touch that? The EDB page format is exactly the same as the community one. Thanks for asking. No problem, anytime. Just wanted to make sure downstream wasn't affected. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checksums, state of play
On Tue, Mar 6, 2012 at 5:14 PM, Robert Haas robertmh...@gmail.com wrote: Options (1) Recovery ignores checksums until db in consistent state (2) Recovery ignores checksums until all databases are enabled, when we set flag in pg_control (3) Recovery checks blocks marked as having a checksum, no matter the overall state How about combining #1 and #3? If the database isn't consistent yet (and thus we can't look at pg_database) then we rely on the blocks themselves to tell us whether they have checksums. Once we reach consistency we can do better. We can change state then, but to what? We don't have a relcache. Maybe that puts us back at Square #1. Will think -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checksums, state of play
On Tue, Mar 6, 2012 at 12:23 PM, Simon Riggs si...@2ndquadrant.com wrote: On Tue, Mar 6, 2012 at 5:14 PM, Robert Haas robertmh...@gmail.com wrote: Options (1) Recovery ignores checksums until db in consistent state (2) Recovery ignores checksums until all databases are enabled, when we set flag in pg_control (3) Recovery checks blocks marked as having a checksum, no matter the overall state How about combining #1 and #3? If the database isn't consistent yet (and thus we can't look at pg_database) then we rely on the blocks themselves to tell us whether they have checksums. Once we reach consistency we can do better. We can change state then, but to what? We don't have a relcache. If the state is per-database or per-tablespace, you can read pg_database at that point and see what the flag says. If it's per-relation, then I agree: you still don't have enough information. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checksums, state of play
On Tue, Mar 06, 2012 at 09:25:17AM -0500, Robert Haas wrote: 2. Turning checksums on/off/on/off in rapid succession can cause false positive reports of checksum failure if crashes occur and are ignored. That may lead to the feature and PostgreSQL being held in disrepute. This I do think is a problem, although not for precisely the reason stated here. In my experience, in data corruption situations, the first thing customers do is blame PostgreSQL: they don't believe it's the hardware; they accuse us of having bugs in our code. Having a checksum feature would be valuable, because, first, we'd perhaps detect problems sooner and, second, people understand what checksums are and that checksum failures really shouldn't happen unless the hardware is bad. More generally, one of the purposes of checksums is to distinguish hardware failure from other possible causes of data corruption problems. If there are code paths where checksum failures can happy despite the hardware being good, I think that the patch will fail to accomplish its goal of giving us confidence that the hardware is bad. I think the turning checksums on/off/on/off is really a killer problem, and obviously many of the actions needed to make it safe make the checksum feature itself less useful. One crazy idea would be to have a checksum _version_ number somewhere on the page and in pg_controldata. When you turn on checksums, you increment that value, and all new checksum pages get that checksum version; if you turn off checksums, we just don't check them anymore, but they might get incorrect due to a hint bit write and a crash. When you turn on checksums again, you increment the checksum version again, and only check pages having the _new_ checksum version. Yes, this does add additional storage requirements for the checksum, but I don't see another clean option. If you can spare one byte, that gives you 255 times to turn on checksums; after that, you have to dump/reload to use the checksum feature. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checksums, state of play
On Mon, Mar 05, 2012 at 03:03:18PM +, Simon Riggs wrote: To avoid any confusion as to where this proposed feature is now, I'd like to summarise my understanding, make proposals and also request clear feedback on them. Checksums have a number of objections to them outstanding. 1. We don't need them because there will be something better in a later release. I don't think anybody disagrees that a better solution is possible in the future; doubts have been expressed as to what will be required and when that is likely to happen. Opinions differ. We can and should do something now unless there is reason not to. Obviously, one reason not to do this now is that we are way past time to be designing any feature. As much as I like how it has progressed and how it handles pg_upgrade issues, I don't think anyone can state that this feature is ready to go, and considering how far we are into the last commit-fest, I think we can fairly say this patch has gotten good review and return it with feedback. We can keep discussing it (and I just posted some ideas myself), but I don't think we can any longer pretend that this is going into Postgres 9.2. 2. Turning checksums on/off/on/off in rapid succession can cause false positive reports of checksum failure if crashes occur and are ignored. That may lead to the feature and PostgreSQL being held in disrepute. This can be resolved, if desired, by having having a two-stage enabling process where we must issue a command that scans every block in the database before checksum checking begins. VACUUM is easily modified to the task, we just need to agree that is suitable and agree syntax. A suggestion is VACUUM ENABLE CHECKSUMS; others are possible. There is no question that if this feature is done badly, it will make Postgres look bad, and no one wants that. As nice as this features is, making Postgres look bad can't justify it as it currently sits. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checksums, state of play
On Tue, Mar 6, 2012 at 5:50 PM, Bruce Momjian br...@momjian.us wrote: One crazy idea would be to have a checksum _version_ number somewhere on the page and in pg_controldata. When you turn on checksums, you increment that value, and all new checksum pages get that checksum version; if you turn off checksums, we just don't check them anymore, but they might get incorrect due to a hint bit write and a crash. When you turn on checksums again, you increment the checksum version again, and only check pages having the _new_ checksum version. Yes, this does add additional storage requirements for the checksum, but I don't see another clean option. If you can spare one byte, that gives you 255 times to turn on checksums; after that, you have to dump/reload to use the checksum feature. I like the idea very much actually. But I'll let you argue the case for using pd_pagesize_version for that with your esteemed colleagues. It would be pretty safe to just let it wrap. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checksums, state of play
Robert Haas robertmh...@gmail.com writes: That's not why I want to leave that field alone, though: I want to leave that field alone for backward and forward compatibility, so that any version of community PostgreSQL ever released - and any page inspection tools, current or future - can look at the low-order byte of that field and know what page version they've got. I've not been following this thread very closely, but FWIW I find the above argument extremely compelling. We could get away with relocating the version identifier in the narrow context of an upgrade from PG 9.x to 9.y, but the side effects for external tools such as pg_filedump would be disastrous. (And yeah, as maintainer for pg_filedump I'm rather biased.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checksums, state of play
On Tue, Mar 06, 2012 at 06:00:13PM +, Simon Riggs wrote: On Tue, Mar 6, 2012 at 5:50 PM, Bruce Momjian br...@momjian.us wrote: One crazy idea would be to have a checksum _version_ number somewhere on the page and in pg_controldata. When you turn on checksums, you increment that value, and all new checksum pages get that checksum version; if you turn off checksums, we just don't check them anymore, but they might get incorrect due to a hint bit write and a crash. When you turn on checksums again, you increment the checksum version again, and only check pages having the _new_ checksum version. Yes, this does add additional storage requirements for the checksum, but I don't see another clean option. If you can spare one byte, that gives you 255 times to turn on checksums; after that, you have to dump/reload to use the checksum feature. I like the idea very much actually. But I'll let you argue the case for using pd_pagesize_version for that with your esteemed colleagues. It would be pretty safe to just let it wrap. How would we know there are not old unwritten pages sitting around? -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checksums, state of play
On Tue, Mar 6, 2012 at 5:56 PM, Bruce Momjian br...@momjian.us wrote: On Mon, Mar 05, 2012 at 03:03:18PM +, Simon Riggs wrote: To avoid any confusion as to where this proposed feature is now, I'd like to summarise my understanding, make proposals and also request clear feedback on them. Checksums have a number of objections to them outstanding. 1. We don't need them because there will be something better in a later release. I don't think anybody disagrees that a better solution is possible in the future; doubts have been expressed as to what will be required and when that is likely to happen. Opinions differ. We can and should do something now unless there is reason not to. Obviously, one reason not to do this now is that we are way past time to be designing any feature. As much as I like how it has progressed and how it handles pg_upgrade issues, I don't think anyone can state that this feature is ready to go, and considering how far we are into the last commit-fest, I think we can fairly say this patch has gotten good review and return it with feedback. We can keep discussing it (and I just posted some ideas myself), but I don't think we can any longer pretend that this is going into Postgres 9.2. Are you calling time on all patches in this CF, or just this one? From what you say, this seems to be the reason for your thinking: On Tue, Mar 6, 2012 at 5:50 PM, Bruce Momjian br...@momjian.us wrote: I think the turning checksums on/off/on/off is really a killer problem, and obviously many of the actions needed to make it safe make the checksum feature itself less useful. The problem is actually on/off/crash/on in quick succession which is much less likely. In the current patch that can be resolved by running a VACUUM to remove checksums if you want to turn them off completely and safely. That is easily documented as a procedure for people to follow, like creating types or setting up replication. The resolution suggested to that problem is to force a VACUUM to turn on or turn off. There's very little difference between those two things other than us forcing the user, which as you point out, being forced to do that could be worse than the problem we're trying to solve. From this discussion, ISTM that the force route has sufficient complexity that we could easily make it less robust and much more likely to receive criticism. You can break replication by doing on/off/on as well, for example. It's simply not a killer issue. Not if you have a reasonable grading of issues. It's a possibility of a usage annoyance, much less annoying than having an XML type that doesn't actually allow indexes without knowledge of C, a hash index that doesn't ever recover or an unlogged table whose data suddenly disappears after a crash, for example. ISTM pretty reasonable to document the issue clearly, inform people how it works and let them choose whether to use it or not. Stopping features from going out because some people might misuse them isn't a great plan. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checksums, state of play
On Tue, Mar 06, 2012 at 01:52:31PM -0500, Bruce Momjian wrote: On Tue, Mar 06, 2012 at 06:00:13PM +, Simon Riggs wrote: On Tue, Mar 6, 2012 at 5:50 PM, Bruce Momjian br...@momjian.us wrote: One crazy idea would be to have a checksum _version_ number somewhere on the page and in pg_controldata. When you turn on checksums, you increment that value, and all new checksum pages get that checksum version; if you turn off checksums, we just don't check them anymore, but they might get incorrect due to a hint bit write and a crash. When you turn on checksums again, you increment the checksum version again, and only check pages having the _new_ checksum version. Yes, this does add additional storage requirements for the checksum, but I don't see another clean option. If you can spare one byte, that gives you 255 times to turn on checksums; after that, you have to dump/reload to use the checksum feature. I like the idea very much actually. But I'll let you argue the case for using pd_pagesize_version for that with your esteemed colleagues. It would be pretty safe to just let it wrap. How would we know there are not old unwritten pages sitting around? Perhaps a full xid wrap-around would allow us to re-use checksum counters. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checksums, state of play
On Tue, Mar 06, 2012 at 07:09:23PM +, Simon Riggs wrote: On Tue, Mar 6, 2012 at 5:56 PM, Bruce Momjian br...@momjian.us wrote: On Mon, Mar 05, 2012 at 03:03:18PM +, Simon Riggs wrote: To avoid any confusion as to where this proposed feature is now, I'd like to summarise my understanding, make proposals and also request clear feedback on them. Checksums have a number of objections to them outstanding. 1. We don't need them because there will be something better in a later release. I don't think anybody disagrees that a better solution is possible in the future; doubts have been expressed as to what will be required and when that is likely to happen. Opinions differ. We can and should do something now unless there is reason not to. Obviously, one reason not to do this now is that we are way past time to be designing any feature. As much as I like how it has progressed and how it handles pg_upgrade issues, I don't think anyone can state that this feature is ready to go, and considering how far we are into the last commit-fest, I think we can fairly say this patch has gotten good review and return it with feedback. We can keep discussing it (and I just posted some ideas myself), but I don't think we can any longer pretend that this is going into Postgres 9.2. Are you calling time on all patches in this CF, or just this one? From what you say, this seems to be the reason for your thinking: I am calling time on any CF patch that requires redesign. I think the turning checksums on/off/on/off is really a killer problem, and obviously many of the actions needed to make it safe make the checksum feature itself less useful. The problem is actually on/off/crash/on in quick succession which is much less likely. True. In the current patch that can be resolved by running a VACUUM to remove checksums if you want to turn them off completely and safely. That is easily documented as a procedure for people to follow, like creating types or setting up replication. The resolution suggested to that problem is to force a VACUUM to turn on or turn off. There's very little difference between those two things other than us forcing the user, which as you point out, being forced to do that could be worse than the problem we're trying to solve. From this discussion, ISTM that the force route has sufficient complexity that we could easily make it less robust and much more likely to receive criticism. You can break replication by doing on/off/on as well, for example. It's simply not a killer issue. Not if you have a reasonable grading of issues. It's a possibility of a usage annoyance, much less annoying than having an XML type that doesn't actually allow indexes without knowledge of C, a hash index that doesn't ever recover or an unlogged table whose data suddenly disappears after a crash, for example. ISTM pretty reasonable to document the issue clearly, inform people how it works and let them choose whether to use it or not. Stopping features from going out because some people might misuse them isn't a great plan. The feature is no where near complete, and we should not be designing features at this stage. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checksums, state of play
On Tue, Mar 6, 2012 at 12:50 PM, Bruce Momjian br...@momjian.us wrote: I think the turning checksums on/off/on/off is really a killer problem, and obviously many of the actions needed to make it safe make the checksum feature itself less useful. One crazy idea would be to have a checksum _version_ number somewhere on the page and in pg_controldata. When you turn on checksums, you increment that value, and all new checksum pages get that checksum version; if you turn off checksums, we just don't check them anymore, but they might get incorrect due to a hint bit write and a crash. When you turn on checksums again, you increment the checksum version again, and only check pages having the _new_ checksum version. Yes, this does add additional storage requirements for the checksum, but I don't see another clean option. If you can spare one byte, that gives you 255 times to turn on checksums; after that, you have to dump/reload to use the checksum feature. I don't see what problem that solves. It's just taking the problem we already have and a new failure mode (out of checksum versions) on top of it. If you see a page with checksum version 153 and the current checksum version is 152, then you know that either (a) it is the result of a previous iteration of turning checksums on or (b) the checksum version number got corrupted. This is the exact same problem we have with using a PD_HAS_CHECKSUM bit. If the bit is not set, then you know that either (a) it hasn't been checksummed yet or (b) the bit got corrupted. In either case, a single poorly placed bit-flip gives rise to the exact same confusion. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checksums, state of play
On Tue, Mar 06, 2012 at 07:09:23PM +, Simon Riggs wrote: The problem is actually on/off/crash/on in quick succession which is much less likely. I must be missing something, but how about: if (!has_checksums page_loses_checksum_due_to_hint_bit_write) wal_log_the_hint_bit_write(); Greetings Marcin Mańk -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Checksums, state of play
To avoid any confusion as to where this proposed feature is now, I'd like to summarise my understanding, make proposals and also request clear feedback on them. Checksums have a number of objections to them outstanding. 1. We don't need them because there will be something better in a later release. I don't think anybody disagrees that a better solution is possible in the future; doubts have been expressed as to what will be required and when that is likely to happen. Opinions differ. We can and should do something now unless there is reason not to. 2. Turning checksums on/off/on/off in rapid succession can cause false positive reports of checksum failure if crashes occur and are ignored. That may lead to the feature and PostgreSQL being held in disrepute. This can be resolved, if desired, by having having a two-stage enabling process where we must issue a command that scans every block in the database before checksum checking begins. VACUUM is easily modified to the task, we just need to agree that is suitable and agree syntax. A suggestion is VACUUM ENABLE CHECKSUMS; others are possible. 3. Pages with checksums set need to have a version marking to show that they are a later version of the page layout. That version number needs to be extensible to many later versions. Pages of multiple versions need to exist within the server to allow simple upgrades and migration. 4. Checksums that are dependent upon a bit setting on the block are somewhat fragile. Requests have been made to add bits in certain positions and also to remove them again. No set of bits seems to please everyone. (3) and (4) are in conflict with each other, but there is a solution. We mark the block with a version number, but we don't make the checking dependant upon the version number. We simply avoid making any checks until the command to scan all blocks is complete, per point (2). That way we need to use 1 flag bit to mark the new version and zero flag bits to indicate checks should happen. (Various other permutations of solutions for (2), (3), (4) have been discussed and may also be still open) 5. The part of the page header that can be used as a checksum has been disputed. Using the 16 bits dedicated to a version number seems like the least useful consecutive 2 bytes of data in the page header. It can't be 16 bits because that wouldn't be an effective checksum for database blocks. We might prefer 32 bits, but that would require use of some other parts of the page header and possibly split that into two parts. Splitting the checksum into 2 parts will cause the code to be more complex and fragile. 6. Performance impacts. Measured to be a small regression. 7. Hint bit setting requires WAL logging. The worst case for that would be setting hints on newly loaded tables. Work has been done on other patches to remove that case. If those don't fly, this would be a cost paid by those that wish to take advantage of this feature. If there are other points I've missed for whatever reason, please add them here again for clarity. My own assessment of the above is that the checksum feature can be added to 9.2, as long as we agree the changes above and then proceed to implement them and also that no further serious problems emerge. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checksums, state of play
3. Pages with checksums set need to have a version marking to show that they are a later version of the page layout. That version number needs to be extensible to many later versions. Pages of multiple versions need to exist within the server to allow simple upgrades and migration. This is a statement of a problem; do you have a proposed solution for it? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers