Re: [zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?
[slightly different angle below...] Nathan Kroenert wrote: Hey, Bob, Though I have already got the answer I was looking for here, I thought I'd at least take the time to provide my point of view as to my *why*...

First: I don't think any of us have forgotten the goodness that ZFS's checksum *can* bring. I'm also keenly aware that we have some customers running HDS / EMC boxes who disable the ZFS checksum by default because they 'don't want to have files break due to a single bit flip...' and they really don't care where the flip happens, and they don't want to 'waste' disks or bandwidth allowing ZFS to do its own protection when they already pay for it inside their zillion-dollar disk box. (Some say waste, some call it insurance... ;). Oracle users in particular seem to have this mindset, though that's another thread entirely. :)

If you look at the zfs-discuss archives, you will find anecdotes of failing RAID arrays (yes, even expensive ones) and SAN switches causing corruption which was detected by ZFS. A telltale sign of borken hardware is someone complaining that ZFS checksums are borken, only to find out their hardware is at fault. As for Oracle, modern releases of the Oracle database also have checksumming enabled by default, so there is some merit to the argument that ZFS checksums are redundant. IMNSHO, ZFS is not being designed to replace ASM.

I'd suspect we don't hear people whining about single bit flips because they would not know it's happening unless the app sitting on top had its own protection. Or if the error is obvious, or crashes their system... Or if they were running ZFS; but at this stage, we cannot delineate between single-bit and massively crapped-out errors, so what's to say we are NOT seeing it? Also: don't assume bit rot on disk is the only way we can get single-bit errors.
Considering that until very recently (and quite likely even now, to a reasonable extent), most CPUs did not have data protection in *every* place data transited through, single bit flips are still a very real possibility, and becoming more likely as process shrinks continue. Granted, on CPUs with register parity protection, undetected doubles are more likely to 'slip under the radar', as registers are typically protected with parity at best, if at all... A single-bit error in a parity-protected register will be detected; a double won't.

It depends on the processor. Most of the modern SPARC processors have extensive error detection and correction inside. But processors are still different from memories in that the time a datum resides in a single location is quite short. We worry more about random data loss when the datum is stored in one place for a long time, which is why you see different sorts of data protection at the different layers of a system design. To put this in more mathematical terms, there is a failure rate for each failure mode, but your exposure to each failure mode is time-bounded.

It does seem that some of us are getting a little caught up in disks and their magnificence in what they write to the platter and read back, and overlooking the potential value of a simple (though potentially computationally expensive) circus trick, which might, just might, make your broken 1TB archive useful again... I don't think it's a good idea for us to assume that it's OK to 'leave out' potential goodness for the masses that want to use ZFS in non-enterprise environments like laptops / home PCs, or use commodity components in conjunction with the Big Stuff... (like white-box PCs connected to an EMC or HDS box...) Anyhoo, I'm glad we have pretty much already done this work once before. It gives me hope that we'll see it make a comeback.
;) (And I look forward to Jeff Co developing a hyper-cool way of generating 12800 checksums using all 64 threads of a Niagara 2, using the same source data in cache, so we don't need to hit memory, so that it happens in the blink of an eye. Or two. OK, maybe three... ;) Maybe we could also use the SPUs as well... OK, so I'm possibly dreaming here, but hell, if I'm dreaming, why not dream big. :)

I sense that the requested behaviour here is to be able to get to the corrupted contents of a file, even if we know it is corrupted. I think this is a good idea because:
1. The block is what is corrupted, not necessarily my file. A single block may contain several files which are grouped together, checksummed, and written to disk.
2. The current behaviour of returning EIO when read()ing a file up to the (possible) corruption point is rather irritating, but probably the right thing to do.
Since we know the files affected, we could write a savior, provided we can get some reasonable response other than EIO. As Jeff points out, I'm not sure that automatic repair is the right answer, but a manual savior might work better than
Re: [zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?
On Tue, 4 Mar 2008, Richard Elling wrote: Also note: the checksums don't have enough information to recreate the data for very many bit changes. Hashes might, but I don't know anyone using sha256.

It is indeed important to recognize that the checksums are a way to detect that the data is incorrect rather than a way to tell that the data is correct. There may be several permutations of wrong data which can result in the same checksum, but the probability of encountering those permutations due to natural causes is quite small.

Bob
==
Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___
zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
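Bob's point, that a checksum can only show data is *incorrect* and never prove it correct, is easy to illustrate with a deliberately weak checksum. The toy `weak_sum` below is my own illustration, not anything ZFS actually uses (ZFS uses fletcher2/4 or sha256):

```python
def weak_sum(data: bytes) -> int:
    # Toy additive checksum: it is order-insensitive, so many
    # distinct inputs collide on the same value.
    return sum(data) % 65536

a = bytes([1, 2, 3])
b = bytes([3, 2, 1])                 # different data...
assert a != b
assert weak_sum(a) == weak_sum(b)    # ...same checksum
```

Real checksums make such collisions astronomically unlikely for natural corruption, but the principle stands: a matching checksum is strong evidence, not proof, that the data is correct.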
Re: [zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?
Jeff Bonwick wrote: All that said, I'm still occasionally tempted to bring it back. It may become more relevant with flash memory as a storage medium.

Would it be worth considering bringing it back as part of zdb rather than part of the core zio layer? -- Darren J Moffat
Re: [zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?
Darren J Moffat wrote: Jeff Bonwick wrote: All that said, I'm still occasionally tempted to bring it back. It may become more relevant with flash memory as a storage medium. Would it be worth considering bringing it back as part of zdb rather than part of the core zio layer?

I'm not convinced that single bit flips are the common failure mode for disks. Most enterprise-class disks already have enough ECC to correct at least 8 bytes per block. By the time the disk sends something back that it couldn't correct, there is no telling how many bits have been flipped, but I'll bet a steak dinner it is more than one. There may be some benefit for path failures, but I've not seen any measured data on those failure modes. For paths which have framing checksums, we would expect errors to be detected there. -- richard
Re: [zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?
On Mon, 3 Mar 2008, Darren J Moffat wrote: I'm not convinced that single bit flips are the common failure mode for disks. Most enterprise-class disks already have enough ECC to correct at least 8 bytes per block. And for consumer rather than enterprise-class disks?

You are assuming that the ECC used for consumer disks is substantially different from that used for enterprise disks. That is likely not the case, since ECC is provided by a chip which costs a few dollars. The only reason to use a lesser-grade algorithm would be to save a small amount of storage space. Consumer disks use essentially the same media as enterprise disks. Consumer disks store a higher bit density on similar media. Consumer disks have less precise/consistent head controllers than enterprise disks. Consumer disks are less well-specified than enterprise disks. Due to the higher bit density we can expect more wrong bits to be read, since we are pushing the media harder. Due to less consistent head controllers we can expect more incidences of reading or writing the wrong track, or writing something which can't be read. Consumer disks are often used in an environment where they may be physically disturbed while they are writing or reading data. Enterprise disks are usually used in very stable environments. The upshot of this is that we can expect more unrecoverable errors, but it seems unlikely that there will be more single-bit errors recoverable at the ZFS level.

Bob
Re: [zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?
Bob Friesenhahn wrote: On Mon, 3 Mar 2008, Darren J Moffat wrote: [...] The upshot of this is that we can expect more unrecoverable errors, but it seems unlikely that there will be more single-bit errors recoverable at the ZFS level.

I agree, and am waiting to get the proceedings from FAST08, which has some interesting papers in the list. A while back I blogged about an Adaptec online seminar which addressed this topic. Rather than repeating what they said, I left a pointer and a recommendation. http://blogs.sun.com/relling/entry/adaptec_webinar_on_disks_and Also, note that the published reliability data from disk vendors is constantly changing. For laptop drives, we're seeing less MTBF or UER and more head-landing specs.
It seems that an important failure mode for laptop disks is wear-out at the landing site. This is due to power management powering off or spinning down the disk. We don't tend to see this failure mode in servers or RAID arrays. -- richard
Re: [zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?
Bob Friesenhahn wrote: On Tue, 4 Mar 2008, Nathan Kroenert wrote: It does seem that some of us are getting a little caught up in disks and their magnificence in what they write to the platter and read back, and overlooking the potential value of a simple (though potentially computationally expensive) circus trick, which might, just might, make your broken 1TB archive useful again... The circus trick can be handled via a user-contributed utility. In fact, people can compete with their various repair utilities. There are only 1048576 1-bit permutations to try, and then the various two-bit permutations can be tried.

That does not sound 'easy', and I consider that ZFS should be... :) and IMO it's something that should really be built in, not attacked with an add-on. I had (as did Jeff in his initial response) considered that we only need to actually try to flip 128KB worth of bits once... That many flips means that we are in a way 'processing' some 128GB in the worst case when regenerating checksums. Internal to a CPU, depending on cache aliasing, competing workloads, threadedness, etc., this could be dramatically variable... something I guess the ZFS team would want to keep out of 'standard' filesystem operation... hm. :\

I don't think it's a good idea for us to assume that it's OK to 'leave out' potential goodness for the masses that want to use ZFS in non-enterprise environments like laptops / home PCs, or use commodity components in conjunction with the Big Stuff... (like white-box PCs connected to an EMC or HDS box...)

It seems that goodness for the masses has not been left out. The forthcoming ability to request duplicate ZFS blocks is very good news indeed. We are entering an age where the entry-level SATA disk is 1TB and users have more space than they know what to do with. A little replication gives these users something useful to do with their new disk while avoiding the need for unreliable circus tricks to recover data.
ZFS goes far beyond MS-DOS's recover command (which should have been called destroy). I never have enough space on my laptop... I guess I'm a freak. But I am sure that we are *both* right for some subsets of ZFS users, and that the more choice we have built into the filesystem, the better. Thanks again for the comments! Nathan.
Re: [zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?
Nathan Kroenert [EMAIL PROTECTED] writes: Bob Friesenhahn wrote: [...] I had (as did Jeff in his initial response) considered that we only need to actually try to flip 128KB worth of bits once... That many flips means that we are in a way 'processing' some 128GB in the worst case when regenerating checksums. Internal to a CPU, depending on cache aliasing, competing workloads, threadedness, etc., this could be dramatically variable... something I guess the ZFS team would want to keep out of 'standard' filesystem operation... hm. :\

Maybe an option to scrub... something that says work on bitflips for bad blocks, or work on bitflips for bad blocks in this file. Boyd
Re: [zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?
Hey, Bob My perspective on big reasons for it *to* be integrated would be:
- It's tested, by the folks charged with making ZFS good
- It's kept in sync with the differing zpool versions
- It's documented
- When the system *is* patched, any changes the patch brings are synced with the recovery mechanism
- Being integrated, it has options that can be persistently set if required
- It's there when you actually need it
- It could be integrated with Solaris FMA to take some funky actions based on the nature of the failure, including cool messages telling you what you need to run to attempt a repair, etc.
- It's integrated (recursive, self-fulfilling benefit... ;)

As for the separate utility for different failure modes, I agree, *development* of these might be faster if everyone chases their own pet failure mode and contributes it, but I still think getting them integrated, either as optional actions on error, or as part of zdb or other, would be far better than having to go looking for the utility and 'give it a whirl'. But I'm sure that's a personal preference, and I'm sure there are those that would love the opportunity to roll their own. OK, I'm going to shut up now. I think I have done this to death, and I don't want to end up in everyone's kill filter. Cheers! Nathan.

Bob Friesenhahn wrote: On Tue, 4 Mar 2008, Nathan Kroenert wrote: The circus trick can be handled via a user-contributed utility. In fact, people can compete with their various repair utilities. There are only 1048576 1-bit permutations to try, and then the various two-bit permutations can be tried. That does not sound 'easy', and I consider that ZFS should be... :) and IMO it's something that should really be built in, not attacked with an add-on.

There are several reasons why this sort of thing should not be in ZFS itself. A big reason is that if it is in ZFS itself, it can only be updated via an OS patch or upgrade, along with a required reboot.
If it is in a utility, it can be downloaded and used as the user sees fit without any additional disruption to the system. While some errors are random, others follow well-defined patterns, so it may be that one utility is better than another, or that user-provided options can help achieve success faster. Bob
Re: [zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?
Nathan: yes. Flipping each bit and recomputing the checksum is not only possible, we actually did it in early versions of the code. The problem is that it's really expensive. For a 128K block, that's a million bits, so you have to re-run the checksum a million times, on 128K of data. That's 128GB of data to churn through.

So Bob: you're right too. It's generally much cheaper to retry the I/O, try another disk, try a ditto block, etc. That said, when all else fails, a 128GB computation is a lot cheaper than a restore from tape. At some point it becomes a bit philosophical. Suppose the block in question is a single user data block. How much of the machine should you be willing to dedicate to getting that block back? I mean, suppose you knew that it was theoretically possible, but would consume 500 hours of CPU time during which everything else would be slower -- and the affected app's read() system call would hang for 500 hours. What is the right policy? There's no one right answer. If we were to introduce a feature like this, we'd need some admin-settable limit on how much time to dedicate to it.

For some checksum functions like fletcher2 and fletcher4, it is possible to do much better than brute force because you can compute an incremental update -- that is, you can compute the effect of changing the nth bit without rerunning the entire checksum. This is, however, not possible with SHA-256 or any other secure hash.

We ended up taking that code out because single-bit errors didn't seem to arise in practice, and in testing, the error correction had a rather surprising unintended side effect: it masked bugs in the code! The nastiest kind of bug in ZFS is something we call a future leak, which is when some change from txg (transaction group) 37 ends up going out as part of txg 36. It normally wouldn't matter, except if you lost power before txg 37 was committed to disk. On reboot you'd have inconsistent on-disk state (all of 36 plus random bits of 37).
We developed coding practices and stress tests to catch future leaks, and as far as I know, we've never actually shipped one. But they are scary. If you *do* have a future leak, it's not uncommon for it to be a very small change, perhaps incrementing a counter in some on-disk structure. The thing is, if the counter is going from even to odd, that's exactly a one-bit change. The single-bit error correction logic would happily detect these and fix them up -- not at all what you want when testing! (Of course, we could turn it off during testing -- but then we wouldn't be testing it.) All that said, I'm still occasionally tempted to bring it back. It may become more relevant with flash memory as a storage medium. Jeff

On Sun, Mar 02, 2008 at 05:28:48PM -0600, Bob Friesenhahn wrote: On Mon, 3 Mar 2008, Nathan Kroenert wrote: Speaking of expensive but interesting things we could do: from the little I know of ZFS's checksum, it's NOT like the ECC checksum we use in memory, in that it's not something we can use to determine which bit flipped in the event that there was a single bit flip in the data. (I could be completely wrong here... but...)

It seems that the emphasis on single-bit errors may be misplaced. Is there evidence which suggests that single-bit errors are much more common than multiple-bit errors?

What is the chance we could put a little more resilience into ZFS such that if we do get a checksum error, we systematically flip each bit in sequence and check the checksum to see if we could in fact proceed (including writing the data back correctly)? It is easier to retry the disk read another 100 times or store the data in multiple places.

Or build into the checksum something analogous to ECC, so we can choose to use non-ZFS-protected disks and paths, but still have single-bit-flip protection... Disk drives commonly use an algorithm like Reed-Solomon (http://en.wikipedia.org/wiki/Reed-Solomon_error_correction) which provides forward error correction.
This is done in hardware. Doing the same in software is likely to be very slow.

What do others on the list think? Do we have enough folks using ZFS on HDS / EMC / other hardware RAID(X) environments that might find this useful?

It seems that since ZFS is intended to support extremely large storage pools, available energy should be spent ensuring that the storage pool remains healthy or can be repaired. Loss of individual file blocks is annoying, but loss of entire storage pools is devastating. Since raw disk is cheap (and backups are expensive), it makes sense to write more redundant data rather than to minimize loss through exotic algorithms. Even if RAID is not used, redundant copies may be used on the same disk to help protect against block read errors. Bob
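Jeff's observation that fletcher-style checksums admit an incremental update can be sketched with a simplified fletcher2-like sum: two running sums, with no 64-bit wraparound. This is an illustrative model I'm assuming for clarity, not ZFS's exact on-disk algorithm. Because the second sum accumulates the first after every word, a change to word i propagates into it once per remaining position:

```python
def fletcher2ish(words):
    # Simplified fletcher-style checksum: a is the plain sum of the
    # words, b accumulates a after each word (no modular wraparound).
    a = b = 0
    for w in words:
        a += w
        b += a
    return a, b

def incremental_update(a, b, n, i, old, new):
    # Replacing word i (0-indexed) of n words by 'new' shifts a by
    # delta exactly once, and shifts b by delta for each of the
    # (n - i) partial sums that include word i. No full re-scan.
    delta = new - old
    return a + delta, b + (n - i) * delta
```

This is what would make a bit-flip search cheap for fletcher: each candidate flip becomes an O(1) checksum update instead of an O(block-size) re-scan. A secure hash like SHA-256 deliberately destroys any such exploitable structure, so there the brute-force cost stands.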