Re: [zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?

2008-03-04 Thread Richard Elling
[slightly different angle below...]

Nathan Kroenert wrote:
 Hey, Bob,

 Though I have already got the answer I was looking for here, I thought 
 I'd at least take the time to provide my point of view as to my *why*...

 First: I don't think any of us have forgotten the goodness that ZFS's 
 checksum *can* bring.

 I'm also keenly aware that we have some customers running HDS / EMC 
 boxes who disable the ZFS checksum by default because they 'don't want 
 to have files break due to a single bit flip...' and they really don't 
 care where the flip happens, and they don't want to 'waste' disks or 
 bandwidth allowing ZFS to do its own protection when they already pay 
 for it inside their zillion dollar disk box. (Some say waste, some call 
 it insurance... ;). Oracle users in particular seem to have this 
 mindset, though that's another thread entirely. :)
   

If you look at the zfs-discuss archives, you will find anecdotes
of failing RAID arrays (yes, even expensive ones) and SAN switches
causing corruption which was detected by ZFS.  A telltale sign of
borken hardware is someone complaining that ZFS checksums are
borken, only to find out their hardware is at fault.

As for Oracle, modern releases of the Oracle database also have
checksumming enabled by default, so there is some merit to the
argument that ZFS checksums are redundant.  IMNSHO, ZFS is
not being designed to replace ASM.

 I'd suspect we don't hear people whining about single bit flips, because 
 they would not know if it's happening unless the app sitting on top had 
 its own protection. Or - if the error is obvious, or crashes their 
 system... Or if they were running ZFS, but at this stage, we cannot 
 distinguish between single-bit and massively crapped-out errors, so what's 
 to say we are NOT seeing it?

 Also - Don't assume bit rot on disk is the only way we can get single 
 bit errors.

 Considering that until very recently (and quite likely even now to a 
 reasonable extent), most CPUs did not have data protection in *every* 
 place data transited through, single bit flips are still a very real 
 possibility, and becoming more likely as process shrinks continue. 
 Granted, on CPUs with register parity protection, undetected doubles 
 are more likely to 'slip under the radar', as registers are typically 
 protected with parity at best, if at all... A single-bit flip in a 
 parity-protected register will be detected; a double won't.
   

It depends on the processor.  Most of the modern SPARC processors
have extensive error detection and correction inside.  But processors
are still different than memories in that the time a datum resides in a
single location is quite short.  We worry more about random data
losses when the datum is stored in one place for a long time, which
is why you see different sorts of data protection at the different layers
of a system design.  To put this in more mathematical terms, there is
a failure rate for each failure mode, but your exposure to the failure
mode is time bounded.
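
To put a rough number on that time-bounded exposure, here is a tiny
back-of-the-envelope sketch in C.  The FIT rate below is a made-up
placeholder, not a measured value for any real part; the only point is
how the expected number of upsets scales with how long the datum sits
in one place.

    #include <stdio.h>

    int
    main(void)
    {
        /* Hypothetical soft-error rate: failures per 1e9 device-hours. */
        double fit = 1000.0;
        double per_sec = fit / 1e9 / 3600.0;    /* failures per second */

        struct { const char *where; double seconds; } exposure[] = {
            { "CPU register, ~100 ns", 100e-9 },
            { "DRAM buffer, ~1 hour",  3600.0 },
            { "disk block, ~1 year",   3600.0 * 24 * 365 },
        };

        for (int i = 0; i < 3; i++)
            printf("%-24s expected upsets ~ %.3e\n",
                exposure[i].where, per_sec * exposure[i].seconds);
        return (0);
    }

With the same (fictitious) rate everywhere, the expected upsets differ
by roughly fourteen orders of magnitude between a register and a disk
block, which is why the heaviest protection goes where data rests the
longest.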

 It does seem that some of us are getting a little caught up in disks and 
 their magnificence in what they write to the platter and read back, and 
 overlooking the potential value of a simple (though potentially 
 computationally expensive) circus trick, which might, just might, make 
 your broken 1TB archive useful again...

 I don't think it's a good idea for us to assume that it's OK to 'leave 
 out' potential goodness for the masses that want to use ZFS in 
 non-enterprise environments like laptops / home PC's, or use commodity 
 components in conjunction with the Big Stuff... (Like white box PC's 
 connected to an EMC or HDS box... )

 Anyhoo - I'm glad we have pretty much already done this work once 
 before. It gives me hope that we'll see it make a comeback. ;)

 (And I look forward to Jeff & Co. developing a hyper cool way of 
 generating 12800 checksums using all 64 threads of a Niagara 2, 
 using the same source data in cache, so we don't need to hit memory, so 
 that it happens in the blink of an eye. or two. ok - maybe three... ;) 
 Maybe we could also use the SPU's as well... OK - So, I'm possibly 
 dreaming here, but hell, if I'm dreaming, why not dream big. :)
   

I sense that the requested behaviour here is to be able to
get to the corrupted contents of a file, even if we know it
is corrupted.  I think this is a good idea because:

1. The block is what is corrupted, not necessarily my file.
   A single block may contain several files which are grouped
   together, checksummed, and written to disk.

2.  The current behaviour of returning EIO when read()ing a
   file up to the (possible) corruption point is rather irritating,
   but probably the right thing to do.  Since we know the
   files affected, we could write a savior, provided we can
   get some reasonable response other than EIO.
   As Jeff points out, I'm not sure that automatic repair is
   the right answer, but a manual savior might work better
   than ...

Re: [zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?

2008-03-04 Thread Bob Friesenhahn
On Tue, 4 Mar 2008, Richard Elling wrote:

 Also note: the checksums don't have enough information to
 recreate the data for very many bit changes.  Hashes might,
 but I don't know anyone using sha256.

It is indeed important to recognize that the checksums are a way to 
detect that the data is incorrect rather than a way to tell that the 
data is correct.  There may be several permutations of wrong data 
which can result in the same checksum, but the probability of 
encountering those permutations due to natural causes is quite small.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/



Re: [zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?

2008-03-03 Thread Darren J Moffat
Jeff Bonwick wrote:
 All that said, I'm still occasionally tempted to bring it back.
 It may become more relevant with flash memory as a storage medium.

Would it be worth considering bringing it back as part of zdb rather than 
part of the core zio layer?

-- 
Darren J Moffat


Re: [zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?

2008-03-03 Thread Richard Elling
Darren J Moffat wrote:
 Jeff Bonwick wrote:
   
 All that said, I'm still occasionally tempted to bring it back.
 It may become more relevant with flash memory as a storage medium.
 

 Would it be worth considering bringing it back as part of zdb rather than 
 part of the core zio layer?

   

I'm not convinced that single bit flips are the common
failure mode for disks.  Most enterprise class disks already
have enough ECC to correct at least 8 bytes per block.
By the time the disk sends something back that it couldn't
correct, there is no telling how many bits have been flipped,
but I'll bet a steak dinner it is more than one.

There may be some benefit for path failures, but I've not
seen any measured data on those failure modes.  For paths
which have framing checksums, we would expect them to
be detected there.
 -- richard



Re: [zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?

2008-03-03 Thread Bob Friesenhahn
On Mon, 3 Mar 2008, Darren J Moffat wrote:

 I'm not convinced that single bit flips are the common
 failure mode for disks.  Most enterprise class disks already
 have enough ECC to correct at least 8 bytes per block.

 and for consumer rather than enterprise-class disks?

You are assuming that the ECC used for consumer disks is 
substantially different than that used for enterprise disks.  That 
is likely not the case since ECC is provided by a chip which costs a 
few dollars.  The only reason to use a lesser grade algorithm would be 
to save a small bit of storage space.

Consumer disks use essentially the same media as enterprise disks.

Consumer disks store a higher bit density on similar media.

Consumer disks have less precise/consistent head controllers than 
enterprise disks.

Consumer disks are less well-specified than enterprise disks.

Due to the higher bit density we can expect more wrong bits to be read 
since we are pushing the media harder.  Due to less consistent head 
controllers we can expect more incidences of reading or writing the 
wrong track or writing something which can't be read.  Consumer disks 
are often used in an environment where they may be physically 
disturbed while they are writing or reading the data.  Enterprise 
disks are usually used in very stable environments.

The upshot of this is that we can expect more unrecoverable errors, 
but it seems unlikely that there will be more single bit errors 
recoverable at the ZFS level.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/



Re: [zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?

2008-03-03 Thread Richard Elling
Bob Friesenhahn wrote:
 On Mon, 3 Mar 2008, Darren J Moffat wrote:

   
 I'm not convinced that single bit flips are the common
 failure mode for disks.  Most enterprise class disks already
 have enough ECC to correct at least 8 bytes per block.
   
 and for consumer rather than enterprise  class disks ?
 

 You are assuming that the ECC used for consumer disks is 
 substantially different than that used for enterprise disks.  That 
 is likely not the case since ECC is provided by a chip which costs a 
 few dollars.  The only reason to use a lesser grade algorithm would be 
 to save a small bit of storage space.

 Consumer disks use essentially the same media as enterprise disks.

 Consumer disks store a higher bit density on similar media.

 Consumer disks have less precise/consistent head controllers than 
 enterprise disks.

 Consumer disks are less well-specified than enterprise disks.

 Due to the higher bit density we can expect more wrong bits to be read 
 since we are pushing the media harder.  Due to less consistent head 
 controllers we can expect more incidences of reading or writing the 
 wrong track or writing something which can't be read.  Consumer disks 
 are often used in an environment where they may be physically 
 disturbed while they are writing or reading the data.  Enterprise 
 disks are usually used in very stable environments.

 The upshot of this is that we can expect more unrecoverable errors, 
 but it seems unlikely that there will be more single bit errors 
 recoverable at the ZFS level.
   

I agree, and am waiting to get the proceedings from FAST '08,
which lists some interesting papers.

A while back I blogged about an Adaptec online seminar
which addressed this topic.  Rather than repeating what they
said, I left a pointer and a recommendation.
http://blogs.sun.com/relling/entry/adaptec_webinar_on_disks_and

Also, note that the published reliability data from disk vendors
is constantly changing.  For laptop drives, we're seeing fewer
MTBF or UER specs and more head-landing (load/unload) specs.
It seems that an important failure mode for laptop disks is
wear-out at the landing site.  This is due to power management
powering off or spinning down the disk.  We don't tend to see
this failure mode in servers or RAID arrays.
 -- richard



Re: [zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?

2008-03-03 Thread Nathan Kroenert
Bob Friesenhahn wrote:
 On Tue, 4 Mar 2008, Nathan Kroenert wrote:

 It does seem that some of us are getting a little caught up in disks 
 and their magnificence in what they write to the platter and read 
 back, and overlooking the potential value of a simple (though 
 potentially computationally expensive) circus trick, which might, just 
 might, make your broken 1TB archive useful again...
 
 The circus trick can be handled via a user-contributed utility.  In 
 fact, people can compete with their various repair utilities.  There are 
 only 1048576 1-bit permutations to try, and then the various two-bit 
 permutations can be tried.

That does not sound 'easy', and I consider that ZFS should be... :) and 
IMO it's something that should really be built in, not attacked with an 
add-on.

I had (as did Jeff in his initial response) considered that we only need 
to actually try to flip 128KB worth of bits once... That many flips 
means that we are, in a way, 'processing' some 128GB in the worst case 
when re-generating checksums (see the quick arithmetic sketch below). 
Internal to a CPU, depending on cache aliasing, competing workloads, 
threadedness, etc, this could be dramatically variable... something I 
guess the ZFS team would want to keep out of the 'standard' filesystem 
operation... hm. :\
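
For scale, here is a quick arithmetic sketch (plain C, nothing
ZFS-specific): it just counts the 1-bit and 2-bit candidates for a
128K block, and how much data you would churn through if every
candidate costs one full re-checksum of the block.

    #include <stdio.h>
    #include <stdint.h>

    int
    main(void)
    {
        uint64_t blocksize = 128ULL * 1024;     /* bytes in a 128K block */
        uint64_t n = blocksize * 8;             /* bits: 1,048,576 */
        uint64_t one_bit = n;                   /* C(n,1) candidates */
        uint64_t two_bit = n * (n - 1) / 2;     /* C(n,2) candidates */

        printf("1-bit candidates: %llu  (~%.0f GiB to re-checksum)\n",
            (unsigned long long)one_bit,
            (double)one_bit * blocksize / (1024.0 * 1024 * 1024));
        printf("2-bit candidates: %llu  (~%.0f PiB to re-checksum)\n",
            (unsigned long long)two_bit,
            (double)two_bit * blocksize / 1125899906842624.0 /* 2^50 */);
        return (0);
    }

That works out to 128 GiB of re-checksumming for the single-bit case
(Jeff's figure below) and roughly 64 PiB for the two-bit case, which is
why brute force beyond single-bit repair is a non-starter.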

 I don't think it's a good idea for us to assume that it's OK to 'leave 
 out' potential goodness for the masses that want to use ZFS in 
 non-enterprise environments like laptops / home PC's, or use commodity 
 components in conjunction with the Big Stuff... (Like white box PC's 
 connected to an EMC or HDS box... )
 
 It seems that goodness for the masses has not been left out.  The 
 forthcoming ability to request duplicate ZFS blocks is very good news 
 indeed.  We are entering an age where the entry level SATA disk is 1TB 
 and users have more space than they know what to do with.  A little 
 replication gives these users something useful to do with their new disk 
 while avoiding the need for unreliable circus tricks to recover data.  
  ZFS goes far beyond MS-DOS's 'recover' command (which should have been 
  called 'destroy').

I never have enough space on my laptop... I guess I'm a freak.

But - I am sure that we are *both* right for some subsets of ZFS users, 
and that the more choice we have built into the filesystem, the better.

Thanks again for the comments!

Nathan.






Re: [zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?

2008-03-03 Thread Boyd Adamson
Nathan Kroenert [EMAIL PROTECTED] writes:
 Bob Friesenhahn wrote:
 On Tue, 4 Mar 2008, Nathan Kroenert wrote:

 It does seem that some of us are getting a little caught up in disks 
 and their magnificence in what they write to the platter and read 
 back, and overlooking the potential value of a simple (though 
 potentially computationally expensive) circus trick, which might, just 
 might, make your broken 1TB archive useful again...
 
 The circus trick can be handled via a user-contributed utility.  In 
 fact, people can compete with their various repair utilities.  There are 
 only 1048576 1-bit permutations to try, and then the various two-bit 
 permutations can be tried.

 That does not sound 'easy', and I consider that ZFS should be... :) and 
 IMO it's something that should really be built in, not attacked with an 
 addon.

 I had (as did Jeff in his initial response) considered that we only need 
 to actually try to flip 128KB worth of bits once... That many flips 
 means that we are, in a way, 'processing' some 128GB in the worst case when 
 re-generating checksums.  Internal to a CPU, depending on Cache 
 Aliasing, competing workloads, threadedness, etc, this could be 
 dramatically variable... something I guess the ZFS team would want to 
 keep out of the 'standard' filesystem operation... hm. :\

Maybe an option to scrub... something that says 'work on bitflips for
bad blocks', or 'work on bitflips for bad blocks in this file'.
Boyd


Re: [zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?

2008-03-03 Thread Nathan Kroenert
Hey, Bob

My perspective on the big reasons for it *to* be integrated would be:
  - It's tested, by the folks charged with making ZFS good
  - It's kept in sync with the differing zpool versions
  - It's documented
  - When the system *is* patched, any changes the patch brings are 
synced with the recovery mechanism
  - Being integrated, it has options that can be persistently set if 
required
  - It's there when you actually need it
  - It could be integrated with Solaris FMA to take some funky actions 
based on the nature of the failure, including cool messages telling you 
what you need to run to attempt a repair, etc.
  - It's integrated (recursive, self-fulfilling benefit... ;)

As for the separate utility for different failure modes, I agree, 
*development* of these might be faster if everyone chases their own pet 
failure mode and contributes it, but I still think getting them 
integrated, either as optional actions on error or as part of zdb or 
similar, would be far better than having to go looking for the utility 
and 'give it a whirl'.

But - I'm sure that's a personal preference, and I'm sure that there are 
those that would love the opportunity to roll their own.

OK - I'm going to shutup now. I think I have done this to death, and I 
don't want to end up in everyone's kill filter.

Cheers!

Nathan.



Bob Friesenhahn wrote:
 On Tue, 4 Mar 2008, Nathan Kroenert wrote:
 The circus trick can be handled via a user-contributed utility.  In fact, 
 people can compete with their various repair utilities.  There are only 
 1048576 1-bit permutations to try, and then the various two-bit permutations 
 can be tried.
 That does not sound 'easy', and I consider that ZFS should be... :) and IMO 
 it's something that should really be built in, not attacked with an addon.
 
 There are several reasons why this sort of thing should not be in ZFS 
 itself.  A big reason is that if it is in ZFS itself, it can only be 
 updated via an OS patch or upgrade, along with a required reboot.  If 
 it is in a utility, it can be downloaded and used as the user sees fit 
 without any additional disruption to the system.  While some errors 
 are random, others follow well defined patterns, so it may be that one 
 utility is better than another or that user-provided options can help 
 achieve success faster.
 
 Bob
 ==
 Bob Friesenhahn
 [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
 


Re: [zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?

2008-03-02 Thread Jeff Bonwick
Nathan: yes.  Flipping each bit and recomputing the checksum is not only
possible, we actually did it in early versions of the code.  The problem
is that it's really expensive.  For a 128K block, that's a million bits,
so you have to re-run the checksum a million times, on 128K of data.
That's 128GB of data to churn through.
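
For the curious, the shape of that loop is roughly the following.  This
is a from-scratch sketch, not the original ZFS source: the checksum is
left as an opaque callback, and try_single_bit_repair() is a name made
up for the illustration.

    #include <stddef.h>
    #include <stdint.h>

    /* 256-bit checksum: four 64-bit words, as carried in a block pointer. */
    typedef void (*cksum_fn_t)(const void *buf, size_t size,
        uint64_t cksum[4]);

    static int
    cksum_equal(const uint64_t a[4], const uint64_t b[4])
    {
        return (a[0] == b[0] && a[1] == b[1] &&
            a[2] == b[2] && a[3] == b[3]);
    }

    /*
     * Flip each bit in turn, re-run the checksum over the whole block,
     * and stop when it matches the expected value.  Returns the index
     * of the repaired bit (fix left in place) or -1 if no single-bit
     * flip helps.  That is O(nbits) full passes over the block -- for
     * a 128K block, about a million passes over 128K, i.e. the 128GB
     * of churn above.
     */
    long
    try_single_bit_repair(uint8_t *buf, size_t size,
        const uint64_t expected[4], cksum_fn_t cksum)
    {
        uint64_t actual[4];

        for (size_t bit = 0; bit < size * 8; bit++) {
            buf[bit / 8] ^= (uint8_t)(1u << (bit % 8));  /* flip candidate */
            cksum(buf, size, actual);
            if (cksum_equal(actual, expected))
                return ((long)bit);
            buf[bit / 8] ^= (uint8_t)(1u << (bit % 8));  /* undo, move on */
        }
        return (-1);
    }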

So Bob: you're right too.  It's generally much cheaper to retry the I/O,
try another disk, try a ditto block, etc.  That said, when all else fails,
a 128GB computation is a lot cheaper than a restore from tape.  At some
point it becomes a bit philosophical.  Suppose the block in question is
a single user data block.  How much of the machine should you be willing
to dedicate to getting that block back?  I mean, suppose you knew that
it was theoretically possible, but would consume 500 hours of CPU time
during which everything else would be slower -- and the affected app's
read() system call would hang for 500 hours.  What is the right policy?
There's no one right answer.  If we were to introduce a feature like this,
we'd need some admin-settable limit on how much time to dedicate to it.

For some checksum functions like fletcher2 and fletcher4, it is possible
to do much better than brute force because you can compute an incremental
update -- that is, you can compute the effect of changing the nth bit
without rerunning the entire checksum.  This is, however, not possible
with SHA-256 or any other secure hash.
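
To illustrate the algebra (this is not the shipped implementation, and
the function names are invented for the example), take a fletcher4-style
sum: four 64-bit accumulators a, b, c, d updated once per 32-bit word,
everything mod 2^64.  A change of delta in word j, with m = nwords - j
words processed from j onward, shifts the four sums by delta times 1,
m, m(m+1)/2 and m(m+1)(m+2)/6 respectively, so evaluating a candidate
bit flip costs O(1) instead of a full pass over the block.

    #include <stdint.h>

    void
    fletcher4_ref(const uint32_t *w, uint64_t nwords, uint64_t ck[4])
    {
        uint64_t a = 0, b = 0, c = 0, d = 0;

        for (uint64_t i = 0; i < nwords; i++) {
            a += w[i];      /* each running sum feeds the next */
            b += a;
            c += b;
            d += c;
        }
        ck[0] = a; ck[1] = b; ck[2] = c; ck[3] = d;
    }

    /*
     * Adjust an existing checksum for a single-bit flip in word j,
     * without touching the other words.  The triangular/tetrahedral
     * products below fit in 64 bits for ZFS-sized blocks (m <= 32768
     * for a 128K block), and the divisions are exact.
     */
    void
    fletcher4_flip_bit(uint64_t ck[4], const uint32_t *w, uint64_t nwords,
        uint64_t j, int bit)
    {
        uint64_t delta = (uint64_t)(w[j] ^ (1u << bit)) - (uint64_t)w[j];
        uint64_t m = nwords - j;

        ck[0] += delta;
        ck[1] += delta * m;
        ck[2] += delta * (m * (m + 1) / 2);
        ck[3] += delta * (m * (m + 1) * (m + 2) / 6);
    }

A repair loop can then run fletcher4_ref() once, and for each candidate
bit apply fletcher4_flip_bit() to a scratch copy of the four sums and
compare against the expected checksum.  None of this carries over to
SHA-256, which is exactly the point above.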

We ended up taking that code out because single-bit errors didn't seem
to arise in practice, and in testing, the error correction had a rather
surprising unintended side effect: it masked bugs in the code!

The nastiest kind of bug in ZFS is something we call a 'future leak',
which is when some change from txg (transaction group) 37 ends up
going out as part of txg 36.  It normally wouldn't matter, except if
you lost power before txg 37 was committed to disk.  On reboot you'd
have inconsistent on-disk state (all of 36 plus random bits of 37).
We developed coding practices and stress tests to catch future leaks,
and as far as I know we've never actually shipped one.  But they are scary.

If you *do* have a future leak, it's not uncommon for it to be a very
small change -- perhaps incrementing a counter in some on-disk structure.
The thing is, if the counter is going from even to odd, that's exactly
a one-bit change.  The single-bit error correction logic would happily
detect these and fix them up -- not at all what you want when testing!
(Of course, we could turn it off during testing -- but then we wouldn't
be testing it.)

All that said, I'm still occasionally tempted to bring it back.
It may become more relevant with flash memory as a storage medium.

Jeff

On Sun, Mar 02, 2008 at 05:28:48PM -0600, Bob Friesenhahn wrote:
 On Mon, 3 Mar 2008, Nathan Kroenert wrote:
  Speaking of expensive, but interesting things we could do -
 
  From the little I know of ZFS's checksum, it's NOT like the ECC
  checksum we use in memory in that it's not something we can use to
  determine which bit flipped in the event that there was a single bit
  flip in the data. (I could be completely wrong here... but...)
 
 It seems that the emphasis on single-bit errors may be misplaced.  Is 
 there evidence which suggests that single-bit errors are much more 
 common than multiple bit errors?
 
  What is the chance we could put a little more resilience into ZFS such
  that if we do get a checksum error, we systematically flip each bit in
  sequence and check the checksum to see if we could in fact proceed
  (including writing the data back correctly.).
 
 It is easier to retry the disk read another 100 times or store the 
 data in multiple places.
 
  Or build into the checksum something analogous to ECC so we can choose
  to use NON-ZFS protected disks and paths, but still have single bit flip
  protection...
 
 Disk drives commonly use an algorithm like Reed-Solomon 
 (http://en.wikipedia.org/wiki/Reed-Solomon_error_correction) which 
 provides forward-error correction.  This is done in hardware.  Doing 
 the same in software is likely to be very slow.
 
  What do others on the list think? Do we have enough folks using ZFS on
  HDS / EMC / other hardware RAID(X) environments that might find this useful?
 
 It seems that since ZFS is intended to support extremely large storage 
 pools, available energy should be spent ensuring that the storage pool 
 remains healthy or can be repaired.  Loss of individual file blocks is 
 annoying, but loss of entire storage pools is devastating.
 
 Since raw disk is cheap (and backups are expensive), it makes sense to 
 write more redundant data rather than to minimize loss through exotic 
 algorithms.  Even if RAID is not used, redundant copies may be used on 
 the same disk to help protect against block read errors.
 
 Bob
 ==
 Bob Friesenhahn
 [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer,