> > Au contraire:  I estimate its worth quite accurately from the
> > undetected error rates reported in the CERN "Data Integrity" paper
> > published last April (first hit if you Google 'cern "data integrity"').
> >
> > > While I have yet to see any checksum error reported by ZFS on
> > > Symmetrix arrays or FC/SAS arrays, with some other "cheap" HW
> > > I've seen many of them.
> >
> > While one can never properly diagnose anecdotal issues off the cuff
> > in a Web forum, given CERN's experience you should probably check
> > your configuration very thoroughly for things like marginal
> > connections:  unless you're dealing with a far larger data set than
> > CERN was, you shouldn't have seen 'many' checksum errors.
> 
> Well, single-bit error rates may be rare in hard drives under normal
> operation, but from a systems perspective data can be corrupted
> anywhere between disk and CPU.

The CERN study found that such errors (if they found any at all, which they 
couldn't really be sure of) were far less common than the manufacturer's spec 
for plain old detectable-but-unrecoverable bit errors, and far less common than 
the one hardware problem they did discover:  a disk firmware bug, apparently 
related to the unusual demands (and perhaps negligent error reporting) of 
their RAID controller, that caused errors at a rate about an order of 
magnitude higher than that nominal spec.

This suggests that in a ZFS-style installation without a hardware RAID 
controller they would have experienced at worst one bit error per 10^14 bits 
read, i.e. roughly every 12.5 TB (the manufacturer's spec rate for detectable 
but unrecoverable errors) - though some studies suggest that the actual 
incidence of 'bit rot' is considerably lower than such specs.  Furthermore, 
simply scrubbing the disks in the background (as I believe some open-source 
LVMs are starting to do, and for that matter some disks are starting to do 
themselves) would catch virtually all such errors in a manner that allows a 
conventional RAID to correct them, leaving a residue of something more like 
one error per PB that ZFS could catch better than anyone else save WAFL.
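For concreteness, the spec-rate arithmetic above can be checked in a few 
lines.  This is just a sketch:  the 10^-14 per-bit figure is the 
manufacturer's spec quoted above, and the order-of-magnitude factor for the 
firmware bug is taken from the CERN discussion; nothing else is measured data.

```python
# Back-of-the-envelope check of the unrecoverable-bit-error arithmetic.
# Assumes the manufacturer's spec of one detectable-but-unrecoverable
# error per 10^14 bits read, as quoted in the text above.

BER = 1e-14                 # spec: errors per bit read
BITS_PER_TB = 8 * 1e12      # 1 TB = 10^12 bytes = 8 * 10^12 bits

tb_per_error = 1 / (BER * BITS_PER_TB)
print(f"~{tb_per_error:.1f} TB read per unrecoverable bit error")   # ~12.5 TB

# CERN's firmware bug produced errors roughly an order of magnitude
# more often, i.e. about one per 1.25 TB read:
print(f"~{tb_per_error / 10:.2f} TB per error with the firmware bug")
```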

> I know you're not interested in anecdotal evidence,

It's less that I'm not interested in it than that I don't find it very 
convincing when actual quantitative evidence is available that doesn't seem to 
support its importance.  I know very well that things like lost and wild writes 
occur, as well as the kind of otherwise undetected bus errors that you 
describe, but the available evidence seems to suggest that they occur in such 
small numbers that catching them is of at most secondary importance compared to 
many other issues.  All other things being equal, I'd certainly pick a file 
system that could do so, but when other things are *not* equal I don't think it 
would be a compelling attraction.

> but I had a box that was randomly corrupting blocks during DMA.  The
> errors showed up when doing a ZFS scrub and I caught the problem in
> time.

Yup - that's exactly the kind of error that ZFS and WAFL do a perhaps uniquely 
good job of catching.  Of course, buggy hardware can cause errors that trash 
your data in RAM beyond any hope of detection by ZFS, but (again, other things 
being equal) I agree that the more ways you have to detect them, the better.  
That said, it would be interesting to know who made this buggy hardware.
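The mechanism at work here is easy to sketch:  an end-to-end checksum 
computed when a block is written and verified when it is read back (or 
scrubbed) flags corruption introduced anywhere along the path, including 
during DMA, even when the disk itself reports a successful read.  A minimal 
illustration - SHA-256 stands in for ZFS's actual checksum choices, and the 
block and bit-flip are made up for the demo:

```python
import hashlib

def checksum(block: bytes) -> bytes:
    # ZFS offers several checksum algorithms; SHA-256 used here for simplicity.
    return hashlib.sha256(block).digest()

# "Write": record the checksum alongside the data.
block = bytes(4096)                  # a 4 KB block of zeros
stored_sum = checksum(block)

# Simulate a single bit flipped in transit (e.g. by a flaky DMA engine).
corrupted = bytearray(block)
corrupted[1234] ^= 0x01
corrupted = bytes(corrupted)

# "Scrub": re-read and verify.  The mismatch exposes the corruption even
# though no I/O error was ever reported.
if checksum(corrupted) != stored_sum:
    print("checksum mismatch detected - block flagged for repair from redundancy")
```

A plain hardware RAID can't do this, because it has no independent record of 
what the data *should* be once the controller has handed it off.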

...

> Like others have said for big business; as a consumer I can reasonably
> comfortably buy off-the-shelf cheap controllers and disks, and know
> that should any part of the system be flaky enough to cause data
> corruption, the software layer will catch it, which both saves money
> and creates peace of mind.

CERN was using relatively cheap disks and found that they were more than 
adequate (at least for any normal consumer use) without that additional level 
of protection.  The incidence of errors - even including the firmware errors, 
which presumably would not have occurred in a normal consumer installation 
lacking hardware RAID - was on the order of 1 per TB.  And given that it's 
really, really difficult for a consumer to come anywhere near that much data 
without most of it being video files (which just laugh and keep playing when 
they discover small errors), that's pretty much tantamount to saying that 
consumers would encounter no *noticeable* errors at all.

Your position is similar to that of an audiophile enthused about a measurable 
but marginal increase in music quality and trying to convince the hoi polloi 
that no other system will do:  while other audiophiles may agree with you, most 
people just won't consider it important - and in fact won't even be able to 
distinguish it at all.

- bill
 
 
This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
