On Thu, Jun 18, 2015 at 6:32 PM, Joel Sing <[email protected]> wrote:
> Re adding some form of checksumming, it only seems to make sense in the
> case of RAID 1 where you can decide that the data on a disk is invalid,
> then fail the read and pull the data from another drive. That coupled
> with block level "healing" or similar could be interesting. Otherwise
> checksumming on its own is not overly useful at this level - you would
> simply fail a read, which then results in potentially worse than
> bit-flipping at higher layers.
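Just to check that I'm reading this right, the read path with per-block
checksums would then look something along these lines. This is only a toy
sketch with names made up by me (raid1c_read and friends are mine, not
softraid's), using two in-memory "chunks" in place of the mirrored disks:

/*
 * Toy model: verify a per-block checksum on read, fall back to the
 * other mirror on mismatch, and "heal" the bad copy.  All names are
 * made up; this is not the real softraid code.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NCHUNKS	2	/* RAID 1 mirrors */
#define BLOCKSZ	512
#define NBLOCKS	8

struct block {
	uint8_t		data[BLOCKSZ];
	uint32_t	csum;
};

static struct block chunk[NCHUNKS][NBLOCKS];

/* Stand-in checksum; the real thing would be CRC32 or similar. */
static uint32_t
csum(const uint8_t *p, size_t len)
{
	uint32_t s = 0;

	while (len--)
		s = s * 31 + *p++;
	return s;
}

/* Read one block, trying each mirror until the checksum verifies. */
static int
raid1c_read(int blk, uint8_t *buf)
{
	int i, j;

	for (i = 0; i < NCHUNKS; i++) {
		struct block *b = &chunk[i][blk];

		if (csum(b->data, BLOCKSZ) != b->csum) {
			fprintf(stderr, "chunk %d blk %d: bad csum\n", i, blk);
			continue;
		}
		memcpy(buf, b->data, BLOCKSZ);
		/* Heal the mirrors that failed verification. */
		for (j = 0; j < i; j++)
			chunk[j][blk] = *b;
		return 0;
	}
	return -1;	/* every copy is bad: fail the read outright */
}

int
main(void)
{
	uint8_t buf[BLOCKSZ];
	int i;

	/* Write block 3 to both mirrors, checksummed. */
	for (i = 0; i < NCHUNKS; i++) {
		memset(chunk[i][3].data, 'x', BLOCKSZ);
		chunk[i][3].csum = csum(chunk[i][3].data, BLOCKSZ);
	}

	chunk[0][3].data[0] ^= 0x01;	/* simulate a bit flip on disk 0 */

	if (raid1c_read(3, buf) == 0)
		printf("read ok, bad mirror healed\n");
	return 0;
}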
Honestly speaking, I value reliability to an extreme degree, so I would
rather have a failed read than a silent bit-flip propagated up to the
application level.

> If you wanted to investigate this I would suggest considering it as an
> option to the existing RAID 1 implementation. The bulk of it would be
> calculating and adding a checksum to each write and offsetting each
> block accordingly, along with verification on read. The failure modes
> would need to be thought through and handled - the re-reading from a
> different disk is already there, however what you then do with the
> failure is an open question (failing the chunk entirely is the heavy
> handed but already supported approach).

I will see what I can do with it over the summer. There are indeed a lot
of questions that will need to be solved, but let's see if I can come up
with a patch as a proof of concept first. For now I'm thinking about
propagating failures in some form into sensor values and showing them to
the user without a hard chunk failure -- at least up to some predefined
threshold. (A rough sketch of how I understand the on-disk layout is in
the P.S. below.)

Also, what I'm used to having is some kind of scrub. It looks like scrub
is partly supported in the kernel -- at least there are some signs of
scrub support there -- but I have yet to find out for what reason, since
bioctl does not know about it (yet?). Anyway, correct scrubbing would
require zeroing the whole drive first, but after that it could probably
be forced just by dd'ing the whole volume to /dev/null -- just a wild
guess.

Thanks for your information!

Karel
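P.S. To make sure I understand the "offsetting each block" part: with
512-byte blocks and 4-byte CRC32 checksums, one checksum block covers
128 data blocks, so every group of 128 data blocks on disk would be
preceded by one block of packed checksums, and all logical block numbers
shift accordingly. The arithmetic I have in mind (again, a layout made
up by me as one possibility, not necessarily what softraid would do):

#include <stdint.h>
#include <stdio.h>

#define BLOCKSZ		512
#define CSUMSZ		4			/* e.g. CRC32 */
#define CSUM_PER_BLK	(BLOCKSZ / CSUMSZ)	/* 128 */
#define GROUPSZ		(CSUM_PER_BLK + 1)	/* 129 on-disk blocks */

/* Physical block holding logical data block n. */
static uint64_t
data_blk(uint64_t n)
{
	return (n / CSUM_PER_BLK) * GROUPSZ + 1 + n % CSUM_PER_BLK;
}

/* Physical block holding logical block n's checksum, plus byte offset. */
static uint64_t
csum_blk(uint64_t n, uint32_t *off)
{
	*off = (n % CSUM_PER_BLK) * CSUMSZ;
	return (n / CSUM_PER_BLK) * GROUPSZ;
}

int
main(void)
{
	uint64_t n = 1000;
	uint32_t off;

	printf("logical %llu -> physical %llu, csum in blk %llu at byte %u\n",
	    (unsigned long long)n, (unsigned long long)data_blk(n),
	    (unsigned long long)csum_blk(n, &off), off);
	return 0;
}

One thing this makes obvious is that every data write would also turn
into a read-modify-write of the shared checksum block, which is probably
where most of the performance cost would come from.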
