On Thu, Jun 18, 2015 at 6:32 PM, Joel Sing <[email protected]> wrote:
>
> Re adding some form of checksumming, it only seems to make sense in the case
> of RAID 1 where you can decide that the data on a disk is invalid, then fail
> the read and pull the data from another drive. That coupled with block
> level "healing" or similar could be interesting. Otherwise checksumming on
> its own is not overly useful at this level - you would simply fail a read,
> which then results in potentially worse than bit-flipping at higher layers.

Honestly speaking, I value reliability extremely highly, so I would
rather have a failed read than a silent bit-flip propagated up to the
application level.

> If you wanted to investigate this I would suggest considering it as an option
> to the existing RAID 1 implementation. The bulk of it would be calculating
> and adding a checksum to each write and offsetting each block accordingly,
> along with verification on read. The failure modes would need to be thought
> through and handled - the re-reading from a different disk is already there,
> however what you then do with the failure is an open question (failing the
> chunk entirely is the heavy handed but already supported approach).
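
To make sure I understand the shape of it, here is a toy user-space
sketch of the write/verify flow as I picture it. Everything in it (the
struct layout, the names, the 512-byte block size, the choice of CRC32)
is a placeholder of mine, not actual softraid code:

/*
 * Toy model of checksummed RAID 1 blocks: writes compute and store a
 * checksum next to the data, reads verify it and report a mismatch so
 * the caller can fall back to the other chunk.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLKSZ 512

struct cblock {
	uint8_t		data[BLKSZ];
	uint32_t	csum;
};

/* Minimal bitwise CRC32 (reflected, polynomial 0xedb88320). */
static uint32_t
crc32(const uint8_t *buf, size_t len)
{
	uint32_t crc = ~0U;
	size_t i;
	int j;

	for (i = 0; i < len; i++) {
		crc ^= buf[i];
		for (j = 0; j < 8; j++)
			crc = (crc >> 1) ^ (0xedb88320U & -(crc & 1));
	}
	return ~crc;
}

/* Write path: store the block together with its checksum. */
static void
cblock_write(struct cblock *cb, const uint8_t *data)
{
	memcpy(cb->data, data, BLKSZ);
	cb->csum = crc32(cb->data, BLKSZ);
}

/* Read path: verify before handing the data up; -1 on mismatch. */
static int
cblock_read(const struct cblock *cb, uint8_t *data)
{
	if (crc32(cb->data, BLKSZ) != cb->csum)
		return -1;
	memcpy(data, cb->data, BLKSZ);
	return 0;
}

int
main(void)
{
	struct cblock cb;
	uint8_t buf[BLKSZ] = "hello";

	cblock_write(&cb, buf);
	cb.data[0] ^= 0x01;	/* simulate a silent bit-flip on disk */
	if (cblock_read(&cb, buf) == -1)
		printf("checksum mismatch: retry from mirror\n");
	return 0;
}

The -1 return from the read path is the point where RAID 1 could
re-issue the read against the other disk instead of failing the I/O
outright.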

I will see what I can do with it over the summer. There are indeed a
lot of questions that will need to be answered, but let's see if I can
come up with a patch as a proof of concept first. For now I'm thinking
about propagating failures in some form into sensor values and showing
them to the user without a hard chunk failure -- at least up to some
predefined threshold. Also, what I'm used to having is some kind of
scrub, which looks to be partially supported in the kernel; at least
there are some signs of scrub support there, but I have yet to find
out what they are for, since bioctl does not know about it (yet?).
Anyway, correct scrubbing would require zeroing the whole drive first,
but after that it could probably be done just by dd'ing the drive to
/dev/null -- just a wild guess.
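
Concretely, that wild guess amounts to something like the sketch
below: sequentially read the whole volume so every block passes
through the (hypothetical) checksumming read path once, the same thing
dd if=/dev/rsdXc of=/dev/null bs=64k would do. The device path is of
course a placeholder:

/*
 * Poor man's scrub: read the entire softraid volume sequentially so
 * that each block gets verified once on the way up.
 */
#include <err.h>
#include <fcntl.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
	char buf[64 * 1024];
	ssize_t n;
	int fd;

	if (argc != 2)
		errx(1, "usage: scrub /dev/rsdXc");
	if ((fd = open(argv[1], O_RDONLY)) == -1)
		err(1, "open %s", argv[1]);
	while ((n = read(fd, buf, sizeof(buf))) > 0)
		continue;	/* data is discarded; we only want the verify */
	if (n == -1)
		warn("read error (possible checksum mismatch)");
	close(fd);
	return 0;
}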

Thanks for the information!
Karel
