On Friday 10 July 2015 22:01:43 Karel Gardas wrote:
> On Fri, Jul 10, 2015 at 9:34 PM, Chris Cappuccio <ch...@nmedia.net> wrote:
> > My first impression is that offlining the drive after a single chunk
> > failure may be too aggressive, as some errors result from issues other
> > than drive failures.
> 
> Indeed, it may look too aggressive, but is the analysis in my comment
> correct? I mean: if a write to one or more chunks fails for whatever
> reason and we ignore it completely because at least one write succeeded,
> then the array is in an inconsistent state where some drives hold the
> new data and other drives still hold the old data. Since reads are done
> in round-robin fashion, there is a chance of reading the old data in the
> future. If that is correct, then I think it calls for a fix.

Your analysis is incorrect - offlining of chunks is handled via sr_ccb_done().
If the lower-level I/O indicates that an error occurred, the chunk is marked
offline, provided that the discipline has redundancy (for example, we do not
offline chunks for RAID 0 or CRYPTO, since that usually just makes things
worse). This applies to both read and write operations.
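
In rough terms, the decision on the I/O completion path amounts to the
following. This is a simplified userspace sketch with invented names and
types, not the actual sr_ccb_done() code, but it shows the shape of the check:

#include <stdio.h>

enum chunk_state { CHUNK_ONLINE, CHUNK_OFFLINE };

struct chunk {
	int			 id;
	enum chunk_state	 state;
};

struct discipline {
	const char	*name;
	int		 redundant;	/* RAID 1/5: yes, RAID 0/CRYPTO: no */
};

/*
 * Invoked when a low-level read or write on one chunk completes.
 * On error the chunk is taken offline, but only if the discipline
 * can still serve data without it; offlining a RAID 0 or CRYPTO
 * chunk would just turn one bad I/O into a dead volume.
 */
static void
io_done(const struct discipline *sd, struct chunk *ch, int io_error)
{
	if (!io_error)
		return;

	if (!sd->redundant) {
		printf("%s: chunk %d I/O error, no redundancy, left online\n",
		    sd->name, ch->id);
		return;
	}

	ch->state = CHUNK_OFFLINE;
	printf("%s: chunk %d I/O error, marked offline\n", sd->name, ch->id);
}

int
main(void)
{
	struct chunk mirror[2] = { { 0, CHUNK_ONLINE }, { 1, CHUNK_ONLINE } };
	struct discipline raid1 = { "RAID 1", 1 };

	io_done(&raid1, &mirror[1], 1);	/* write to chunk 1 fails */
	io_done(&raid1, &mirror[0], 0);	/* write to chunk 0 succeeds */

	/*
	 * Chunk 1 is now offline, so a later round-robin read cannot
	 * hand back the stale copy it holds.
	 */
	return 0;
}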

> If you do not like off-lining drives after just one failed read, then
> perhaps the correct approach would be to restart the whole work unit and
> force the write again? We could even have a threshold at which we stop
> and consider the problematic block not writeable after all. Would
> something like that be a better solution?

We already offline a chunk after a single read or write failure occurs. It
would be possible to implement some form of retry algorithm; however, at some
point we have to trust the lower layers (VFS, disk controller driver, disk
hardware, etc.).
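
If someone did want to experiment with a retry, the shape would be roughly
the following - a purely hypothetical sketch with invented names, not
existing softraid behaviour:

#include <stdbool.h>
#include <stdio.h>

#define MAX_WRITE_RETRIES	3

/*
 * Stand-in for a low-level chunk write; here it fails on the first
 * two attempts and succeeds on the third, purely for demonstration.
 */
static bool
chunk_write(int chunk, int attempt)
{
	(void)chunk;
	return attempt >= 2;
}

/*
 * Retry a failed chunk write a bounded number of times before
 * declaring the chunk bad (and offlining it), i.e. the threshold
 * idea from the quoted mail.  Returns true if a write succeeded.
 */
static bool
write_with_retry(int chunk)
{
	int attempt;

	for (attempt = 0; attempt < MAX_WRITE_RETRIES; attempt++) {
		if (chunk_write(chunk, attempt))
			return true;
		printf("chunk %d: write attempt %d failed, retrying\n",
		    chunk, attempt);
	}
	printf("chunk %d: giving up after %d attempts, offlining\n",
	    chunk, MAX_WRITE_RETRIES);
	return false;
}

int
main(void)
{
	write_with_retry(0);
	return 0;
}

Picking the retry limit is the hard part: set it too high and a dying disk
stalls the volume for a long time before it is finally offlined.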
