On Friday 10 July 2015 22:01:43 Karel Gardas wrote:
> On Fri, Jul 10, 2015 at 9:34 PM, Chris Cappuccio <ch...@nmedia.net> wrote:
> > My first impression: offlining the drive after a single chunk failure
> > may be too aggressive, as some errors are the result of issues other
> > than drive failures.
>
> Indeed, it may look too aggressive, but is the analysis written in my
> comment correct? I mean: if there is a write error to one or more
> chunk(s) for whatever reason, and we completely ignore it because at
> least one write succeeded, then the array is in an inconsistent state
> where some drive(s) hold correct data and other drive(s) hold previous
> data. Since reading is done in round-robin fashion, there is a chance
> that you will read the old data in the future. If this is correct,
> then I think it calls for a fix.
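(For readers following the thread, the stale-read scenario Karel describes
can be shown with a minimal standalone C sketch. This is not softraid code;
mirror_write()/mirror_read() and the single-int "chunks" are invented for
illustration. It models a two-chunk mirror where a failed write is silently
ignored and reads rotate round-robin.)

    /*
     * Illustrative sketch only -- not softraid code. Models a write
     * error on one mirror chunk being ignored, followed by
     * round-robin reads.
     */
    #include <stdio.h>

    #define NCHUNKS 2

    static int chunk[NCHUNKS];  /* one "block" per chunk */
    static int rr;              /* round-robin read counter */

    /* Write to all chunks; fail_mask marks chunks whose write is lost. */
    static void
    mirror_write(int value, unsigned fail_mask)
    {
            for (int i = 0; i < NCHUNKS; i++)
                    if (!(fail_mask & (1U << i)))
                            chunk[i] = value;
            /* Error is ignored: the caller is told the write succeeded. */
    }

    /* Read from the next chunk in round-robin order. */
    static int
    mirror_read(void)
    {
            return chunk[rr++ % NCHUNKS];
    }

    int
    main(void)
    {
            mirror_write(1, 0);     /* both chunks now hold 1 */
            mirror_write(2, 0x2);   /* write of 2 fails on chunk 1, ignored */

            for (int i = 0; i < 4; i++)
                    printf("read %d -> %d\n", i, mirror_read());
            /* Prints 2, 1, 2, 1: stale data on every other read. */
            return 0;
    }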
Your analysis is incorrect - offlining of chunks is handled via
sr_ccb_done(). If lower-level I/O indicates that an error occurred, the
chunk is marked offline, provided that the discipline has redundancy
(for example, we do not offline chunks for RAID 0 or CRYPTO, since that
usually just makes things worse). This applies to both read and write
operations.

> If you do not like offlining drive(s) after just one failed read, then
> perhaps the correct fix is to restart the whole work unit and force the
> write again? We could even have some threshold at which we stop and
> consider the problematic block genuinely unwritable. Would something
> like that be a better solution?

We already offline after a single read or write failure occurs - it
would be possible to implement some form of retry algorithm, however at
some point we have to trust the lower layers (VFS, disk controller
driver, disk hardware, etc.).
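To make that completion-path policy concrete, here is a hedged standalone
sketch of the decision as described above. The real logic lives in
sr_ccb_done() in sys/dev/softraid.c; the enum names, struct layout and
ccb_done() below are all invented for illustration, not the actual
softraid interfaces.

    /* Illustrative sketch of the offline-on-error policy, not softraid code. */
    #include <stdio.h>

    enum sr_level { SR_RAID0, SR_RAID1, SR_CRYPTO, SR_RAID5 };
    enum chunk_state { CHUNK_ONLINE, CHUNK_OFFLINE };

    struct chunk {
            enum chunk_state state;
    };

    /* Only redundant disciplines can afford to lose a chunk. */
    static int
    has_redundancy(enum sr_level level)
    {
            return level == SR_RAID1 || level == SR_RAID5;
    }

    /* Called on read or write completion; error != 0 on I/O failure. */
    static void
    ccb_done(enum sr_level level, struct chunk *ch, int error)
    {
            if (error == 0)
                    return;
            if (!has_redundancy(level)) {
                    /* Offlining a RAID 0/CRYPTO chunk just makes things worse. */
                    printf("I/O error, chunk left online\n");
                    return;
            }
            ch->state = CHUNK_OFFLINE;
            printf("I/O error, chunk marked offline\n");
    }

    int
    main(void)
    {
            struct chunk c = { CHUNK_ONLINE };

            ccb_done(SR_CRYPTO, &c, 1);  /* no redundancy: stays online */
            ccb_done(SR_RAID1, &c, 1);   /* redundant: offlined at once */
            return 0;
    }

The design trade-off is the one under discussion in this thread: offlining
on the first failure keeps a redundant array consistent, at the cost of
degrading the array on what may have been a transient error.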