Re: raid1 error handling and faulty drives

2008-02-26 Thread Philip Molter

Neil Brown wrote:

> I've recently become aware that we really need FAILFAST - possibly for
> all IO from RAID1/5.  Modern drives don't need any retry at the OS
> level - if the retry in the firmware cannot get the data, nothing will.
>
> Thanks.  I agree that we do need something along these lines.  It
> might be a while before I can give the patch the brainspace it
> deserves as I am travelling this fortnight.
>
> NeilBrown


Neil,

Was anything ever done with this idea?  I can throw my hat into the 
this-is-a-big-problem ring.  I oftentimes have a RAID1 disk fail on a 
heavy-I/O system and the system basically needs to be power-cycled if it 
is to come back up within the hour (any OS access to that RAID will hang).


Any word would be appreciated.

Philip


Re: raid1 error handling and faulty drives

2007-09-07 Thread Mike Accetta
Neil Brown writes:

> On Wednesday September 5, [EMAIL PROTECTED] wrote:
> > ...
> >
> > 2) It adds a threshold on the level of recent error activity which is
> >    acceptable in a given interval, all configured through /sys.  If a
> >    mirror has generated more errors in this interval than the threshold,
> >    it is kicked out of the array.
>
> This is probably a good idea.  It bothers me a little to require 2
> separate numbers in sysfs...
>
> When we get a read error, we quiesce the device, then try to sort out
> the read errors, so we effectively handle them in batches.  Maybe we
> should just set a number of seconds, and if there are 3 or more
> batches in that number of seconds, we kick the drive... just a thought.

I think I was just trying to be as flexible as possible.  If we were to
use one number, I'd do the opposite and fix the interval but allow the
threshold to be configured, just because I tend to think of a disk being
bad in terms of it having more than an expected number of errors in some
fixed interval, rather than in terms of it having a fixed number of
errors in less than some expected interval.  Mathematically the two
approaches ought to be equivalent.
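(For concreteness: "more than N errors inside a window of T seconds" and
"N errors accumulated in less than T seconds" both come down to the error
rate exceeding N/T, so fixing either number and exposing the other through
/sys should give the same family of policies.)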

> > One would think that #2 should not be necessary as the raid1 retry
> > logic already attempts to rewrite and then reread bad sectors and fails
> > the drive if it cannot do both.  However, what we observe is that the
> > re-write step succeeds as does the re-read but the drive is really no
> > more healthy.  Maybe the re-read is not actually going out to the media
> > in this case due to some caching effect?
>
> I have occasionally wondered if a cache would defeat this test.  I
> wonder if we can push a FORCE MEDIA ACCESS flag down with that
> read.  I'll ask.

I looked around for something like this but it doesn't appear to be
implemented, as far as I could see.  I couldn't even find an explicit
mention of read caching in any drive specs to begin with.  Read-ahead
seems to be the closest concept.
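
For reference, the "FORCE MEDIA ACCESS" idea maps onto the FUA bit that
SBC defines for READ(10)/READ(16), which is supposed to make the drive go
to the medium rather than answer from its cache.  I don't see any way to
ask for it on reads through the block layer, but it can at least be
exercised from user space via SG_IO.  The sketch below is only meant to
illustrate the concept (the fixed 512-byte block size and the missing
sense-data decoding are simplifications), not the in-kernel hook the
patch would actually need:

#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

/* Read 'nblocks' 512-byte blocks at 'lba' using SCSI READ(10) with the
 * FUA bit set, asking the drive to fetch from the medium instead of its
 * cache.  Illustration only; returns 0 on success, -1 on error. */
static int read_fua(int fd, uint32_t lba, uint16_t nblocks, void *buf)
{
	unsigned char cdb[10] = {
		0x28,				/* READ(10) */
		0x08,				/* FUA bit */
		lba >> 24, lba >> 16, lba >> 8, lba,
		0,
		nblocks >> 8, nblocks,
		0
	};
	unsigned char sense[32];
	struct sg_io_hdr io;

	memset(&io, 0, sizeof(io));
	io.interface_id    = 'S';
	io.cmd_len         = sizeof(cdb);
	io.cmdp            = cdb;
	io.dxfer_direction = SG_DXFER_FROM_DEV;
	io.dxferp          = buf;
	io.dxfer_len       = nblocks * 512;
	io.sbp             = sense;
	io.mx_sb_len       = sizeof(sense);
	io.timeout         = 10000;		/* milliseconds */

	if (ioctl(fd, SG_IO, &io) < 0 || io.status != 0)
		return -1;
	return 0;
}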

> Thanks.  I agree that we do need something along these lines.  It
> might be a while before I can give the patch the brainspace it
> deserves as I am travelling this fortnight.

Looking forward to further discussion.  Thank you!
--
Mike Accetta

ECI Telecom Ltd.
Transport Networking Division, US (previously Laurel Networks)


Re: raid1 error handling and faulty drives

2007-09-05 Thread Neil Brown
On Wednesday September 5, [EMAIL PROTECTED] wrote:
>
> I've been looking at ways to minimize the impact of a faulty drive in
> a raid1 array.  Our major problem is that a faulty drive can absorb
> lots of wall clock time in error recovery within the device driver
> (SATA libata error handler in this case), during which any further raid
> activity is blocked and the system effectively hangs.  This tends to
> negate the high availability advantage of placing the file system on a
> RAID array in the first place.
>
> We've had one particularly bad drive, for example, which could sync
> without indicating any write errors but as soon as it became active in
> the array would start yielding read errors.  In this particular case it
> would take 30 minutes or more for the process to progress to a point
> where some fatal error would occur to kick the drive out of the array
> and return the system to normal operation.
>
> For SATA, this effect can be partially mitigated by reducing the default
> 30 second timeout at the SCSI layer (/sys/block/sda/device/timeout).
> However, the system still spends 45 seconds or so per retry in the
> driver issuing various reset operations in an attempt to recover from
> the error before returning control to the SCSI layer.
>
> I've been experimenting with a patch which makes two basic changes.
>
> 1) It issues the first read request against a mirror with more than 1 drive
>    active using the BIO_RW_FAILFAST flag, to keep the SCSI layer from
>    re-trying the failed operation in the low-level device driver the
>    default 5 times.

I've recently become aware that we really need FAILFAST - possibly for
all IO from RAID1/5.  Modern drives don't need any retry at the OS
level - if the retry in the firmware cannot get the data, nothing will.
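
Purely as an illustration of what 1) above is doing (2.6.20-era bio
interface and names, not the actual patch), marking a read fail-fast is
roughly:

#include <linux/bio.h>
#include <linux/blkdev.h>

/* Submit a raid1 read with BIO_RW_FAILFAST set so the low-level driver
 * reports a failure instead of retrying it the default number of times.
 * Illustrative sketch only. */
static void raid1_submit_failfast_read(struct bio *bio,
				       struct block_device *bdev)
{
	bio->bi_bdev = bdev;
	bio->bi_rw  |= (1 << BIO_RW_FAILFAST);	/* skip lower-layer retries */
	generic_make_request(bio);
}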

 
> 2) It adds a threshold on the level of recent error activity which is
>    acceptable in a given interval, all configured through /sys.  If a
>    mirror has generated more errors in this interval than the threshold,
>    it is kicked out of the array.

This is probably a good idea.  It bothers me a little to require 2
separate numbers in sysfs...

When we get a read error, we quiesce the device, then try to sort out
the read errors, so we effectively handle them in batches.  Maybe we
should just set a number of seconds, and if there are 3 or more
batches in that number of seconds, we kick the drive... just a thought.
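
Just to make that concrete, a minimal sketch of the single-number variant
could look like the fragment below (the window field, the hard-coded
three batches and the names are all made up for illustration; the real
thing would sit in the per-mirror state with the window exposed through
sysfs):

#include <linux/jiffies.h>

/* Track read-error "batches" for one mirror and report when three or
 * more land inside a configurable window.  Illustrative sketch only. */
struct mirror_errstat {
	unsigned long window;		/* window length in jiffies (from sysfs) */
	unsigned long window_start;	/* jiffies when the current window opened */
	unsigned int  batches;		/* batches counted in the current window */
};

/* Call once per error batch; a non-zero return means the caller should
 * kick this mirror out of the array. */
static int mirror_note_error_batch(struct mirror_errstat *st)
{
	unsigned long now = jiffies;

	if (st->batches == 0 ||
	    time_after(now, st->window_start + st->window)) {
		st->window_start = now;		/* open a fresh window */
		st->batches = 0;
	}
	return ++st->batches >= 3;
}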

 
> One would think that #2 should not be necessary as the raid1 retry
> logic already attempts to rewrite and then reread bad sectors and fails
> the drive if it cannot do both.  However, what we observe is that the
> re-write step succeeds as does the re-read but the drive is really no
> more healthy.  Maybe the re-read is not actually going out to the media
> in this case due to some caching effect?

I have occasionally wondered if a cache would defeat this test.  I
wonder if we can push a FORCE MEDIA ACCESS flag down with that
read.  I'll ask.

 
> This patch (against 2.6.20) still contains some debugging printk's but
> should be otherwise functional.  I'd be interested in any feedback on
> this specific approach and would also be happy if it served to foster an
> error recovery discussion that comes up with an even better approach.

Thanks.  I agree that we do need something along these lines.  It
might be a while before I can give the patch the brainspace it
deserves as I am travelling this fortnight.

NeilBrown