On Wednesday September 5, [EMAIL PROTECTED] wrote:
I've been looking at ways to minimize the impact of a faulty drive in
a raid1 array. Our major problem is that a faulty drive can absorb
lots of wall clock time in error recovery within the device driver
(SATA libata error handler in this case), during which any further raid
activity is blocked and the system effectively hangs. This tends to
negate the high availability advantage of placing the file system on a
RAID array in the first place.
We've had one particularly bad drive, for example, which could sync
without indicating any write errors but as soon as it became active in
the array would start yielding read errors. In this particular case it
would take 30 minutes or more for the process to progress to a point
where some fatal error would occur to kick the drive out of the array
and return the system to normal operation.
For SATA, this effect can be partially mitigated by reducing the default
30 second timeout at the SCSI layer (/sys/block/sda/device/timeout).
However, the system still spends 45 seconds or so per retry in the
driver issuing various reset operations in an attempt to recover from
the error before returning control to the SCSI layer.
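The SCSI-layer timeout mentioned above can be inspected and lowered from a shell. Here "sda" is a placeholder for each member disk of the array, and 7 seconds is only an example value, not a recommendation from the patch:

```shell
# Show the current SCSI command timeout (in seconds) for this disk.
cat /sys/block/sda/device/timeout

# Lower it from the default 30s so a stuck command is abandoned sooner.
echo 7 > /sys/block/sda/device/timeout
```

Note this only shortens how long the SCSI layer waits per command; it does not avoid the time the low-level driver spends in its own reset/recovery sequence.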
I've been experimenting with a patch which makes two basic changes.
1) When a mirror has more than one active drive, it issues the first read
request with the BIO_RW_FAILFAST flag set, which stops the SCSI layer from
retrying the failed operation in the low-level device driver the default 5
times.
I've recently become aware that we really need FAILFAST - possibly for
all IO from RAID1/5. Modern drives don't need any retry at the OS
level - if the retry in the firmware cannot get the data, nothing will.
2) It adds a threshold on the level of recent error activity which is
acceptable in a given interval, all configured through /sys. If a
mirror has generated more errors in this interval than the threshold,
it is kicked out of the array.
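The rate-threshold idea can be sketched in user space as a sliding window of error timestamps. All names here (`error_window`, `record_error`, the field names) are illustrative and do not correspond to the actual patch or its sysfs attributes:

```c
#include <assert.h>
#include <string.h>

/* Sketch of the rate-threshold idea: remember the timestamps of recent
 * errors and report that the mirror should be failed once more than
 * max_errors have occurred within the last `interval` seconds. */
#define MAX_TRACKED 32

struct error_window {
    long stamps[MAX_TRACKED];   /* timestamps of recent errors (seconds) */
    int  count;                 /* how many entries are valid */
    long interval;              /* sliding window length (would come from sysfs) */
    int  max_errors;            /* threshold (would come from sysfs) */
};

/* Record an error at time `now`; return 1 if the drive should be kicked. */
static int record_error(struct error_window *w, long now)
{
    int kept = 0;

    /* Drop timestamps that have aged out of the window. */
    for (int i = 0; i < w->count; i++)
        if (now - w->stamps[i] <= w->interval)
            w->stamps[kept++] = w->stamps[i];
    w->count = kept;

    if (w->count < MAX_TRACKED)
        w->stamps[w->count++] = now;

    return w->count > w->max_errors;
}
```

With `interval = 60` and `max_errors = 3`, a fourth error inside one minute kicks the drive, while the same four errors spread over several minutes do not.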
This is probably a good idea. It bothers me a little to require 2
separate numbers in sysfs...
When we get a read error, we quiesce the device, then try to sort out
the read errors, so we effectively handle them in batches. Maybe we
should just set a number of seconds, and if there are 3 or more
batches in that number of seconds, we kick the drive... just a thought.
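The batch-counting variant needs even less state: remember when the last two error batches started, and kick on a third batch that falls within the window. This is a hypothetical sketch of the suggestion above, not code from md:

```c
#include <assert.h>

/* Start times of the last two error batches; -1 means "none yet". */
struct batch_tracker {
    long t0;    /* oldest tracked batch */
    long t1;    /* most recent batch */
};

/* Note that a new error batch started at time `now`; return 1 if this is
 * the third (or later) batch within `window` seconds, i.e. the drive
 * should be kicked. */
static int note_batch(struct batch_tracker *b, long now, long window)
{
    int kick = (b->t0 >= 0 && now - b->t0 <= window);

    b->t0 = b->t1;
    b->t1 = now;
    return kick;
}
```

For example, batches starting at t=0, 30, and 50 with a 60-second window kick the drive on the third batch, whereas batches at t=0, 100, and 200 never do.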
One would think that #2 should not be necessary as the raid1 retry
logic already attempts to rewrite and then reread bad sectors and fails
the drive if it cannot do both. However, what we observe is that the
re-write step succeeds, as does the re-read, but the drive is really no
healthier. Maybe the re-read is not actually going out to the media
in this case due to some caching effect?
I have occasionally wondered if a cache would defeat this test. I
wonder if we can push a FORCE MEDIA ACCESS flag down with that
read. I'll ask.
This patch (against 2.6.20) still contains some debugging printk's but
should be otherwise functional. I'd be interested in any feedback on
this specific approach and would also be happy if this served to foster
an error recovery discussion which came up with some even better approach.
Thanks. I agree that we do need something along these lines. It
might be a while before I can give the patch the brainspace it
deserves as I am travelling this fortnight.
NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html