> Richard Elling wrote:
> Perhaps I am not being clear.  If a disk is really dead, then
> there are several different failure modes that can be responsible.
> For example, if a disk does not respond to selection, then it
> is diagnosed as failed very quickly. But that is not the TLER
> case.  The TLER case is when the disk cannot read from
> media without error, so it will continue to retry... perhaps
> forever or until reset. If a disk does not complete an I/O operation
> in (default) 60 seconds (for sd driver), then it will be reset and
> the I/O operation retried.
I suspect you're being clear - it's just that I'm building the runway 
ahead of the plane as I take off. 8-)

So one case is the disk hits an error on a sector read and retries, 
potentially for a long, long time. The SD driver waits 60 seconds, then
resets the disk if one hasn't changed the timeout. Presumably the 
I/O operation is then retried for the portion of the read that's on that disk.
(...er, that's what I think would happen, anyway; is that right?)

What happens then depends on whether the disk in question returns
good data on the retried operation. In the case of
(a) retry gives a good result: does the driver/ZFS mark the block as 
problematic and move it to a different physical sector? Or just note
that there was a problem there? This is what I'd call the soft-error-once
scenario.
(b) retry takes more than 60 seconds again: I'm not clear what
the driver does here. N in T? Two tries and I remap you? Semi-
infinite loop of retries? This is what I'd call the soft-error-forever
scenario, and this is the error that would distinguish between 
TLER/ERC/CCTL disks and desktop disks (see the sketch just below).
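
To make sure I have the distinction straight, here's a toy sketch of what
I *think* the drive-side difference looks like in case (b). The ~7 second
ERC limit is just the commonly cited factory default for RAID-edition
drives (my assumption, not something from this thread); the 60 seconds is
the sd default you mentioned.

# Toy model of case (b), "soft error forever": what the host sees from an
# ERC/TLER drive versus a desktop drive when a sector is unreadable on media.
# Assumption: ~7 s is a typical RAID-edition ERC default; 60 s is the sd
# driver default timeout discussed in this thread.

SD_TIMEOUT = 60   # seconds before the sd driver gives up waiting and resets the disk

def drive_response(erc_limit_sec):
    """erc_limit_sec is the drive's internal recovery limit, or None for a
    desktop drive that retries internally until it is reset."""
    if erc_limit_sec is not None and erc_limit_sec < SD_TIMEOUT:
        return "media error reported after ~%ds; host can rebuild from redundancy" % erc_limit_sec
    return "no reply; sd resets the drive after %ds and retries" % SD_TIMEOUT

print(drive_response(7))     # TLER/ERC/CCTL disk
print(drive_response(None))  # desktop disk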

> If a disk returns bogus data (failed ZFS checksum),
> then the N in T algorithm may kick in. I have seen this
> failure mode many times.
This is yet another error, not related to TLER/ERC/CCTL.
In this case, the disk returns data that is wrong.  In my 
limited understanding, this is what would happen in a scrub 
operation where a soft error has occurred, or where incorrect
data has been correctly written to the disk. The disk does not
detect an error and go off into the tall grass counting
its toes forever; instead it promptly returns bad data. 

Both desktop and RAID-edition disks would be OK in ZFS with
this error, in that the error would be handled by the error
paths I already (if dimly!) comprehend.
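
To illustrate (to myself as much as anyone) why both kinds of disk are
fine here, a minimal checksum sketch. The hash is just for illustration,
not ZFS's actual checksum algorithm, and the block-pointer detail is my
paraphrase of how ZFS stores checksums:

# Why "promptly returns wrong data" is caught regardless of drive type:
# ZFS checksums every block and keeps the checksum with the parent block
# pointer, so bad data that comes back quickly and cleanly still fails
# verification.  SHA-256 here is illustrative only.
import hashlib

def write_block(data):
    return data, hashlib.sha256(data).digest()   # data + checksum stored separately

def read_block(stored_data, expected_checksum):
    if hashlib.sha256(stored_data).digest() != expected_checksum:
        # No timeout involved: the read "succeeded" but the contents are wrong.
        # ZFS would now read the other mirror/RAID-Z copy and repair this one.
        raise IOError("checksum mismatch: reconstruct from redundancy")
    return stored_data

data, cksum = write_block(b"good data")
try:
    read_block(b"bad  data", cksum)              # what the flaky disk returned
except IOError as e:
    print(e)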

> A similar observation is that the error rate (errors/bit) has not
> changed, but the number of bits continues to increase.
Yes. The paper notes that the bit error rate has improved by two 
orders of magnitude, but the number of bits has kept slightly ahead.

The killer is that the time required to fill a replacement disk with 
new, correct data has distinctly not kept pace with BER or capacity.
That leads to long repair operations, which widen the window in which
a subsequent failure can occur and therefore increase the probability
of one happening.
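
A back-of-the-envelope sketch of why that matters. The 1e-14 and 1e-15
unrecoverable-read-error rates are the usual spec-sheet numbers for
desktop and enterprise drives, my assumption rather than figures from
the paper:

# Chance of hitting at least one unrecoverable read error while reading a
# whole replacement disk's worth of data during a rebuild.
# Assumed spec-sheet rates: ~1e-14 errors/bit (desktop), ~1e-15 (enterprise).
import math

def p_read_error_during_rebuild(capacity_tb, bit_error_rate):
    bits = capacity_tb * 1e12 * 8
    # P(at least one bad bit) = 1 - (1 - ber)^bits, computed stably
    return -math.expm1(bits * math.log1p(-bit_error_rate))

for tb in (0.5, 1.0, 2.0):
    print("%.1f TB: desktop %.1f%%, enterprise %.1f%%" % (
        tb,
        100 * p_read_error_during_rebuild(tb, 1e-14),
        100 * p_read_error_during_rebuild(tb, 1e-15)))
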
> >> For disks which don't return when there is an error, you can
> >> reasonably expect that T will be a long time (multiples of 60
> >> seconds) and therefore the N in T threshold will not be triggered.
> > The scenario I had in mind was two disks ready to fail, either
> > soft (long time to return data) or hard (bang! That sector/block
> > or disk is not coming back, period). The first fails and starts
> > trying to recover in desktop-disk fashion, maybe taking hours.
> Yes, this is the case for TLER. The only way around
> this is to use disks that return failures when they occur.
OK. From the above suppositions, if we had a desktop (infinitely
long retry on fail) disk and a soft-fail error in a sector, then the 
disk would effectively hang each time the sector was accessed.
This would lead to 
(1) ZFS -> SD -> disk read of failing sector
(2) disk does not reply within 60 seconds (default) 
(3) disk is reset by SD
(4) operation is retried by SD(?)
(5) disk does not reply within 60 seconds (default)
(6) disk is reset by SD(?)

then what? If I'm reading you correctly, the following string of
events happens:

> The drivers will retry and fail the I/O. By default, for SATA
> disks using the sd driver, there are 5 retries of 60 seconds.
> After 5 minutes, the I/O will be declared failed and that info
> is passed back up the stack to ZFS, which will start its
> recovery.  This is why the T part of N in T doesn't work so
> well for the TLER case.
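
So, putting numbers on the sequence above (the 60 seconds and 5 retries
are your figures; the rest is just me spelling out the timeline):

# Timeline for a desktop drive stuck in internal retries on an unreadable
# sector, using the sd defaults described above: 60 s timeout, 5 retries.
SD_TIMEOUT_SEC = 60
SD_RETRIES = 5

elapsed = 0
for attempt in range(1, SD_RETRIES + 1):
    elapsed += SD_TIMEOUT_SEC   # drive never answers, so each attempt burns the full timeout
    print("attempt %d: no reply after %ds, reset disk (t=%ds)" % (attempt, SD_TIMEOUT_SEC, elapsed))
print("t=%ds: sd declares the I/O failed; ZFS starts its recovery" % elapsed)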

Hmmm... actually, it may be just fine for my personal wants. 
If I had a desktop drive which went unresponsive for 60 seconds
on an I/O soft error, then the timeout would be five minutes. 
At that time, ZFS would... check me here... mark the block as
failed and try to relocate the block on the disk. If that worked
fine, the previous sectors would be marked as unusable and 
work would go on, but with the actions noted in the logs. 

If the relocation didn't work, eventually ZFS(?) or SD(?) would decide
that the disk was unusable, and ... yelp for help?... start rebuilding?
roll in the hot spares?... send in the clowns?

I want zfs for background scrubbing and am only minimally worried
about speed and throughput. So taking five minutes to recover
from a disk failure is not a big issue to me. I just want to not lose 
bits once I put them into the zfs bucket. 

Again, I apologize for the Ned-and-the-first-reader questions. I'm 
trying to locate what happens in the manual and code, but I'm 
kind of building the runway ahead of the plane taking off. I 
really very much appreciate your taking time to help me 
understand.

> I don't think the second disk scenario adds value to
> this analysis.
Only that it is the motivator for wanting recovery to be as short
as possible. If each disk has a bit error rate of one bit per X 
seconds/days/years, then the probability of losing data can be 
expressed as a function of how long the array spends between the
occurrence of the first error and the restoration of full redundancy.
That window is the time to do any rebuild/resilvering/reintroduction
of disks to get back to stable operation.
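
For my own sanity, a rough sketch of that relationship. The
independent-failure/constant-rate model and the example MTBF are
assumptions for illustration, not anything from this thread:

# P(a second disk fails before redundancy is restored), assuming the
# surviving disks fail independently at a constant rate (exponential model).
import math

def p_second_failure(surviving_disks, mtbf_hours, repair_hours):
    combined_rate = surviving_disks / float(mtbf_hours)
    return 1 - math.exp(-combined_rate * repair_hours)

for repair_hours in (6, 24, 72):   # time spent resilvering back to full redundancy
    print("%3d h repair: %.4f%%" % (repair_hours,
          100 * p_second_failure(5, 500000, repair_hours)))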

> The diagnosis engines and sd driver are open source
> :-)
Yeah... all you've got to do is be able to read and comprehend
OS and driver source code in the language it's written in. I'm
working on that. 8-)

> Interesting.  If you have thoughts along this line, fm-discuss or
> driver-discuss can be a better forum than zfs-discuss (ZFS is
> a consumer of time-related failure notifications).
I'll get to that one day. 

Thanks again.

R.G.