On Jan 11, 2010, at 6:35 PM, Paul B. Henson wrote:

> On Mon, 11 Jan 2010, Eric Schrock wrote:
> 
>> No, there is no way to tell if a pool has DTL (dirty time log) entries.
> 
> Hmm, I hadn't heard that term before, but based on a quick search I take it
> that's the list of data in the pool that is not fully redundant? So if a
> 2-way mirror vdev lost a half, everything written after the loss would be
> on the DTL, and if the same device came back, recovery would entail just
> running through the DTL and writing out what it missed? Although presumably
> if the failed device was replaced with another device entirely all of the
> data would need to be written out.
> 
> I'm not quite sure that answered my question. My original question was, for
> example, given a 2-way mirror, one half fails. There is a hot spare
> available, which is pulled in, and while the pool isn't optimal, it does
> have the same number of devices that it's supposed to. On the other hand,
> the same mirror loses a device, there's no hot spare, and the pool is short
> one device. My understanding is that in both scenarios the pool status
> would be "DEGRADED", but it seems there's an important difference. In the
> first case, another device could fail, and the pool would still be ok. In
> the second, another device failing would result in complete loss of data.
> 
> While you can tell the difference between these two different states by
> looking at the detailed output and seeing if a hot spare is in use, I was
> just saying that it would be nice for the short status to have some
> distinction between "device failed, hot spare in use" and "device failed,
> keep fingers crossed" ;).
> 
> Back to your answer, if the existence of DTL entries means the pool doesn't
> have full redundancy for some data, and you can't tell if a pool has DTL
> entries, are you saying there's no way to tell if the current state of your
> pool could survive a device failure? If a resilver successfully completes,
> barring another device failure, doesn't that mean the pool is restored to
> full redundancy? I feel like I must be misunderstanding something :(.

DTLs are a more specific answer to your question.  A DTL entry means that a
toplevel vdev has a known window of time for which it, or one of its children,
holds invalid data.  This may be because a device failed and is accumulating
DTL time, because a new replacing or spare vdev was attached, or because a
device was unplugged and then plugged back in.  Your example (hot spares) is
just one of the ways in which this can happen, but in any of these cases it
implies that some data is not fully replicated.
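
To make that a little more concrete, you can think of a DTL as a per-vdev set
of time (or transaction group) ranges during which writes were missed.  The
sketch below is only a conceptual model to illustrate the idea; the names and
the range handling here are mine, not the actual in-kernel structure:

# Conceptual model of a DTL: a set of txg ranges during which a vdev
# missed writes.  Illustrative only, not the kernel's data structure.

class DTL:
    def __init__(self):
        self.ranges = []          # list of (start_txg, end_txg) tuples

    def add(self, start_txg, end_txg):
        """Record a window of missed writes (e.g. while a disk was offline)."""
        self.ranges.append((start_txg, end_txg))

    def clear(self, start_txg, end_txg):
        """Drop any range fully repaired by a resilver covering [start, end]."""
        self.ranges = [(s, e) for (s, e) in self.ranges
                       if not (start_txg <= s and e <= end_txg)]

    def empty(self):
        """True when the vdev holds a fully up-to-date copy of its data."""
        return not self.ranges

# A pool is fully redundant only when every vdev's DTL is empty.
def pool_fully_redundant(vdev_dtls):
    return all(dtl.empty() for dtl in vdev_dtls)

A resilver, in this model, is just walking the recorded ranges, copying the
missed data back, and clearing them as it goes.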

There is obviously a way to detect this in the kernel; it's simply not exported
to userland in any useful way.  The reason I focused on DTLs is that if any
mechanism were provided to distinguish a pool lacking full redundancy, it would
have to be based on DTLs - nothing else makes sense.
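
In the meantime, the closest you can get from userland is to scrape
'zpool status' and infer the state from its human-readable output.  A rough
sketch along those lines is below; note that the strings it matches (DEGRADED,
INUSE, "resilver in progress") are assumptions about the current output format,
not a stable interface:

#!/usr/bin/env python
# Rough userland approximation: parse `zpool status <pool>` output to guess
# whether the pool is short a device or just running on a hot spare.
# The strings matched here are assumptions about the human-readable output.

import subprocess
import sys

def pool_redundancy_hint(pool):
    out = subprocess.Popen(["zpool", "status", pool],
                           stdout=subprocess.PIPE).communicate()[0]
    if not isinstance(out, str):
        out = out.decode("utf-8", "replace")

    degraded = "DEGRADED" in out
    spare_in = "INUSE" in out                 # a hot spare has been pulled in
    resilver = "resilver in progress" in out  # missed data still being repaired

    if not degraded and not resilver:
        return "looks fully redundant (as far as zpool status can tell)"
    if resilver:
        return "resilvering: some data is not yet fully replicated"
    if spare_in:
        return "degraded, but a hot spare is covering the failed device"
    return "degraded with no spare: another failure may lose data"

if __name__ == "__main__":
    print(pool_redundancy_hint(sys.argv[1] if len(sys.argv) > 1 else "tank"))

It's only a heuristic, of course, which is exactly why a real answer would have
to come from the DTLs.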

- Eric

> 
> Thanks...
> 
> 
> -- 
> Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
> Operating Systems and Network Analyst  |  hen...@csupomona.edu
> California State Polytechnic University  |  Pomona CA 91768

--
Eric Schrock, Fishworks                        http://blogs.sun.com/eschrock



