On Wed, Apr 15, 2009 at 10:32:13PM +0800, Uwe Dippel wrote:
  
> status: One or more devices has experienced an unrecoverable error.  An
>   attempt was made to correct the error.  Applications are unaffected.
...
> errors: No known data errors
> 
> Now I wonder where that error came from. It was just a single checksum 

Hmmm, about two weeks ago I also had a curious thing with a StorEdge 3510
(2x2Gbps FC MP, 1 controller, 2x6 HDDs mirrored and exported as a
single device, no ZIL etc. tricks) connected to an X4600:

Since grill party season has started, the 3510 decided at a room temperature
of 33°C to go "offline" and join the party ;-). The result was that during
the offline time everything that tried to access a ZFS on that pool blocked
(i.e. got no timeout or error) - from that point of view more or less
expected. After the 3510 came back, a 'zpool status ...' showed
something like this:

        NAME                                     STATE    READ WRITE CKSUM
        pool2                                    FAULTED  289K 4.03M 0
          c4t600C0FF000000000099C790E0144EC00d0  FAULTED  289K 4.03M 0  too many errors

errors: Permanent errors have been detected in the following files:

        pool2/home/stud/inf/foobar:<0x0>

Still, everything kept blocking. After a 'zpool clear' all ZFS (~ 2300 on
that pool) except the listed one were accessible, but the status message
stayed unchanged. Curious, since I thought that blocking/waiting for the
device to come back and the ZFS transaction machinery are made exactly for
a situation like this, i.e. re-committing un-ACKed actions ...
Anyway, finally scrubbing the pool brought it back to a normal ONLINE state
without any errors. To be sure I compared the ZFS in question with the
backup from some hours earlier - no difference. So, the same question as in
the subject.
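
For reference, the recovery sequence was roughly this (the backup compare
and the mountpoint path are just how I'd sketch it, not verbatim from my
shell history):

        # see what ZFS considers broken
        zpool status -v pool2
        # resume I/O to the pool; clears the error counters / FAULTED state
        zpool clear pool2
        # re-verify all checksums; afterwards the pool showed ONLINE again
        zpool scrub pool2
        zpool status -v pool2
        # compare the suspect ZFS with the backup (path is a guess, assuming
        # the dataset is mounted under /pool2)
        diff -r /pool2/home/stud/inf/foobar /backup/home/stud/inf/foobar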

BTW: Some days later we had an even bigger grill party (~ 38°C) - this
time the X4xxx machines in this room decided to go offline and join in
as well (the v4xx's kept running ;-)).
So first the 3510 went down and some time later the X4600. This time the
pool was in DEGRADED state after coming back online, had some more errors
like the one above, and:

        <metadata>:<0x103>
        <metadata>:<0x4007>
                ...

Clearing and scrubbing again brought it back to a normal ONLINE state
without any errors. A spot check on the files noted as having errors showed
no damage ...

Everything turned out fine (wrt. data loss), but curious ...

Regards,
jel.
-- 
Otto-von-Guericke University     http://www.cs.uni-magdeburg.de/
Department of Computer Science   Geb. 29 R 027, Universitaetsplatz 2
39106 Magdeburg, Germany         Tel: +49 391 67 12768