Ok. I think I answered my own question. ZFS _didn't_ realize that the
disk was bad/stale. I power-cycled the failed drive (external) to see if
it would come back up and/or run diagnostics on it. As soon as I did
that, ZFS put the disk ONLINE and started using it again! Observe:
bash-3.00# zpool status
  pool: pool1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        pool1        ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            c0t9d0   ONLINE       0     0     0
            c0t10d0  ONLINE       0     0     0
            c0t11d0  ONLINE       0     0     0
            c0t12d0  ONLINE       0     0     0
            c2t0d0   ONLINE       0     0     0
            c2t1d0   ONLINE       0     0     0
            c2t2d0   ONLINE   2.11K 20.09     0

errors: No known data errors
Now I _really_ have a problem. I can't offline the disk myself:
bash-3.00# zpool offline pool1 c2t2d0
cannot offline c2t2d0: no valid replicas
I don't understand why, as 'zpool status' says all the other drives are OK.
What's worse, if I just power off the drive in question (trying to get
back to where I started), the zpool hangs completely! I let it go for
about seven minutes, thinking maybe there was some timeout, but still
nothing. Any command that touches the pool (including 'zpool status')
hangs. The only fix is to power the external disk back on, at which
point everything resumes as if nothing had happened.

Nothing gets logged other than lots of these, and only while the drive
is powered off:
Feb 12 11:49:32 maxwell scsi: [ID 107833 kern.warning] WARNING:
/[EMAIL PROTECTED],4000/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd32):
Feb 12 11:49:32 maxwell     disk not responding to selection
Feb 12 11:49:32 maxwell scsi: [ID 107833 kern.warning] WARNING:
/[EMAIL PROTECTED],4000/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd32):
Feb 12 11:49:32 maxwell     offline or reservation conflict
Feb 12 11:49:32 maxwell scsi: [ID 107833 kern.warning] WARNING:
/[EMAIL PROTECTED],4000/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd32):
Feb 12 11:49:32 maxwell     i/o to invalid geometry
What's going on here? What can I do to make ZFS let go of the bad drive?
This is a production machine and I'm getting concerned. I _really_ don't
like the fact that ZFS is using a suspect drive, but I can't seem to
make it stop!
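For reference, here is the recovery sequence I'm considering, based on the
'action' text from zpool status. This is only a sketch: the replacement
device name c2t3d0 is hypothetical (substitute your actual spare), and the
commands are echoed as a dry run so nothing touches a live pool until the
'echo' prefixes are removed.

```shell
#!/bin/sh
# Hedged sketch of the recovery path suggested by the zpool 'action' text.
# c2t3d0 is a hypothetical replacement disk; the 'echo' prefix makes this
# a dry run -- remove it to actually run the commands on a ZFS host.

POOL=pool1
BAD=c2t2d0
NEW=c2t3d0   # hypothetical replacement device, adjust to your hardware

# Swap the suspect disk for a fresh one; ZFS resilvers onto the new device.
echo zpool replace "$POOL" "$BAD" "$NEW"

# Alternatively, if the disk checks out healthy after diagnostics,
# reset the error counters instead of replacing it.
echo zpool clear "$POOL" "$BAD"

# Watch resilver progress and error counts either way.
echo zpool status -v "$POOL"
```

Either path should get ZFS off the suspect drive without leaving the raidz1
running on stale data, which a bare 'zpool online' would not.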
Thanks,
-Brian
Brian H. Nelson wrote:
> This is Solaris 10U3 w/127111-05.
>
> It appears that one of the disks in my zpool died yesterday. I got
> several SCSI errors finally ending with 'device not responding to
> selection'. That seems to be all well and good. ZFS figured it out and
> the pool is degraded:
>
> maxwell /var/adm >zpool status
>   pool: pool1
>  state: DEGRADED
> status: One or more devices could not be opened.  Sufficient replicas
>         exist for the pool to continue functioning in a degraded state.
> action: Attach the missing device and online it using 'zpool online'.
>    see: http://www.sun.com/msg/ZFS-8000-D3
>  scrub: none requested
> config:
>
>         NAME         STATE     READ WRITE CKSUM
>         pool1        DEGRADED     0     0     0
>           raidz1     DEGRADED     0     0     0
>             c0t9d0   ONLINE       0     0     0
>             c0t10d0  ONLINE       0     0     0
>             c0t11d0  ONLINE       0     0     0
>             c0t12d0  ONLINE       0     0     0
>             c2t0d0   ONLINE       0     0     0
>             c2t1d0   ONLINE       0     0     0
>             c2t2d0   UNAVAIL  1.88K 17.98     0  cannot open
>
> errors: No known data errors
>
>
> My question is why does ZFS keep attempting to open the dead device? At
> least that's what I assume is happening. About every minute, I get eight
> of these entries in the messages log:
>
> Feb 12 10:15:54 maxwell scsi: [ID 107833 kern.warning] WARNING:
> /[EMAIL PROTECTED],4000/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd32):
> Feb 12 10:15:54 maxwell     disk not responding to selection
>
> I also got a number of these thrown in for good measure:
>
> Feb 11 22:21:58 maxwell scsi: [ID 107833 kern.warning] WARNING:
> /[EMAIL PROTECTED],4000/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd32):
> Feb 11 22:21:58 maxwell     SYNCHRONIZE CACHE command failed (5)
>
>
> Since the disk died last night (at about 11:20pm EST), I now have over
> 15,000 similar entries in my log. What gives? Is this expected behavior?
> If ZFS knows the device is having problems, why does it not just leave
> it alone and wait for user intervention?
>
> Also, I noticed that the 'action' says to attach the device and 'zpool
> online' it. Am I correct in assuming that a 'zpool replace' is what
> would really be needed, as the data on the disk will be outdated?
>
> Thanks,
> -Brian
>
>
--
---------------------------------------------------
Brian H. Nelson Youngstown State University
System Administrator Media and Academic Computing
bnelson[at]cis.ysu.edu
---------------------------------------------------
_______________________________________________
zfs-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss