Re: [zfs-discuss] Need help with a dead disk
[EMAIL PROTECTED] said:
> One thought I had was to unconfigure the bad disk with cfgadm. Would that
> force the system back into the 'offline' response?

In my experience (X4100 internal drive), that will make ZFS stop trying to use it. It's also a good idea to do this before you hot-unplug the bad drive to replace it with a new one.

Regards,
Marion

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Need help with a dead disk
Here's a bit more info.

The drive appears to have failed at 22:19 EST, but it wasn't until 1:30 EST the next day that the system finally decided that it was bad. (Why?) Here's some relevant log stuff (with lots of repeated 'device not responding' errors removed). I don't know if it will be useful:

Feb 11 22:19:09 maxwell scsi: [ID 107833 kern.warning] WARNING: /[EMAIL PROTECTED],4000/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd32):
Feb 11 22:19:09 maxwell     SCSI transport failed: reason 'incomplete': retrying command
Feb 11 22:19:10 maxwell scsi: [ID 107833 kern.warning] WARNING: /[EMAIL PROTECTED],4000/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd32):
Feb 11 22:19:10 maxwell     disk not responding to selection
...
Feb 11 22:21:08 maxwell scsi: [ID 107833 kern.warning] WARNING: /[EMAIL PROTECTED],4000/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED] (isp0):
Feb 11 22:21:08 maxwell     SCSI Cable/Connection problem.
Feb 11 22:21:08 maxwell scsi: [ID 107833 kern.notice]     Hardware/Firmware error.
Feb 11 22:21:08 maxwell scsi: [ID 107833 kern.warning] WARNING: /[EMAIL PROTECTED],4000/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED] (isp0):
Feb 11 22:21:08 maxwell     Fatal error, resetting interface, flg 16
...
(Why did this take so long?)

Feb 12 01:30:05 maxwell scsi: [ID 107833 kern.warning] WARNING: /[EMAIL PROTECTED],4000/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd32):
Feb 12 01:30:05 maxwell     offline
...
Feb 12 01:30:22 maxwell fmd: [ID 441519 daemon.error] SUNW-MSG-ID: ZFS-8000-D3, TYPE: Fault, VER: 1, SEVERITY: Major
Feb 12 01:30:22 maxwell EVENT-TIME: Tue Feb 12 01:30:22 EST 2008
Feb 12 01:30:22 maxwell PLATFORM: SUNW,Ultra-250, CSN: -, HOSTNAME: maxwell
Feb 12 01:30:22 maxwell SOURCE: zfs-diagnosis, REV: 1.0
Feb 12 01:30:22 maxwell EVENT-ID: 7f48f376-2eb1-ccaf-afc5-e56f5bf4576f
Feb 12 01:30:22 maxwell DESC: A ZFS device failed. Refer to http://sun.com/msg/ZFS-8000-D3 for more information.
Feb 12 01:30:22 maxwell AUTO-RESPONSE: No automated response will occur.
Feb 12 01:30:22 maxwell IMPACT: Fault tolerance of the pool may be compromised.
Feb 12 01:30:22 maxwell REC-ACTION: Run 'zpool status -x' and replace the bad device.

One thought I had was to unconfigure the bad disk with cfgadm. Would that force the system back into the 'offline' response?

Thanks,
-Brian
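For reference, the delay between the first transport error and the 'offline' message in the posted log can be worked out with a few lines of Python. This is only a sketch: the two timestamps are copied from the log excerpt, the year 2008 is taken from the fmd EVENT-TIME line, and the helper name `parse_syslog_time` is made up for illustration.

```python
from datetime import datetime


def parse_syslog_time(stamp: str, year: int = 2008) -> datetime:
    """Parse a syslog 'Mon DD HH:MM:SS' stamp.

    Syslog lines carry no year, so it is passed in separately
    (2008 here, an assumption based on the fmd EVENT-TIME line).
    """
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


# Timestamps copied from the posted log excerpt.
first_error = parse_syslog_time("Feb 11 22:19:09")   # 'SCSI transport failed'
went_offline = parse_syslog_time("Feb 12 01:30:05")  # sd32 reported 'offline'

gap = went_offline - first_error
print(gap)  # 3:10:56
```

So the system kept retrying for a little over three hours before the sd driver declared the disk offline and fmd raised the ZFS-8000-D3 fault.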
[zfs-discuss] Need help with a dead disk (was: ZFS keeps trying to open a dead disk: lots of logging)
Ok. I think I answered my own question. ZFS _didn't_ realize that the disk was bad/stale.

I power-cycled the failed drive (external) to see if it would come back up and/or run diagnostics on it. As soon as I did that, ZFS put the disk ONLINE and started using it again! Observe:

bash-3.00# zpool status
  pool: pool1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        pool1        ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            c0t9d0   ONLINE       0     0     0
            c0t10d0  ONLINE       0     0     0
            c0t11d0  ONLINE       0     0     0
            c0t12d0  ONLINE       0     0     0
            c2t0d0   ONLINE       0     0     0
            c2t1d0   ONLINE       0     0     0
            c2t2d0   ONLINE   2.11K 20.09     0

errors: No known data errors

Now I _really_ have a problem. I can't offline the disk myself:

bash-3.00# zpool offline pool1 c2t2d0
cannot offline c2t2d0: no valid replicas

I don't understand why, as 'zpool status' says all the other drives are OK.

What's worse, if I just power off the drive in question (trying to get back to where I started) the zpool hangs completely! I let it go for about 7 minutes thinking maybe there was some timeout, but still nothing. Any command that would access the zpool (including 'zpool status') hangs. The only way to fix it is to power the external disk back on, upon which everything starts working like nothing has happened. Nothing gets logged other than lots of these, only while the drive is powered off:

Feb 12 11:49:32 maxwell scsi: [ID 107833 kern.warning] WARNING: /[EMAIL PROTECTED],4000/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd32):
Feb 12 11:49:32 maxwell     disk not responding to selection
Feb 12 11:49:32 maxwell scsi: [ID 107833 kern.warning] WARNING: /[EMAIL PROTECTED],4000/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd32):
Feb 12 11:49:32 maxwell     offline or reservation conflict
Feb 12 11:49:32 maxwell scsi: [ID 107833 kern.warning] WARNING: /[EMAIL PROTECTED],4000/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd32):
Feb 12 11:49:32 maxwell     i/o to invalid geometry

What's going on here? What can I do to make ZFS let go of the bad drive? This is a production machine and I'm getting concerned. I _really_ don't like the fact that ZFS is using a suspect drive, but I can't seem to make it stop!

Thanks,
-Brian

Brian H. Nelson wrote:
> This is Solaris 10U3 w/127111-05.
>
> It appears that one of the disks in my zpool died yesterday. I got
> several SCSI errors finally ending with 'device not responding to
> selection'. That seems to be all well and good. ZFS figured it out and
> the pool is degraded:
>
> maxwell /var/adm zpool status
>   pool: pool1
>  state: DEGRADED
> status: One or more devices could not be opened.  Sufficient replicas exist for
>         the pool to continue functioning in a degraded state.
> action: Attach the missing device and online it using 'zpool online'.
>    see: http://www.sun.com/msg/ZFS-8000-D3
>  scrub: none requested
> config:
>
>         NAME         STATE     READ WRITE CKSUM
>         pool1        DEGRADED     0     0     0
>           raidz1     DEGRADED     0     0     0
>             c0t9d0   ONLINE       0     0     0
>             c0t10d0  ONLINE       0     0     0
>             c0t11d0  ONLINE       0     0     0
>             c0t12d0  ONLINE       0     0     0
>             c2t0d0   ONLINE       0     0     0
>             c2t1d0   ONLINE       0     0     0
>             c2t2d0   UNAVAIL  1.88K 17.98     0  cannot open
>
> errors: No known data errors
>
> My question is why does ZFS keep attempting to open the dead device? At
> least that's what I assume is happening. About every minute, I get eight
> of these entries in the messages log:
>
> Feb 12 10:15:54 maxwell scsi: [ID 107833 kern.warning] WARNING: /[EMAIL PROTECTED],4000/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd32):
> Feb 12 10:15:54 maxwell     disk not responding to selection
>
> I also got a number of these thrown in for good measure:
>
> Feb 11 22:21:58 maxwell scsi: [ID 107833 kern.warning] WARNING: /[EMAIL PROTECTED],4000/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd32):
> Feb 11 22:21:58 maxwell     SYNCHRONIZE CACHE command failed (5)
>
> Since the disk died last night (at about 11:20pm EST) I now have over
> 15K of similar entries in my log. What gives? Is this expected behavior?
> If ZFS knows the device is having problems, why does it not just leave it
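As an aside on reading the 'zpool status' output in this message: the pool-level state can say ONLINE while one device still carries large error counters, which is easy to miss by eye. The sketch below flags such rows. It assumes the column layout shown in the post (NAME STATE READ WRITE CKSUM), treats the counters as opaque strings because values like '2.11K' are abbreviated, and uses a made-up function name, `suspect_devices`; it is not part of any ZFS tooling.

```python
def suspect_devices(status_text):
    """Return config rows that are not ONLINE or have a non-zero counter.

    Each result is a (name, state, read, write, cksum) tuple of strings.
    Counters are compared as text because zpool abbreviates large values.
    """
    suspects = []
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True          # device rows start after the header
            continue
        if not in_config or not fields:
            continue
        if fields[0] == "errors:":
            break                     # end of the config table
        if len(fields) < 5:
            continue
        name, state, read, write, cksum = fields[:5]
        if state != "ONLINE" or any(c != "0" for c in (read, write, cksum)):
            suspects.append((name, state, read, write, cksum))
    return suspects


# Abbreviated copy of the second status output from the post.
SAMPLE = """\
  pool: pool1
 state: ONLINE
config:

        NAME         STATE     READ WRITE CKSUM
        pool1        ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            c2t1d0   ONLINE       0     0     0
            c2t2d0   ONLINE   2.11K 20.09     0

errors: No known data errors
"""

for row in suspect_devices(SAMPLE):
    print(row)
```

On the sample, only c2t2d0 is reported, which matches the drive that was power-cycled.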
Re: [zfs-discuss] Need help with a dead disk
Hmm... this won't help you, but I think I'm having similar problems with an iSCSI target device. If I offline the target, zfs hangs for just over 5 minutes before it realises the device is unavailable, and even then it doesn't report the problem until I repeat the zpool status command.

What I see here every time is:
- iSCSI device disconnected
- zpool status, and all file i/o appears to hang for 5 mins
- zpool status then finishes (reporting pools ok), and i/o carries on.
- Immediately running zpool status again correctly shows the device as faulty and the pool as degraded.

It seems either ZFS or the Solaris driver stack has a problem when devices go offline. Both of us have seen zpool status hang for huge amounts of time when there's a problem with a drive. Not something that inspires confidence in a raid system.

This message posted from opensolaris.org
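Since both reports involve 'zpool status' hanging for minutes, a monitoring script that calls it may want its own time limit so the script itself does not stall. The following is a minimal sketch using Python's standard `subprocess` module. The function name `run_with_timeout` and the 30-second figure mentioned in the comment are assumptions for illustration, and a command blocked inside the kernel may not respond to the kill at all, so this only protects the caller, not the pool.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its combined output, or None on timeout."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    try:
        out, _ = proc.communicate(timeout=timeout_s)
        return out
    except subprocess.TimeoutExpired:
        proc.kill()  # best effort; a process stuck in the kernel may ignore it
        try:
            proc.wait(timeout=1)  # reap the child if it did exit
        except subprocess.TimeoutExpired:
            pass  # give up rather than block the caller
        return None


# Stand-in for a hung command so the sketch runs anywhere. On the affected
# system the call would be something like:
#     run_with_timeout(["zpool", "status", "-x"], 30)
slow = [sys.executable, "-c", "import time; time.sleep(5)"]
print(run_with_timeout(slow, 0.5))  # None
```

A None result here means only that the command did not answer in time; it says nothing about whether the pool is healthy.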