We recently installed a 24-disk SATA array with an LSI controller, attached to a box running Solaris 10 x86 Update 4. The drives were set up in one big raidz pool, and it worked great for about a month. On the 4th, the system kernel panicked and crashed, and it's now behaving very badly. Here's the diagnostic data I've been able to collect so far, starting with a sketch of how the pool was built.
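For context, the pool was created with something along these lines. I don't have the exact command in front of me, so the controller/target numbers below are made up, but the shape (a single raidz set across all 24 drives on the one LSI controller) is right:

    # zpool create LogData raidz \
        c2t0d0  c2t1d0  c2t2d0  c2t3d0  c2t4d0  c2t5d0  \
        c2t6d0  c2t7d0  c2t8d0  c2t9d0  c2t10d0 c2t11d0 \
        c2t12d0 c2t13d0 c2t14d0 c2t15d0 c2t16d0 c2t17d0 \
        c2t18d0 c2t19d0 c2t20d0 c2t21d0 c2t22d0 c2t23d0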
In the messages file:

    Nov 4 13:24:11 mondo4 savecore: [ID 570001 auth.error] reboot after panic: ZFS: I/O failure (write on <unknown> off 0: zio ffffffff97c86a00 [L0 DMU dnode] 4000L/1000P DVA[0]=<0:d08cf11b800:1800> DVA[1]=<0:1020a711c800:1800> fletcher4 lzjb LE contiguous birth=731555 fill=32
    Nov 4 13:24:06 mondo4 savecore: [ID 748169 auth.error] saving system crash dump in /var/crash/mondo4/*.0

And yes, we've got the core files.

The box came back up and seemed to run okay for a couple of days, but we noticed today that things were very, very odd: doing a df on the filesystem hung, and ls would hang on the local box as well. Looking at the output of dmesg, we see a lot of messages that look like:

    Nov 8 03:58:22 mondo4 scsi: [ID 107833 kern.notice] Requested Block: 1450319385  Error Block: 1450319385
    Nov 8 03:58:22 mondo4 scsi: [ID 107833 kern.notice] Vendor: ATA  Serial Number:
    Nov 8 03:58:22 mondo4 scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
    Nov 8 03:58:22 mondo4 scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
    Nov 8 04:13:59 mondo4 scsi: [ID 107833 kern.notice] Requested Block: 1450487074  Error Block: 1450487074
    Nov 8 04:13:59 mondo4 scsi: [ID 107833 kern.notice] Vendor: ATA  Serial Number:
    Nov 8 04:13:59 mondo4 scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
    Nov 8 04:13:59 mondo4 scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0

Finally, trying to do a zpool status yields:

    [EMAIL PROTECTED]:/# zpool status -v
      pool: LogData
     state: ONLINE
    status: One or more devices has experienced an unrecoverable error.  An
            attempt was made to correct the error.  Applications are unaffected.
    action: Determine if the device needs to be replaced, and clear the errors
            using 'zpool clear' or replace the device with 'zpool replace'.
       see: http://www.sun.com/msg/ZFS-8000-9P
     scrub: none requested

At which point the shell hangs and cannot be control-C'd.

Any thoughts on how to proceed? I'm guessing we have a bad disk, but I'm not sure. Anything you can recommend to diagnose this would be welcome.

--Mike
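P.S. In case it helps frame suggestions: once the box will take commands again, this is roughly the order I was planning to poke at it in. The pool name is real; the rest is my best reading of the man pages, so correct me if any of these is the wrong move on a pool in this state.

    iostat -En            # per-device soft/hard/transport error counters and
                          # drive serials, to spot which disk keeps resetting
    fmdump -eV | more     # raw FMA error reports underneath the ZFS I/O failures
    fmadm faulty          # has FMA already diagnosed a faulty component?
    zpool scrub LogData   # verify every block once the hangs are sorted out
    zpool clear LogData   # only after the bad disk is found/replaced, per the
                          # 'action' text in zpool status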