Re: [zfs-discuss] zpool getting in a stuck state?
Hi Jeremy,

I had a loosely similar problem with my 2009.06 box. In my case (which may not be yours), working with support we found a bug that was causing my pool to hang. I also got spurious errors when I did a scrub (3 x 5-disk raidz). I am using the same LSI controller.

A surefire way to kill the box was to set up a file system as an iSCSI target and write a lot of data to it, around 1-2 MB/s. It would usually die inside of a few hours. NFS writing was not as bad, but within a day it would panic there too.

The solution for me was to upgrade to build 124. Since the upgrade three weeks ago, I have had no problems. Again, I don't know if this would fix your problem, but it may be worth a try. Just don't upgrade your ZFS version, and you will be able to roll back to 2009.06 at any time.

-Scott

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool getting in a stuck state?
On Oct 26, 2009, at 4:00 PM, Cindy Swearingen wrote:
> Hi Jeremy,
> Can you use the command below and send me the output, please?
> Thanks, Cindy
> # mdb -k ::stacks -m zfs

Ok, it did it again. I replaced the drive and it's currently resilvering (13.66% done, 135h33m to go, it says), and the output of that command is this:

> ::stacks -m zfs
THREAD        STATE    SOBJ     COUNT
ff02f0078a80  SLEEP    MUTEX    31
              swtch+0x147
              turnstile_block+0x764
              mutex_vector_enter+0x261
              zfs_zget+0x47
              zfs_root+0x57
              fsop_root+0x2e
              traverse+0x61
              lookuppnvp+0x423
              lookuppnat+0x12c
              lookupnameat+0x91
              lookupname+0x28
              chroot+0x30
              _sys_sysenter_post_swapgs+0x14b

ff02efa33500  SLEEP    CV       18
              swtch+0x147
              cv_wait+0x61
              dbuf_read+0x237
              dmu_buf_hold+0x96
              zap_lockdir+0x67
              zap_lookup_norm+0x55
              zap_lookup+0x2d
              zfs_match_find+0xfd
              zfs_dirent_lock+0x3d1
              zfs_dirlook+0xd9
              zfs_lookup+0x104
              fop_lookup+0xed
              lookuppnvp+0x3a3
              lookuppnat+0x12c
              lookupnameat+0x91
              cstatat_getvp+0x164
              cstatat64_32+0x82
              stat64_32+0x31
              _sys_sysenter_post_swapgs+0x14b

ff02da1a91e0  SLEEP    CV       9
              swtch+0x147
              cv_wait+0x61
              zio_wait+0x5d
              dbuf_read+0x1e8
              dmu_buf_hold+0x96
              zap_lockdir+0x67
              zap_lookup_norm+0x55
              zap_lookup+0x2d
              zfs_match_find+0xfd
              zfs_dirent_lock+0x3d1
              zfs_dirlook+0xd9
              zfs_lookup+0x104
              fop_lookup+0xed
              lookuppnvp+0x3a3
              lookuppnat+0x12c
              lookupnameat+0x91
              cstatat_getvp+0x164
              cstatat64_32+0x82
              stat64_32+0x31
              _sys_sysenter_post_swapgs+0x14b

ff02d8c46ac0  SLEEP    CV       7
              swtch+0x147
              cv_wait+0x61
              zio_wait+0x5d
              dbuf_read+0x1e8
              dbuf_findbp+0xe7
              dbuf_hold_impl+0x81
              dbuf_findbp+0xcf
              dbuf_hold_impl+0x81
              dbuf_hold+0x2e
              dnode_hold_impl+0xb5
              dnode_hold+0x2b
              dmu_bonus_hold+0x36
              zfs_zget+0x5a
              zfs_root+0x57
              fsop_root+0x2e
              traverse+0x61
              lookuppnvp+0x423
              lookuppnat+0x12c
              lookupnameat+0x91
              lookupname+0x28
              chroot+0x30
              _sys_sysenter_post_swapgs+0x14b

ff02da1a2720  SLEEP    CV       6
              swtch+0x147
              cv_wait+0x61
              txg_wait_open+0x7a
              dmu_tx_wait+0xb3
              dmu_tx_assign+0x4b
              zfs_inactive+0xa8
              fop_inactive+0xaf
              vn_rele+0x5f
              closef+0x75
              closeandsetf+0x44a
              close+0x18
              _sys_sysenter_post_swapgs+0x14b

ff000f61bc60  SLEEP    CV       5
              swtch+0x147
              cv_wait+0x61
              txg_thread_wait+0x5f
              txg_quiesce_thread+0x94
              thread_start+8

ff02d8514aa0  SLEEP    CV       5
              swtch+0x147
              cv_wait+0x61
              zio_wait+0x5d
              dbuf_read+0x1e8
              dbuf_findbp+0xe7
              dbuf_hold_impl+0x81
              dbuf_findbp+0xcf
              dbuf_hold_impl+0x81
              dbuf_findbp+0xcf
              dbuf_hold_impl+0x81
              dbuf_hold+0x2e
              dnode_hold_impl+0xb5
              dnode_hold+0x2b
              dmu_bonus_hold+0x36
              zfs_zget+0x5a
              zfs_root+0x57
              fsop_root+0x2e
              traverse+0x61
              lookuppnvp+0x423
              lookuppnat+0x12c
              lookupnameat+0x91
              lookupname+0x28
              chroot+0x30
              _sys_sysenter_post_swapgs+0x14b

ff02d89b4c20  SLEEP    CV       3
              swtch+0x147
              cv_wait+0x61
              txg_wait_synced+0x7f
              dmu_tx_wait+0xcd
              zfs_create+0x44d
              fop_create+0xfc
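Buckets like these can be summarized mechanically once the ::stacks text has been captured to a file or string. The sketch below is a hypothetical helper (not from the original thread); it assumes the conventional layout shown above, with a header line per bucket followed by one frame per line, and tallies blocked threads by the first ZFS-related frame:

```python
import re

def summarize_stacks(text):
    """Tally blocked threads per ZFS wait point from '::stacks' output.

    Expects one header line per bucket (THREAD STATE SOBJ COUNT)
    followed by one stack frame per line.
    """
    summary = {}
    count, frames = None, []

    def flush():
        # Attribute the bucket's thread count to its first ZFS frame.
        if count is None or not frames:
            return
        zfs_prefixes = ('zfs_', 'zio_', 'txg_', 'dbuf_', 'dmu_', 'zap_')
        wait = next((f for f in frames if f.startswith(zfs_prefixes)), frames[0])
        name = wait.split('+')[0]
        summary[name] = summary.get(name, 0) + count

    header = re.compile(r'\s*[0-9a-f]{8,16}\s+\w+\s+\w+\s+(\d+)\s*$')
    for line in text.splitlines():
        m = header.match(line)
        if m:
            flush()
            count, frames = int(m.group(1)), []
        elif line.strip():
            frames.append(line.strip())
    flush()
    return summary

sample = """\
ff02da1a2720     SLEEP    CV    6
                 swtch+0x147
                 cv_wait+0x61
                 txg_wait_open+0x7a
                 dmu_tx_wait+0xb3
ff02d89b4c20     SLEEP    CV    3
                 swtch+0x147
                 cv_wait+0x61
                 txg_wait_synced+0x7f
"""
print(summarize_stacks(sample))  # {'txg_wait_open': 6, 'txg_wait_synced': 3}
```

Run against the full dump above, a pile-up under txg_wait_open/txg_wait_synced is consistent with a transaction group that can never sync, which is what a hung device produces.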
Re: [zfs-discuss] zpool getting in a stuck state?
Jeremy,

I generally suspect device failures in a case like this. If possible, review the contents of /var/adm/messages and fmdump -eV to see if the pool hang could be attributed to failed or failing devices.

Cindy

On 10/26/09 17:28, Jeremy Kitchen wrote:
> Cindy Swearingen wrote:
>> Hi Jeremy,
>> Can you use the command below and send me the output, please?
>> Thanks, Cindy
>> # mdb -k ::stacks -m zfs
> ack! it *just* fully died. I've had our noc folks reset the machine and I will get this info to you as soon as it happens again (I'm fairly certain it will, if not on this specific machine, then on one of our other machines!)
> -Jeremy
Re: [zfs-discuss] zpool getting in a stuck state?
Cindy Swearingen wrote:
> Jeremy, I generally suspect device failures in this case and if possible, review the contents of /var/adm/messages and fmdump -eV to see if the pool hang could be attributed to failed or failing devices.

Perusing /var/adm/messages, I see:

Oct 22 05:06:11 homiebackup10 scsi: [ID 365881 kern.info] /p...@0,0/pci8086,4...@1/pci1000,3...@0 (mpt1):
Oct 22 05:06:11 homiebackup10   Log info 0x3108 received for target 5.
Oct 22 05:06:11 homiebackup10   scsi_status=0x0, ioc_status=0x804b, scsi_state=0x0
Oct 22 05:06:19 homiebackup10 scsi: [ID 365881 kern.info] /p...@0,0/pci8086,4...@1/pci1000,3...@0 (mpt1):
Oct 22 05:06:19 homiebackup10   Log info 0x3108 received for target 5.
Oct 22 05:06:19 homiebackup10   scsi_status=0x0, ioc_status=0x804b, scsi_state=0x1
Oct 22 05:06:19 homiebackup10 scsi: [ID 365881 kern.info] /p...@0,0/pci8086,4...@1/pci1000,3...@0 (mpt1):
Oct 22 05:06:19 homiebackup10   Log info 0x3108 received for target 5.
Oct 22 05:06:19 homiebackup10   scsi_status=0x0, ioc_status=0x804b, scsi_state=0x0

Lots of messages like that, just prior to rsync warnings:

Oct 22 05:55:29 homiebackup10 rsyncd[29746]: [ID 702911 daemon.warning] rsync: connection unexpectedly closed (0 bytes received so far) [receiver]
Oct 22 05:55:29 homiebackup10 rsyncd[29746]: [ID 702911 daemon.warning] rsync error: error in rsync protocol data stream (code 12) at io.c(453) [receiver=2.6.9]
Oct 22 06:10:29 homiebackup10 rsyncd[178]: [ID 702911 daemon.warning] rsync: connection unexpectedly closed (0 bytes received so far) [receiver]
Oct 22 06:10:29 homiebackup10 rsyncd[178]: [ID 702911 daemon.warning] rsync error: error in rsync protocol data stream (code 12) at io.c(453) [receiver=2.6.9]
Oct 22 06:25:27 homiebackup10 rsyncd[776]: [ID 702911 daemon.warning] rsync: connection unexpectedly closed (0 bytes received so far) [receiver]

I think the rsync warnings are indicative of the pool being hung. So it would seem that the bus is freaking out, and then the pool dies, and that's that?
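One quick way to check whether the mpt warnings cluster on a single SCSI target is to tally them. This is a hypothetical sketch (not part of the original thread), assuming the /var/adm/messages lines have been read in as strings in the format excerpted above:

```python
import re
from collections import Counter

def mpt_targets(lines):
    """Count mpt 'Log info ... received for target N' events per target.

    A target that dominates the tally is the one the controller
    keeps retrying before the pool hangs.
    """
    pat = re.compile(r'Log info \S+ received for target (\d+)')
    hits = Counter()
    for line in lines:
        m = pat.search(line)
        if m:
            hits[m.group(1)] += 1
    return hits

sample = [
    "Oct 22 05:06:11 homiebackup10 Log info 0x3108 received for target 5.",
    "Oct 22 05:06:19 homiebackup10 Log info 0x3108 received for target 5.",
    "Oct 22 05:55:29 homiebackup10 rsyncd[29746]: rsync: connection unexpectedly closed",
]
print(mpt_targets(sample))  # Counter({'5': 2})
```

If one target accounts for nearly all of the events, that points at a single drive (or its slot/cabling) rather than the whole bus.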
The strange thing is that this machine is way underloaded compared to another one we have (which has 5 shelves, so ~150TB of storage attached), which hasn't really had any problems like this. We had issues with that one when rebuilding drives, but it's been pretty stable since.

Looking at fmdump -eV, I see lots and lots of these:

Oct 24 2009 05:02:54.098815545 ereport.io.scsi.cmd.disk.tran
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.tran
        ena = 0x882108543f200401
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /p...@0,0/pci8086,4...@5/pci1000,3...@0/s...@30,0
        (end detector)
        driver-assessment = retry
        op-code = 0x28
        cdb = 0x28 0x0 0x51 0x9c 0xa5 0x80 0x0 0x0 0x80 0x0
        pkt-reason = 0x4
        pkt-state = 0x0
        pkt-stats = 0x10
        __ttl = 0x1
        __tod = 0x4ae2ecee 0x5e3ce39

always with the same device name. So it would appear that the drive at that location is probably broken, and zfs just isn't detecting it properly?

Also, I'm wondering if this is related to the recent thread titled "[zfs-discuss] SNV_125 MPT warning in logfile", as we're using the same controller that person mentions. We're going to order some beefier controllers with the next shipment; any suggestions on what to get? If we find that the new controllers work much better, we may even go as far as replacing the ones in the existing machines (or at least any machines experiencing these issues). We're not married to LSI, but we use LSI controllers in our webservers for the most part and they're pretty solid there (though admittedly those are hardware raid rather than JBOD).

Thanks so much for your help!

-Jeremy
Re: [zfs-discuss] zpool getting in a stuck state?
Jeremy Kitchen wrote:
> Cindy Swearingen wrote:
>> Jeremy, I generally suspect device failures in this case and if possible, review the contents of /var/adm/messages and fmdump -eV to see if the pool hang could be attributed to failed or failing devices.
> [quoted /var/adm/messages and rsync log excerpts trimmed]
So, doing some more reading here on the list and mucking about a bit more, I've come across this in the fmdump log:

Oct 22 2009 05:03:56.687818542 ereport.fs.zfs.io
nvlist version: 0
        class = ereport.fs.zfs.io
        ena = 0x99eb889c6fe1
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0x90ed10dfd0191c3b
                vdev = 0xf41193d6d1deedc2
        (end detector)
        pool = raid3155
        pool_guid = 0x90ed10dfd0191c3b
        pool_context = 0
        pool_failmode = wait
        vdev_guid = 0xf41193d6d1deedc2
        vdev_type = disk
        vdev_path = /dev/dsk/c6t5d0s0
        vdev_devid = id1,s...@n5000c50010a7666b/a
        parent_guid = 0xcbaa8ea60a3c133
        parent_type = raidz
        zio_err = 5
        zio_offset = 0xab2901da00
        zio_size = 0x200
        zio_objset = 0x4b
        zio_object = 0xa26ef4
        zio_level = 0
        zio_blkid = 0xf
        __ttl = 0x1
        __tod = 0x4ae04a2c 0x28ff472e

c6t5d0 is in the problem pool (raid3155), so I've gone ahead and offlined the drive and will be replacing it shortly. Hopefully that will take care of the problem! If it doesn't, do you have any suggestions on what more I can look at to try to figure out what's wrong?
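Ereport dumps like the two above can be tallied to surface the suspect device. The following is a hypothetical sketch (not from the original thread) that walks the flattened fmdump -eV text; the sample paths in it are made-up stand-ins, since the real paths in the archive are truncated:

```python
from collections import Counter

def ereport_tally(fmdump_text):
    """Tally 'fmdump -eV' ereports by (class, device or vdev path).

    Remembers the most recent 'class =' line and attributes each
    'device-path =' or 'vdev_path =' line to it, so one failing
    disk shows up as a single dominant entry.
    """
    tally = Counter()
    cls = None
    for raw in fmdump_text.splitlines():
        line = raw.strip()
        if line.startswith('class ='):
            cls = line.split('=', 1)[1].strip()
        elif line.startswith(('device-path =', 'vdev_path =')) and cls:
            path = line.split('=', 1)[1].strip()
            tally[(cls, path)] += 1
    return tally

# Hypothetical sample; real paths in the archive are truncated.
sample = """\
class = ereport.io.scsi.cmd.disk.tran
    device-path = /pci@0,0/pci1000,3@0/sd@30,0
class = ereport.fs.zfs.io
    vdev_path = /dev/dsk/c6t5d0s0
class = ereport.io.scsi.cmd.disk.tran
    device-path = /pci@0,0/pci1000,3@0/sd@30,0
"""
for (cls, path), n in ereport_tally(sample).most_common():
    print(n, cls, path)
```

A single path dominating both the SCSI transport ereports and the ZFS I/O ereports, as in this thread, is the pattern you would expect from one failing drive rather than a controller-wide fault.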
Is there some sort of setting I can set which will prevent the zpool from hanging up the entire system in the event of a single drive failure like this? It's really annoying to not be able to log into the machine (and to have to forcefully reboot it) when this happens.

Thanks again for your help!

-Jeremy
Re: [zfs-discuss] zpool getting in a stuck state?
Hi Jeremy,

The ereport.io.scsi.cmd.disk.tran errors describe connection problems to the /p...@0,0/pci8086,4...@5/pci1000,3...@0/s...@30,0 device. I think the .tran suffix is for transient.

ZFS might be reporting problems with the device as well, but if the zpool/zfs commands are hanging, then it might be difficult to get this confirmation. The zpool status command will report device problems.

When a device in a pool fails, I/O to the pool is blocked, though reads might still be successful. See the failmode property description in zpool(1M).

Is this pool redundant? If so, you can attempt to offline this device until it is replaced. If you have another device available, you might replace the suspect drive and see if that solves the pool hang problem.

Cindy

On 10/27/09 12:04, Jeremy Kitchen wrote:
> [quoted /var/adm/messages, rsync, and fmdump excerpts trimmed]
Re: [zfs-discuss] zpool getting in a stuck state?
Jeremy,

I can't comment on your hardware because I'm not familiar with it.

If you have a storage pool with ZFS redundancy and one device fails or begins failing, the pool keeps going in a degraded mode but is generally available. You can also try setting the failmode property to continue, which allows reads to continue in the case of a device failure and might prevent the pool from hanging.

If offlining the disk or replacing the disk doesn't help, let us know.

Cindy

On 10/27/09 13:13, Jeremy Kitchen wrote:
> [quoted log and fmdump excerpts trimmed]
Re: [zfs-discuss] zpool getting in a stuck state?
Jeremy Kitchen wrote:
> Hey folks! We're using zfs-based file servers for our backups and we've been having some issues as of late with certain situations causing zfs/zpool commands to hang.

Anyone? This is happening right now, and because we're doing a restore I can't reboot the machine, so it's a prime opportunity to get debugging information if it'll help.

Thanks!

-Jeremy
Re: [zfs-discuss] zpool getting in a stuck state?
Hi Jeremy,

Can you use the command below and send me the output, please?

Thanks, Cindy

# mdb -k ::stacks -m zfs

On 10/26/09 11:58, Jeremy Kitchen wrote:
> Jeremy Kitchen wrote:
>> Hey folks! We're using zfs-based file servers for our backups and we've been having some issues as of late with certain situations causing zfs/zpool commands to hang.
> anyone? this is happening right now and because we're doing a restore I can't reboot the machine, so it's a prime opportunity to get debugging information if it'll help.
> Thanks!
> -Jeremy
Re: [zfs-discuss] zpool getting in a stuck state?
Cindy Swearingen wrote:
> Hi Jeremy,
> Can you use the command below and send me the output, please?
> Thanks, Cindy
> # mdb -k ::stacks -m zfs

ack! it *just* fully died. I've had our NOC folks reset the machine, and I will get this info to you as soon as it happens again (I'm fairly certain it will, if not on this specific machine, then on one of our other machines!)

-Jeremy
[zfs-discuss] zpool getting in a stuck state?
Hey folks!

We're using zfs-based file servers for our backups and we've been having some issues as of late with certain situations causing zfs/zpool commands to hang. Currently, it appears that raid3155 is in this broken state:

r...@homiebackup10:~# ps auxwww | grep zfs
root 15873 0.0 0.0 4216 1236 pts/2 S 15:56:54 0:00 grep zfs
root 13678 0.0 0.1 7516 2176 ?     S 14:18:00 0:00 zfs list -t filesystem raid3155/angels
root 13691 0.0 0.1 7516 2188 ?     S 14:18:04 0:00 zfs list -t filesystem raid3155/blazers
root 13731 0.0 0.1 7516 2200 ?     S 14:18:20 0:00 zfs list -t filesystem raid3155/broncos
root 13792 0.0 0.1 7516 2220 ?     S 14:18:51 0:00 zfs list -t filesystem raid3155/diamondbacks
root 13910 0.0 0.1 7516 2216 ?     S 14:19:52 0:00 zfs list -t filesystem raid3155/knicks
root 13911 0.0 0.1 7516 2196 ?     S 14:19:53 0:00 zfs list -t filesystem raid3155/lions
root 13916 0.0 0.1 7516 2220 ?     S 14:19:55 0:00 zfs list -t filesystem raid3155/magic
root 13933 0.0 0.1 7516 2232 ?     S 14:20:01 0:00 zfs list -t filesystem raid3155/mariners
root 13966 0.0 0.1 7516 2212 ?     S 14:20:11 0:00 zfs list -t filesystem raid3155/mets
root 13971 0.0 0.1 7516 2208 ?     S 14:20:21 0:00 zfs list -t filesystem raid3155/niners
root 13982 0.0 0.1 7516 2220 ?     S 14:20:32 0:00 zfs list -t filesystem raid3155/padres
root 14064 0.0 0.1 7516 2220 ?     S 14:21:03 0:00 zfs list -t filesystem raid3155/redwings
root 14123 0.0 0.1 7516 2212 ?     S 14:21:20 0:00 zfs list -t filesystem raid3155/seahawks
root 14323 0.0 0.1 7420 2184 ?     S 14:22:51 0:00 zfs allow zfsrcv create,mount,receive,share raid3155
root 15245 0.0 0.1 7468 2256 ?     S 15:17:59 0:00 zfs create raid3155/angels
root 15250 0.0 0.1 7468 2244 ?     S 15:18:03 0:00 zfs create raid3155/blazers
root 15256 0.0 0.1 7468 2248 ?     S 15:18:19 0:00 zfs create raid3155/broncos
root 15284 0.0 0.1 7468 2256 ?     S 15:18:51 0:00 zfs create raid3155/diamondbacks
root 15322 0.0 0.1 7468 2260 ?     S 15:19:51 0:00 zfs create raid3155/knicks
root 15332 0.0 0.1 7468 2260 ?     S 15:19:53 0:00 zfs create raid3155/magic
root 15333 0.0 0.1 7468 2236 ?     S 15:19:53 0:00 zfs create raid3155/lions
root 15345 0.0 0.1 7468 2264 ?     S 15:20:01 0:00 zfs create raid3155/mariners
root 15355 0.0 0.1 7468 2260 ?     S 15:20:10 0:00 zfs create raid3155/mets
root 15363 0.0 0.1 7468 2252 ?     S 15:20:20 0:00 zfs create raid3155/niners
root 15368 0.0 0.1 7468 2256 ?     S 15:20:33 0:00 zfs create raid3155/padres
root 15384 0.0 0.1 7468 2256 ?     S 15:21:01 0:00 zfs create raid3155/redwings
root 15389 0.0 0.1 7468 2264 ?     S 15:21:20 0:00 zfs create raid3155/seahawks

Attempting to do a zpool list hangs, as does attempting to do a zpool status raid3155. Rebooting the system (forcefully) seems to 'fix' the problem, but once it comes back up, doing a zpool list or zpool status shows no issues with any of the drives. After a reboot:

r...@homiebackup10:~# zpool list
NAME       SIZE   USED   AVAIL  CAP  HEALTH  ALTROOT
raid3066   32.5T  18.1T  14.4T  55%  ONLINE  -
raid3154   32.5T  18.2T  14.3T  55%  ONLINE  -
raid3155   32.5T  18.7T  13.8T  57%  ONLINE  -
raid3156   32.5T  22.0T  10.5T  67%  ONLINE  -
rpool      59.5G  14.1G  45.4G  23%  ONLINE  -

We are using Silmech Storform iServ R505 machines with 3x Silmech Storform D55J JBOD SAS expanders connected to LSI Logic SAS1068E B3 eSAS cards, all containing 1.5TB Seagate 7200.11 SATA hard drives. We make a single striped raidz2 pool out of each chassis, giving us ~29TB of storage out of each 'brick', and we use rsync to copy the data from the machines to be backed up. They're currently running OpenSolaris 2009.06 (snv_111b).

We have had issues with the backplanes on these machines, but this particular machine has been up and running for nearly a year without any problems. It's currently at about 50% capacity on all pools.

I'm not really sure how to proceed from here as far as getting debug information while it's hung like this. I saw someone with similar issues post a few days ago but don't see any replies; the thread title is "[zfs-discuss] Problem with resilvering and faulty disk".
We've been seeing that issue as well while rebuilding these drives.

Any assistance with this would be greatly appreciated, and I can provide any information you folks might need to help troubleshoot this issue; just let me know what you need!

-Jeremy
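The pile-up of stuck commands in the ps listing earlier in the thread can be pulled out mechanically. This is a hypothetical sketch (not part of the original thread), assuming 'ps auxwww' output captured as a list of lines in the format shown above:

```python
def stuck_zfs_commands(ps_lines):
    """Extract zfs/zpool invocations from 'ps auxwww' output.

    In a hung-pool state, these commands sit at 0:00 CPU time for
    hours; returning (pid, start time, command) makes the pile-up
    easy to eyeball or sort.
    """
    out = []
    for line in ps_lines:
        # ps auxwww has 11 columns; limit the split so the full
        # command string (column 11) stays intact.
        fields = line.split(None, 10)
        if len(fields) == 11 and fields[10].startswith(('zfs ', 'zpool ')):
            out.append((fields[1], fields[8], fields[10]))
    return out

sample = [
    "root 15873 0.0 0.0 4216 1236 pts/2 S 15:56:54 0:00 grep zfs",
    "root 13678 0.0 0.1 7516 2176 ? S 14:18:00 0:00 zfs list -t filesystem raid3155/angels",
    "root 15245 0.0 0.1 7468 2256 ? S 15:17:59 0:00 zfs create raid3155/angels",
]
for pid, start, cmd in stuck_zfs_commands(sample):
    print(pid, start, cmd)
```

Note that the `grep zfs` line is filtered out because its command column does not start with `zfs `; hourly duplicates of the same `zfs create` target, as in the listing above, are the signature of a pool that stopped accepting transactions.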