Re: [zfs-discuss] zpool getting in a stuck state?

2009-10-28 Thread Scott Meilicke
Hi Jeremy,

I had a loosely similar problem with my 2009.06 box. In my case (which may
not be yours), working with support we found a bug that was causing my pool
to hang. I also got spurious errors when I did a scrub (3 x 5-disk raidz).
I am using the same LSI controller. A surefire way to kill the box was to
set up a file system as an iSCSI target and write a lot of data to it at
around 1-2 MB/s. It would usually die inside of a few hours. NFS writes
were not as bad, but within a day it would panic there too.

The solution for me was to upgrade to build 124. Since the upgrade three
weeks ago, I have had no problems.

Again, I don't know if this would fix your problem, but it may be worth a
try. Just don't upgrade your pool's ZFS version, and you will be able to
roll back to 2009.06 at any time.
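
In case it's useful, the upgrade-and-rollback dance looked roughly like
this on my box (the dev repository URL is the one I used; boot environment
names will differ on yours):

# pkg set-publisher -O http://pkg.opensolaris.org/dev opensolaris.org
# pkg image-update     (creates a new boot environment at the newer build)
# reboot

and to roll back if the new build misbehaves:

# beadm list
# beadm activate <your-old-2009.06-BE>
# reboot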

-Scott


Re: [zfs-discuss] zpool getting in a stuck state?

2009-10-28 Thread Jeremy Kitchen


On Oct 26, 2009, at 4:00 PM, Cindy Swearingen wrote:


Hi Jeremy,

Can you use the command below and send me the output, please?

Thanks,

Cindy

# mdb -k
 ::stacks -m zfs


Ok, it did it again.  I replaced the drive and it's currently  
resilvering (13.66% done, 135h33m to go it says) and the output of  
that command is this:


 ::stacks -m zfs
THREAD           STATE    SOBJ                 COUNT
ff02f0078a80     SLEEP    MUTEX                31
 swtch+0x147
 turnstile_block+0x764
 mutex_vector_enter+0x261
 zfs_zget+0x47
 zfs_root+0x57
 fsop_root+0x2e
 traverse+0x61
 lookuppnvp+0x423
 lookuppnat+0x12c
 lookupnameat+0x91
 lookupname+0x28
 chroot+0x30
 _sys_sysenter_post_swapgs+0x14b

ff02efa33500     SLEEP    CV                   18
 swtch+0x147
 cv_wait+0x61
 dbuf_read+0x237
 dmu_buf_hold+0x96
 zap_lockdir+0x67
 zap_lookup_norm+0x55
 zap_lookup+0x2d
 zfs_match_find+0xfd
 zfs_dirent_lock+0x3d1
 zfs_dirlook+0xd9
 zfs_lookup+0x104
 fop_lookup+0xed
 lookuppnvp+0x3a3
 lookuppnat+0x12c
 lookupnameat+0x91
 cstatat_getvp+0x164
 cstatat64_32+0x82
 stat64_32+0x31
 _sys_sysenter_post_swapgs+0x14b

ff02da1a91e0     SLEEP    CV                   9
 swtch+0x147
 cv_wait+0x61
 zio_wait+0x5d
 dbuf_read+0x1e8
 dmu_buf_hold+0x96
 zap_lockdir+0x67
 zap_lookup_norm+0x55
 zap_lookup+0x2d
 zfs_match_find+0xfd
 zfs_dirent_lock+0x3d1
 zfs_dirlook+0xd9
 zfs_lookup+0x104
 fop_lookup+0xed
 lookuppnvp+0x3a3
 lookuppnat+0x12c
 lookupnameat+0x91
 cstatat_getvp+0x164
 cstatat64_32+0x82
 stat64_32+0x31
 _sys_sysenter_post_swapgs+0x14b

ff02d8c46ac0     SLEEP    CV                   7
 swtch+0x147
 cv_wait+0x61
 zio_wait+0x5d
 dbuf_read+0x1e8
 dbuf_findbp+0xe7
 dbuf_hold_impl+0x81
 dbuf_findbp+0xcf
 dbuf_hold_impl+0x81
 dbuf_hold+0x2e
 dnode_hold_impl+0xb5
 dnode_hold+0x2b
 dmu_bonus_hold+0x36
 zfs_zget+0x5a
 zfs_root+0x57
 fsop_root+0x2e
 traverse+0x61
 lookuppnvp+0x423
 lookuppnat+0x12c
 lookupnameat+0x91
 lookupname+0x28
 chroot+0x30
 _sys_sysenter_post_swapgs+0x14b

ff02da1a2720     SLEEP    CV                   6
 swtch+0x147
 cv_wait+0x61
 txg_wait_open+0x7a
 dmu_tx_wait+0xb3
 dmu_tx_assign+0x4b
 zfs_inactive+0xa8
 fop_inactive+0xaf
 vn_rele+0x5f
 closef+0x75
 closeandsetf+0x44a
 close+0x18
 _sys_sysenter_post_swapgs+0x14b

ff000f61bc60     SLEEP    CV                   5
 swtch+0x147
 cv_wait+0x61
 txg_thread_wait+0x5f
 txg_quiesce_thread+0x94
 thread_start+8

ff02d8514aa0     SLEEP    CV                   5
 swtch+0x147
 cv_wait+0x61
 zio_wait+0x5d
 dbuf_read+0x1e8
 dbuf_findbp+0xe7
 dbuf_hold_impl+0x81
 dbuf_findbp+0xcf
 dbuf_hold_impl+0x81
 dbuf_findbp+0xcf
 dbuf_hold_impl+0x81
 dbuf_hold+0x2e
 dnode_hold_impl+0xb5
 dnode_hold+0x2b
 dmu_bonus_hold+0x36
 zfs_zget+0x5a
 zfs_root+0x57
 fsop_root+0x2e
 traverse+0x61
 lookuppnvp+0x423
 lookuppnat+0x12c
 lookupnameat+0x91
 lookupname+0x28
 chroot+0x30
 _sys_sysenter_post_swapgs+0x14b

ff02d89b4c20     SLEEP    CV                   3
 swtch+0x147
 cv_wait+0x61
 txg_wait_synced+0x7f
 dmu_tx_wait+0xcd
 zfs_create+0x44d
 fop_create+0xfc
 

Re: [zfs-discuss] zpool getting in a stuck state?

2009-10-27 Thread Cindy Swearingen

Jeremy,

I generally suspect device failures in a case like this. If possible,
review the contents of /var/adm/messages and the output of fmdump -eV
to see whether the pool hang can be attributed to failed or failing
devices.
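
For example, something along these lines (the egrep pattern is just a
starting point):

# fmdump -eV | more
# egrep -i 'scsi|mpt' /var/adm/messages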

Cindy



On 10/26/09 17:28, Jeremy Kitchen wrote:

Cindy Swearingen wrote:

Hi Jeremy,

Can you use the command below and send me the output, please?

Thanks,

Cindy

# mdb -k

::stacks -m zfs


Ack!  It *just* fully died.  I've had our NOC folks reset the machine
and I will get this info to you as soon as it happens again (I'm fairly
certain it will; if not on this specific machine, then on one of our
other machines!)

-Jeremy





Re: [zfs-discuss] zpool getting in a stuck state?

2009-10-27 Thread Jeremy Kitchen
Cindy Swearingen wrote:
 Jeremy,
 
 I generally suspect device failures in a case like this. If possible,
 review the contents of /var/adm/messages and the output of fmdump -eV
 to see whether the pool hang can be attributed to failed or failing
 devices.

perusing /var/adm/messages, I see:

Oct 22 05:06:11 homiebackup10 scsi: [ID 365881 kern.info]
/p...@0,0/pci8086,4...@1/pci1000,3...@0 (mpt1):
Oct 22 05:06:11 homiebackup10   Log info 0x3108 received for target 5.
Oct 22 05:06:11 homiebackup10   scsi_status=0x0, ioc_status=0x804b,
scsi_state=0x0
Oct 22 05:06:19 homiebackup10 scsi: [ID 365881 kern.info]
/p...@0,0/pci8086,4...@1/pci1000,3...@0 (mpt1):
Oct 22 05:06:19 homiebackup10   Log info 0x3108 received for target 5.
Oct 22 05:06:19 homiebackup10   scsi_status=0x0, ioc_status=0x804b,
scsi_state=0x1
Oct 22 05:06:19 homiebackup10 scsi: [ID 365881 kern.info]
/p...@0,0/pci8086,4...@1/pci1000,3...@0 (mpt1):
Oct 22 05:06:19 homiebackup10   Log info 0x3108 received for target 5.
Oct 22 05:06:19 homiebackup10   scsi_status=0x0, ioc_status=0x804b,
scsi_state=0x0

lots of messages like that just prior to rsync warnings:

Oct 22 05:55:29 homiebackup10 rsyncd[29746]: [ID 702911 daemon.warning]
rsync: connection unexpectedly closed (0 bytes received so far) [receiver]
Oct 22 05:55:29 homiebackup10 rsyncd[29746]: [ID 702911 daemon.warning]
rsync error: error in rsync protocol data stream (code 12) at io.c(453)
[receiver=2.6.9]
Oct 22 06:10:29 homiebackup10 rsyncd[178]: [ID 702911 daemon.warning]
rsync: connection unexpectedly closed (0 bytes received so far) [receiver]
Oct 22 06:10:29 homiebackup10 rsyncd[178]: [ID 702911 daemon.warning]
rsync error: error in rsync protocol data stream (code 12) at io.c(453)
[receiver=2.6.9]
Oct 22 06:25:27 homiebackup10 rsyncd[776]: [ID 702911 daemon.warning]
rsync: connection unexpectedly closed (0 bytes received so far) [receiver]

I think the rsync warnings are indicative of the pool being hung.  So
it would seem that the bus is freaking out, and then the pool dies, and
that's that?  The strange thing is that this machine is way underloaded
compared to another one we have (which has 5 shelves, so ~150TB of
storage attached), and that one hasn't really had any problems like
this.  We had issues with it when rebuilding drives, but it's been
pretty stable since.

looking at fmdump -eV, I see lots and lots of these:

Oct 24 2009 05:02:54.098815545 ereport.io.scsi.cmd.disk.tran
nvlist version: 0
class = ereport.io.scsi.cmd.disk.tran
ena = 0x882108543f200401
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = dev
device-path = /p...@0,0/pci8086,4...@5/pci1000,3...@0/s...@30,0
(end detector)

driver-assessment = retry
op-code = 0x28
cdb = 0x28 0x0 0x51 0x9c 0xa5 0x80 0x0 0x0 0x80 0x0
pkt-reason = 0x4
pkt-state = 0x0
pkt-stats = 0x10
__ttl = 0x1
__tod = 0x4ae2ecee 0x5e3ce39



always with the same device name.  So it would appear that the drive
at that location is probably broken, and zfs just isn't detecting it
properly?
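
(For what it's worth, one way to double-check which disk that physical
path corresponds to is to look at the /dev/dsk symlinks and match them
against the device-path in the ereport, something like:

# ls -l /dev/dsk/*s0 | egrep 'pci1000'
)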

Also, I'm wondering if this is related to the recent thread titled
"[zfs-discuss] SNV_125 MPT warning in logfile", as we're using the same
controller that person mentions.

We're going to order some beefier controllers with the next shipment;
any suggestions on what to get?  If we find that the new controllers
work much better, we may even go as far as replacing the ones in the
existing machines (or at least in any machines experiencing these
issues).

We're not married to LSI, but we use LSI controllers in our webservers
for the most part and they're pretty solid there (though admittedly
those are hardware RAID rather than JBOD).

Thanks so much for your help!

-Jeremy





Re: [zfs-discuss] zpool getting in a stuck state?

2009-10-27 Thread Jeremy Kitchen
Jeremy Kitchen wrote:
 Cindy Swearingen wrote:
 I generally suspect device failures in a case like this. If possible,
 review the contents of /var/adm/messages and the output of fmdump -eV
 to see whether the pool hang can be attributed to failed or failing
 devices.

 perusing /var/adm/messages, I see:
 [mpt log excerpts and the repeating ereport.io.scsi.cmd.disk.tran
 events trimmed; see my previous message for the full spew]

So, doing some more reading here on the list and mucking about a bit
more, I've come across this in the fmdump log:

Oct 22 2009 05:03:56.687818542 ereport.fs.zfs.io
nvlist version: 0
class = ereport.fs.zfs.io
ena = 0x99eb889c6fe1
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = zfs
pool = 0x90ed10dfd0191c3b
vdev = 0xf41193d6d1deedc2
(end detector)

pool = raid3155
pool_guid = 0x90ed10dfd0191c3b
pool_context = 0
pool_failmode = wait
vdev_guid = 0xf41193d6d1deedc2
vdev_type = disk
vdev_path = /dev/dsk/c6t5d0s0
vdev_devid = id1,s...@n5000c50010a7666b/a
parent_guid = 0xcbaa8ea60a3c133
parent_type = raidz
zio_err = 5
zio_offset = 0xab2901da00
zio_size = 0x200
zio_objset = 0x4b
zio_object = 0xa26ef4
zio_level = 0
zio_blkid = 0xf
__ttl = 0x1
__tod = 0x4ae04a2c 0x28ff472e


c6t5d0 is in the problem pool (raid3155), so I've gone ahead and
offlined the drive and will be replacing it shortly.  Hopefully that
will take care of the problem!
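
(For the archives, the sequence I'm using is roughly:

# zpool offline raid3155 c6t5d0
  ... physically swap the drive ...
# zpool replace raid3155 c6t5d0

with zpool status to keep an eye on the resilver.)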

If this doesn't solve the problem, do you have any suggestions on what
more I can look at to try to figure out what's wrong?  Is there some
sort of setting I can set which will prevent the zpool from hanging up
the entire system in the event of a single drive failure like this?
It's really annoying not to be able to log into the machine (and to
have to forcefully reboot it) when this happens.

Thanks again for your help!

-Jeremy





Re: [zfs-discuss] zpool getting in a stuck state?

2009-10-27 Thread Cindy Swearingen

Hi Jeremy,

The ereport.io.scsi.cmd.disk.tran events are describing connection
problems with the /p...@0,0/pci8086,4...@5/pci1000,3...@0/s...@30,0
device. I think the .tran suffix is for transport.

ZFS might be reporting problems with the device as well, but if the
zpool/zfs commands are hanging, it might be difficult to get this
confirmation. Ordinarily, the zpool status command will report
device problems.
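
For example:

# zpool status -xv

(-x limits the output to pools with known problems; -v adds error
detail.)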

When a device in a pool fails, I/O to the pool can block, though reads
might still succeed. See the failmode property description in
zpool(1M).

Is this pool redundant? If so, you can attempt to offline this
device until it is replaced. If you have another device available,
you might replace the suspect drive and see if that solves the
pool hang problem.

Cindy



On 10/27/09 12:04, Jeremy Kitchen wrote:

[/var/adm/messages mpt log excerpts and fmdump -eV
ereport.io.scsi.cmd.disk.tran excerpts trimmed]

always with the same device name.  So it would appear that the drive
at that location is probably broken, and zfs just isn't detecting it
properly?

Also, I'm wondering if this is related to the recent thread titled
"[zfs-discuss] SNV_125 MPT warning in logfile", as we're using the same
controller that person mentions.

[...]

We're not married to LSI, but we use LSI controllers in our webservers
for the most part and they're pretty solid there (though admittedly
those are hardware RAID rather than JBOD).

Thanks so much for your help!

-Jeremy




Re: [zfs-discuss] zpool getting in a stuck state?

2009-10-27 Thread Cindy Swearingen

Jeremy,

I can't comment on your hardware because I'm not familiar with it.

If you have a storage pool with ZFS redundancy and one device fails
or begins failing, the pool keeps going in a degraded mode but is
generally still available.

You can try setting the failmode property to continue, which would
allow reads to continue in case of a device failure and might prevent
the pool from hanging. For example:
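
# zpool set failmode=continue raid3155
# zpool get failmode raid3155

(That's your pool name from the earlier mail; the default is
failmode=wait, which matches the pool_failmode = wait in your ereport.)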

If offlining the disk or replacing the disk doesn't help, let us know.

Cindy

On 10/27/09 13:13, Jeremy Kitchen wrote:

[earlier log excerpts trimmed]

So, doing some more reading here on the list and mucking about a bit
more, I've come across this in the fmdump log:

Oct 22 2009 05:03:56.687818542 ereport.fs.zfs.io
[...]
pool = raid3155
pool_failmode = wait
vdev_type = disk
vdev_path = /dev/dsk/c6t5d0s0
zio_err = 5
[...]

c6t5d0 is in the problem pool (raid3155), so I've gone ahead and
offlined the drive and will be replacing it shortly.  Hopefully that
will take care of the problem!

If this doesn't solve the problem, do you have any suggestions on what
more I can look at to try to figure out what's wrong?  Is there some
sort of setting I can set which will prevent the zpool from hanging up
the entire system in the [...]

Re: [zfs-discuss] zpool getting in a stuck state?

2009-10-26 Thread Jeremy Kitchen
Jeremy Kitchen wrote:
 Hey folks!
 
 We're using zfs-based file servers for our backups and we've been having
 some issues as of late with certain situations causing zfs/zpool
 commands to hang.

Anyone?  This is happening right now, and because we're doing a restore
I can't reboot the machine, so it's a prime opportunity to get debugging
information if it'll help.

Thanks!

-Jeremy






Re: [zfs-discuss] zpool getting in a stuck state?

2009-10-26 Thread Cindy Swearingen

Hi Jeremy,

Can you use the command below and send me the output, please?

Thanks,

Cindy

# mdb -k
 ::stacks -m zfs
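
If the box is too wedged for an interactive session, the same data can
be captured non-interactively, something like:

# echo '::stacks -m zfs' | mdb -k > /var/tmp/zfs-stacks.txt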

On 10/26/09 11:58, Jeremy Kitchen wrote:

Jeremy Kitchen wrote:

Hey folks!

We're using zfs-based file servers for our backups and we've been having
some issues as of late with certain situations causing zfs/zpool
commands to hang.


Anyone?  This is happening right now, and because we're doing a restore
I can't reboot the machine, so it's a prime opportunity to get debugging
information if it'll help.

Thanks!

-Jeremy








Re: [zfs-discuss] zpool getting in a stuck state?

2009-10-26 Thread Jeremy Kitchen
Cindy Swearingen wrote:
  Hi Jeremy,
 
  Can you use the command below and send me the output, please?
 
  Thanks,
 
  Cindy
 
  # mdb -k
  ::stacks -m zfs

Ack!  It *just* fully died.  I've had our NOC folks reset the machine
and I will get this info to you as soon as it happens again (I'm fairly
certain it will; if not on this specific machine, then on one of our
other machines!)

-Jeremy






[zfs-discuss] zpool getting in a stuck state?

2009-10-22 Thread Jeremy Kitchen

Hey folks!

We're using zfs-based file servers for our backups and we've been
having some issues as of late with certain situations causing zfs/zpool
commands to hang.


Currently, it appears that raid3155 is in this broken state:

r...@homiebackup10:~# ps auxwww | grep zfs
root 15873  0.0  0.0 4216 1236 pts/2  S 15:56:54  0:00 grep zfs
root 13678  0.0  0.1 7516 2176 ?      S 14:18:00  0:00 zfs list -t filesystem raid3155/angels
root 13691  0.0  0.1 7516 2188 ?      S 14:18:04  0:00 zfs list -t filesystem raid3155/blazers
root 13731  0.0  0.1 7516 2200 ?      S 14:18:20  0:00 zfs list -t filesystem raid3155/broncos
root 13792  0.0  0.1 7516 2220 ?      S 14:18:51  0:00 zfs list -t filesystem raid3155/diamondbacks
root 13910  0.0  0.1 7516 2216 ?      S 14:19:52  0:00 zfs list -t filesystem raid3155/knicks
root 13911  0.0  0.1 7516 2196 ?      S 14:19:53  0:00 zfs list -t filesystem raid3155/lions
root 13916  0.0  0.1 7516 2220 ?      S 14:19:55  0:00 zfs list -t filesystem raid3155/magic
root 13933  0.0  0.1 7516 2232 ?      S 14:20:01  0:00 zfs list -t filesystem raid3155/mariners
root 13966  0.0  0.1 7516 2212 ?      S 14:20:11  0:00 zfs list -t filesystem raid3155/mets
root 13971  0.0  0.1 7516 2208 ?      S 14:20:21  0:00 zfs list -t filesystem raid3155/niners
root 13982  0.0  0.1 7516 2220 ?      S 14:20:32  0:00 zfs list -t filesystem raid3155/padres
root 14064  0.0  0.1 7516 2220 ?      S 14:21:03  0:00 zfs list -t filesystem raid3155/redwings
root 14123  0.0  0.1 7516 2212 ?      S 14:21:20  0:00 zfs list -t filesystem raid3155/seahawks
root 14323  0.0  0.1 7420 2184 ?      S 14:22:51  0:00 zfs allow zfsrcv create,mount,receive,share raid3155
root 15245  0.0  0.1 7468 2256 ?      S 15:17:59  0:00 zfs create raid3155/angels
root 15250  0.0  0.1 7468 2244 ?      S 15:18:03  0:00 zfs create raid3155/blazers
root 15256  0.0  0.1 7468 2248 ?      S 15:18:19  0:00 zfs create raid3155/broncos
root 15284  0.0  0.1 7468 2256 ?      S 15:18:51  0:00 zfs create raid3155/diamondbacks
root 15322  0.0  0.1 7468 2260 ?      S 15:19:51  0:00 zfs create raid3155/knicks
root 15332  0.0  0.1 7468 2260 ?      S 15:19:53  0:00 zfs create raid3155/magic
root 15333  0.0  0.1 7468 2236 ?      S 15:19:53  0:00 zfs create raid3155/lions
root 15345  0.0  0.1 7468 2264 ?      S 15:20:01  0:00 zfs create raid3155/mariners
root 15355  0.0  0.1 7468 2260 ?      S 15:20:10  0:00 zfs create raid3155/mets
root 15363  0.0  0.1 7468 2252 ?      S 15:20:20  0:00 zfs create raid3155/niners
root 15368  0.0  0.1 7468 2256 ?      S 15:20:33  0:00 zfs create raid3155/padres
root 15384  0.0  0.1 7468 2256 ?      S 15:21:01  0:00 zfs create raid3155/redwings
root 15389  0.0  0.1 7468 2264 ?      S 15:21:20  0:00 zfs create raid3155/seahawks


Attempting to do a zpool list hangs, as does attempting to do a zpool
status raid3155.  Rebooting the system (forcefully) seems to 'fix' the
problem, but once it comes back up, doing a zpool list or zpool status
shows no issues with any of the drives.


(after a reboot):
r...@homiebackup10:~# zpool list
NAME       SIZE   USED  AVAIL   CAP  HEALTH  ALTROOT
raid3066  32.5T  18.1T  14.4T   55%  ONLINE  -
raid3154  32.5T  18.2T  14.3T   55%  ONLINE  -
raid3155  32.5T  18.7T  13.8T   57%  ONLINE  -
raid3156  32.5T  22.0T  10.5T   67%  ONLINE  -
rpool     59.5G  14.1G  45.4G   23%  ONLINE  -

We are using silmech storform iserv r505 machines with 3x silmech
storform D55J JBOD SAS expanders connected to LSI Logic SAS1068E B3
esas cards, all containing 1.5TB Seagate 7200.11 SATA hard drives.  We
make a single striped raidz2 pool out of each chassis, giving us ~29TB
of storage per 'brick', and we use rsync to copy the data from the
machines to be backed up.


They're currently running OpenSolaris 2009.06 (snv_111b)

We have had issues with the backplanes on these machines, but this  
particular machine has been up and running for nearly a year without  
any problems.  It's currently at about 50% capacity on all pools.


I'm not really sure how to proceed from here as far as getting debug
information while it's hung like this.  I saw someone with similar
issues post a few days ago but don't see any replies; the thread title
is "[zfs-discuss] Problem with resilvering and faulty disk".  We've
been seeing that issue as well while rebuilding these drives.


Any assistance with this would be greatly appreciated.  Any
information you folks might need to help troubleshoot this issue I can
provide; just let me know what you need!


-Jeremy