[zfs-discuss] The iSCSI-backed zpool for my zone hangs.

2009-10-21 Thread Jacob Ritorto
My goal is to have a big, fast, HA filer that holds nearly everything for a 
bunch of development services, each running in its own Solaris zone.  So when I 
need a new service, test box, etc., I provision a new zone and hand it to the 
dev requesters and they load their stuff on it and go.

Each zone has its zonepath on its own zpool, which is an iSCSI-backed device 
pointing to a unique sparse zvol on the filer.
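
Roughly, the per-zone plumbing is something like the sketch below. The dataset,
pool, zone, and address names (tank/zone1, zone1pool, zone1, 192.168.10.5) are
made up for illustration, and the sharing step depends on whether the filer is
on the old shareiscsi property or COMSTAR:

    # on the filer: one sparse (-s) zvol per zone, exported over iSCSI
    zfs create -s -V 32g tank/zone1
    zfs set shareiscsi=on tank/zone1     # pre-COMSTAR; COMSTAR uses sbdadm/stmfadm/itadm instead

    # on the 1U host: discover the LUN and build a one-device pool for the zone
    iscsiadm add discovery-address 192.168.10.5:3260
    iscsiadm modify discovery --sendtargets enable
    devfsadm -i iscsi
    zpool create zone1pool c2tXXXXXXXXd0   # whatever name format(1M) reports for the new LUN
    zonecfg -z zone1 "create; set zonepath=/zone1pool/zone1; set autoboot=true"
    zoneadm -z zone1 install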

If things slow down, we buy more 1U boxes with lots of CPU and RAM, don't 
care about the disk, and simply provision more LUNs on the filer.  Works great. 
 Cheap, good performance, nice and scalable.  They smiled on me for a while.

Until the filer dropped a few packets.

I know it shouldn't happen and I'm addressing that, but the failure mode 
for this eventuality is too drastic.  If the filer isn't responding to the 
zone's I/O requests, the zone pretty much completely hangs: it may still answer 
pings, but it won't accept any real connections.  Not surprisingly, it behaves 
like a machine whose root disk got yanked during normal operation.

To make it worse, the whole global zone seems unable to do anything about 
the issue.  I can't take the affected zone down; zoneadm commands just leave the 
zone in a shutting_down state forever, and zpool commands just hang.  The only 
thing I've found that recovers the box (from far away, in the middle of the 
night) is to uadmin 1 1 the global zone; even reboot didn't work.  So every zone 
on the box gets hard-reset, and that makes all the dev guys pretty unhappy.
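
For the record, this is the sort of thing I tried from the global zone while 
the filer was unreachable (zone and pool names made up):

    zoneadm -z zone1 halt       # zone just sits in shutting_down
    zpool status zone1pool      # hangs on the suspended pool
    reboot                      # never completes
    uadmin 1 1                  # forced immediate reboot -- the only thing that worked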

I've thought about setting failmode to continue on these individual zone pools, 
since it's set to wait right now.  How do you folks predict that change would 
play out?
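
That is, something along these lines (pool name made up):

    zpool get failmode zone1pool            # currently 'wait': all I/O blocks until the device returns
    zpool set failmode=continue zone1pool   # 'continue' returns EIO on new writes instead of blocking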

thx
jake
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss