Trying to keep this in the spotlight - apologies for the lengthy post. I'd really like to see the features described by Ross in his summary of the "Availability: ZFS needs to handle disk removal / driver failure better" thread (http://www.opensolaris.org/jive/thread.jspa?messageID=274031). I'd like to have these or similar features as well. Have there already been internal discussions about adding this type of functionality to ZFS itself, and if so, was the outcome approval, disapproval, or no decision?
Unfortunately, my situation has put me in urgent need of workarounds in the meantime.

My setup: I have two iSCSI target nodes ("Storage Nodes"), each exporting six drives via iSCSI. A ZFS Node logs into both Storage Nodes and builds a mirrored zpool with one drive from each Storage Node on each side of the mirrored vdevs (6 x 2-way mirrors).

My problem: if a Storage Node crashes completely, is disconnected from the network, iscsitgt core dumps, a drive is pulled, or a drive has trouble accessing data (read retries), my ZFS Node hangs while ZFS waits patiently for the layers below to time out the devices and report a problem. This can halt both reads and writes to the zpool on the ZFS Node for roughly three minutes or longer. While that is acceptable in some situations, I have a case with a more severe availability demand.

My goal: figure out how to have the zpool pause for NO LONGER than 30 seconds (roughly within a typical HTTP request timeout) and then issue reads/writes to the good devices in the zpool/mirrors while the other side comes back online or is repaired.

My ideas:

1. In the case of the iSCSI targets disappearing (iscsitgt core dump, Storage Node crash, Storage Node disconnected from the network), I need to lower the iSCSI login retry/timeout values. Am I correct in assuming the only way to accomplish this is to recompile the iSCSI initiator? If so, can someone point me in the right direction (I have never compiled the ONNV sources - do I need to build the whole tree, or can I recompile just the iSCSI initiator)?

1.a. I'm not sure in which initiator session states iscsi_sess_max_delay applies - only to the initial login, or also to reconnects? Ross, if you still have your test boxes available, can you please try setting "set iscsi:iscsi_sess_max_delay = 5" in /etc/system (the exact fragment is sketched after this list), reboot, and try failing your iSCSI vdevs again? I can't find a case where this was tested for quick failover.

1.b. I would much prefer to have bug 6497777 addressed and fixed rather than resorting to recompiling the iSCSI initiator (if iscsi_sess_max_delay doesn't work). This seems like a trivial feature to implement. How can I sponsor development?

2. In the case of the iSCSI target being reachable but the physical disk having problems reading/writing data (retryable events that take roughly 60 seconds to time out), should I change the iscsi_rx_max_window tunable with mdb? Is there a corresponding iscsi_tx tunable? Ross, I know you tried this recently in the thread referenced above (with value 15), which still resulted in a 60-second hang. How did you offline the iSCSI volume to test this failure? Unless iSCSI retries some multiple of that value, maybe the way you failed the disk sent the iSCSI stack down a different failure path? Unfortunately I don't know of a way to introduce read/write retries on a disk while it is still reachable and presented via iscsitgt, so I'm not sure how to test this.

2.a. With the fix for http://bugs.opensolaris.org/view_bug.do?bug_id=6518995 , we can set sd_retry_count along with sd_io_time to cause I/O failure when a command takes longer than sd_retry_count * sd_io_time (a rough mdb sketch also follows this list). Can (or should) these tunables be set on the imported iSCSI disks on the ZFS Node, or can/should they be applied only to the local disks on the Storage Nodes?
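For 1.a, here is the exact /etc/system fragment I'm proposing to test - a sketch only, since I haven't verified that iscsi_sess_max_delay governs reconnects at all, and the 5-second value is just a guess aimed at staying inside an HTTP timeout:

  * Give up re-establishing an iSCSI session after 5 seconds instead of
  * retrying indefinitely (assumption: this applies to reconnects as well
  * as to the initial login, which is exactly what 1.a asks).
  set iscsi:iscsi_sess_max_delay = 5

As noted above, /etc/system changes need a reboot to take effect, which is why a runtime knob (or the fix for 6497777) would be preferable.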
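And for 2/2.a, the runtime pokes I have in mind via mdb -kw - again only a sketch: the variable names come from the thread and bug report above, the values are purely illustrative, and as far as I can tell the sd settings are global (they would hit the local disks too, which is the crux of my question in 2.a):

  # Lower the iSCSI receive timeout window (0t15 = decimal 15 in mdb notation).
  echo "iscsi_rx_max_window/W 0t15" | mdb -kw

  # Per bug 6518995, fail a command after sd_retry_count * sd_io_time
  # = 3 * 10 = 30 seconds instead of the defaults.
  echo "sd_retry_count/W 0t3" | mdb -kw
  echo "sd_io_time/W 0t10" | mdb -kw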
If there is a way to apply them to ONLY the imported iSCSI disks of the ZFS Node (and not its local disks), and without rebooting every time a new iSCSI disk is imported, then I'm thinking this is the way to go. In a year of having this setup in customer beta, I have never had both Storage Nodes (or both sides of a mirror) down at the same time, and I'd like ZFS to take advantage of that. If (and only if) both sides fail, then ZFS can enter failmode=wait.

Currently running Nevada b96; planning to move to >b100 shortly to avoid zpool commands hanging while the zpool is waiting to reach a device.

David Anderson
Aktiom Networks, LLC

Ross wrote:
> I discussed this exact issue on the forums in February and filed a bug at the time. I've also e-mailed and chatted with the iSCSI developers, and the iSER developers, a few times. There has also been another thread about making the iSCSI timeouts configurable a few months back, and finally, I started another discussion on ZFS availability and filed an RFE for pretty much exactly what you're asking for.
>
> So the question is being asked, but as for how long it will be before Sun improve ZFS availability, I really wouldn't like to say. One potential problem is that Sun almost certainly have a pretty good HA system with Fishworks running on their own hardware, and I don't know how much they are going to want to create an open source alternative to that.
>
> My original discussion in Feb:
> http://opensolaris.org/jive/thread.jspa?messageID=213482
>
> The iSCSI timeout bugs. The first one was raised in November 2006!!
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6497777
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6670866
>
> The ZFS availability thread:
> http://www.opensolaris.org/jive/thread.jspa?messageID=274031
>
> I can't find the RFE I filed on the back of that just yet; I'll have a look through my e-mails on Monday to find it for you.
>
> The one bright point is that it does look like it would be possible to edit iscsi.h manually and recompile the driver, but that's a bit outside of my experience right now, so I'm leaving that until I have no other choice.
>
> Ross