Trying to keep this in the spotlight - apologies for the lengthy post. I'd really like to see the features described by Ross in his summary of the "Availability: ZFS needs to handle disk removal / driver failure better" thread (http://www.opensolaris.org/jive/thread.jspa?messageID=274031). I'd like to have these or similar features as well. Have there already been internal discussions about adding this type of functionality to ZFS itself, and if so, was the outcome approval, disapproval, or no decision?
Unfortunately, my situation has put me in urgent need of workarounds in the meantime.

My setup: I have two iSCSI target nodes ("Storage Nodes"), each exporting six drives via iSCSI. A ZFS Node logs into both Storage Nodes and builds a mirrored zpool with one drive from each Storage Node on each side of the mirrored vdevs (6 x 2-way mirrors).

My problem: if a Storage Node crashes completely, is disconnected from the network, iscsitgt core dumps, a drive is pulled, or a drive has trouble accessing data (read retries), my ZFS Node hangs while ZFS waits patiently for the layers below to time out the devices and report a problem. This can halt both reads and writes to the zpool on the ZFS Node for roughly three minutes or longer. While that is acceptable in some situations, I have a case with a more severe availability demand.

My goal: figure out how to have the zpool pause for NO LONGER than 30 seconds (roughly within a typical HTTP request timeout) and then issue reads/writes to the good devices in the zpool/mirrors while the other side comes back online or is repaired.

My ideas:

1. In the case of the iSCSI targets disappearing (iscsitgt core dump, Storage Node crash, Storage Node disconnected from the network), I need to lower the iSCSI login retry/timeout values. Am I correct in assuming the only way to accomplish this is to recompile the iSCSI initiator? If so, can someone point me in the right direction (I have never compiled the ONNV sources - do I need to build the whole tree, or can I recompile just the iSCSI initiator)?

1.a. I'm not sure in which initiator session states iscsi_sess_max_delay applies - only to the initial login, or also to reconnects? Ross, if you still have your test boxes available, can you please try setting "set iscsi:iscsi_sess_max_delay = 5" in /etc/system (the exact fragment is sketched after this list), reboot, and try failing your iSCSI vdevs again? I can't find a case where this was tested for quick failover.

1.b. I would much prefer to have bug 6497777 addressed and fixed rather than resorting to recompiling the iSCSI initiator (if iscsi_sess_max_delay doesn't work). This seems like a trivial feature to implement. How can I sponsor development?

2. In the case of the iSCSI target being reachable but the physical disk having problems reading/writing data (retryable events that take roughly 60 seconds to time out), should I change the iscsi_rx_max_window tunable with mdb? Is there a corresponding iscsi_tx tunable? Ross, I know you tried this recently in the thread referenced above (with value 15), which still resulted in a 60-second hang. How did you offline the iSCSI volume to test this failure? Unless iSCSI retries some multiple of that value, maybe the way you failed the disk sent the iSCSI stack down a different failure path? Unfortunately I don't know of a way to introduce read/write retries on a disk while it is still reachable and presented via iscsitgt, so I'm not sure how to test this.

2.a. With the fix for http://bugs.opensolaris.org/view_bug.do?bug_id=6518995 , we can set sd_retry_count along with sd_io_time to cause I/O failure when a command takes longer than sd_retry_count * sd_io_time (a rough mdb sketch also follows this list). Can (or should) these tunables be set on the imported iSCSI disks on the ZFS Node, or can/should they be applied only to the local disks on the Storage Nodes?
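For 1.a, here is the exact /etc/system fragment I'm proposing to test - a sketch only, since I haven't verified that iscsi_sess_max_delay governs reconnects at all, and the 5-second value is just a guess aimed at staying inside an HTTP timeout:

  * Give up re-establishing an iSCSI session after 5 seconds instead of
  * retrying indefinitely (assumption: this applies to reconnects as well
  * as to the initial login, which is exactly what 1.a asks).
  set iscsi:iscsi_sess_max_delay = 5

As noted above, /etc/system changes need a reboot to take effect, which is why a runtime knob (or the fix for 6497777) would be preferable.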
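And for 2/2.a, the runtime pokes I have in mind via mdb -kw - again only a sketch: the variable names come from the thread and bug report above, the values are purely illustrative, and as far as I can tell the sd settings are global (they would hit the local disks too, which is the crux of my question in 2.a):

  # Lower the iSCSI receive timeout window (0t15 = decimal 15 in mdb notation).
  echo "iscsi_rx_max_window/W 0t15" | mdb -kw

  # Per bug 6518995, fail a command after sd_retry_count * sd_io_time
  # = 3 * 10 = 30 seconds instead of the defaults.
  echo "sd_retry_count/W 0t3" | mdb -kw
  echo "sd_io_time/W 0t10" | mdb -kw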
If there is a way to apply them to ONLY the imported iSCSI disks of the ZFS Node (and not its local disks), and without rebooting every time a new iSCSI disk is imported, then I'm thinking this is the way to go. In a year of having this setup in customer beta, I have never had both Storage Nodes (or both sides of a mirror) down at the same time, and I'd like ZFS to take advantage of that. If (and only if) both sides fail, then ZFS can enter failmode=wait.

Currently running Nevada b96; planning to move to >b100 shortly to avoid zpool commands hanging while the zpool is waiting to reach a device.

David Anderson
Aktiom Networks, LLC

Ross wrote:
> I discussed this exact issue on the forums in February and filed a bug at the time. I've also e-mailed and chatted with the iSCSI developers, and the iSER developers, a few times. There has also been another thread about making the iSCSI timeouts configurable a few months back, and finally, I started another discussion on ZFS availability and filed an RFE for pretty much exactly what you're asking for.
>
> So the question is being asked, but as for how long it will be before Sun improve ZFS availability, I really wouldn't like to say. One potential problem is that Sun almost certainly have a pretty good HA system with Fishworks running on their own hardware, and I don't know how much they are going to want to create an open source alternative to that.
>
> My original discussion in Feb:
> http://opensolaris.org/jive/thread.jspa?messageID=213482
>
> The iSCSI timeout bugs. The first one was raised in November 2006!!
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6497777
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6670866
>
> The ZFS availability thread:
> http://www.opensolaris.org/jive/thread.jspa?messageID=274031
>
> I can't find the RFE I filed on the back of that just yet; I'll have a look through my e-mails on Monday to find it for you.
>
> The one bright point is that it does look like it would be possible to edit iscsi.h manually and recompile the driver, but that's a bit outside of my experience right now, so I'm leaving that until I have no other choice.
>
> Ross