On Sun, May 17, 2009 at 10:09 AM, Milan Jurik <[email protected]> wrote:
> Hi Ross,
>
> Ross wrote on Sat, 16 May 2009 at 08:37 -0700:
>> Here's the iSCSI bug I raised last year, which details the line of iscsi.h
>> where I think the 3 minute timeout is configured:
>> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6670866
>>
>> And here's the original thread where I first spotted the problem of the
>> entire pool hanging:
>> http://opensolaris.org/jive/thread.jspa?messageID=213482
>>
>
> CR 6670866 -> 11-Closed:Duplicate -> CR 6497777 -> "3-Accepted (Yes,
> that is a problem)"
>
>> Please note that it's not just iSCSI that can cause ZFS to time out. *Any*
>> single device that starts to time out can hang the entire pool if the driver
>> does not catch it. This behaviour has been seen with many types of drive,
>> and even on Sun x4500s, I believe.
>>
>> Here's the thread on ZFS availability I started back in August because I
>> think the ZFS guys could do with addressing this:
>> http://www.opensolaris.org/jive/thread.jspa?messageID=350750
>>
>> Note: Despite submitting this as a bug/rfe *three* times, there has never
>> been a bug report or RFE number generated for that. It seems that Sun want
>> to ignore this problem.
>
> Do you mean this one?
>
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6735932
>
> I cannot identify the CR; do you remember the content of the description or
> the subject?

Nope, that's another one I found :-). This one would have been titled
"ZFS availability" or similar. Key words to search for would be: ZFS,
availability, timeout, hang

The content is likely to be similar to this message, where I summed up the
main points from the discussion:
http://www.opensolaris.org/jive/thread.jspa?messageID=350750#274745

The timeline for when it was submitted is roughly as follows:
- It was first logged as an RFE between September and November 2008.
- It was submitted again on November 27, 2008.
- On December 2nd, I asked Sun to check since I still had no RFE numbers.
- On March 3rd, 2009, I bounced the thread again (and believe I attempted
  to raise this for the 3rd time).

If I were to sum up the suggestion in one paragraph, it would probably be
this:

If any one device in a pool starts to time out, ZFS should have the ability
to immediately issue the read to another device, and keep the pool running.
Handling writes is a little more difficult, and probably needs an option
setting. My preference would be that if the pool has two-disk redundancy,
writes should continue as normal if any one disk times out (with no more
than a few seconds' delay from the pool). This can happen without failing
the disks; FMA and the existing routines can do that. This is simply a
timeout aimed at keeping the pool running while the problem is diagnosed.

Ross

>
> Best regards,
>
> Milan
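
The read-side part of the suggestion above can be sketched roughly as
follows. This is only a user-space illustration of the idea, not ZFS code:
read_with_failover, slow_read, fast_read and the timeout value are all
hypothetical names and numbers made up for the example.

```python
import concurrent.futures as cf
import time


def read_with_failover(devices, block, timeout_s=0.2):
    """Try each redundant copy in turn; skip a device that exceeds timeout_s.

    devices is a list of (name, read_fn) pairs. A slow device is not marked
    as failed here; it is simply bypassed so the caller is not blocked.
    """
    pool = cf.ThreadPoolExecutor(max_workers=len(devices))
    try:
        for name, read_fn in devices:
            future = pool.submit(read_fn, block)
            try:
                return name, future.result(timeout=timeout_s)
            except cf.TimeoutError:
                continue  # leave diagnosis to something else, try next copy
        raise IOError("all devices timed out")
    finally:
        # Do not wait for a stuck read; that is the whole point.
        pool.shutdown(wait=False)


def slow_read(block):
    """Stand-in for a device that has started to time out."""
    time.sleep(0.5)
    return f"block-{block}".encode()


def fast_read(block):
    """Stand-in for a healthy mirror."""
    return f"block-{block}".encode()


result = read_with_failover(
    [("slow", slow_read), ("mirror", fast_read)], 7, timeout_s=0.05
)
print(result)
```

The slow read is left running in the background rather than cancelled,
which mirrors the point that the timeout is about keeping the pool
responsive, with fault diagnosis left to the existing mechanisms.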
