Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Ok, I've done some more testing today and I almost don't know where to start.

I'll begin with the good news for Miles :)
- Rebooting doesn't appear to cause ZFS to lose the resilver status (but see 1. below)
- Resilvering appears to work fine; once complete, I never saw any checksum errors when scrubbing the pool.
- Reconnecting iscsi drives causes ZFS to automatically online the pool and automatically begin resilvering.

And now the bad news:
1. While rebooting doesn't seem to cause the resilver to lose its status, something's causing it problems. I saw it restart several times.
2. With iscsi, you can't reboot with sendtargets enabled; static discovery still seems to be the order of the day.
3. There appears to be a disconnect between what iscsiadm knows and what ZFS knows about the status of the devices.

And I have confirmation of some of my earlier findings too:
4. iSCSI still has a 3 minute timeout, during which time your pool will hang, no matter how many redundant drives you have available.
5. zpool status can still hang when a device goes offline, and when it finally recovers, it will then report out-of-date information. This could be Bug 6667199, but I've not seen anybody reporting the incorrect-information part of this.
6. After one drive goes offline, during the resilver process, zpool status shows that information is being resilvered on the good drives. Does anybody know why this happens?
7. Although ZFS will automatically online a pool when iscsi devices come online, CIFS shares are not automatically remounted.

I also have a few extra notes about a couple of those:

1 - resilver losing status
===
Regarding the resilver restarting, I've seen it reported that zpool status can cause this when run as admin, but I'm not convinced that's the cause. Same for the rebooting problem. I was able to run zpool status dozens of times as an admin, but only two or three times did I see the resilver restart.
Also, after rebooting, I could see that the resilver was showing 66% complete, but then a second later it restarted. None of this is conclusive; I really need to test with a much larger dataset to get an idea of what's really going on, but there's definitely something weird happening here.

3 - disconnect between iscsiadm and ZFS
===
I repeated my test of offlining an iscsi target, this time checking iscsiadm to see when it disconnected. I waited until iscsiadm reported 0 connections to the target, then started a CIFS file copy and ran zpool status. Zpool status hung as expected, and a minute or so later the CIFS copy failed. It seems that although iscsiadm was aware that the target was offline, ZFS did not yet know about it. As expected, a minute or so later, zpool status completed (returning incorrect results), and I could then run the CIFS copy fine.

5 - zpool status hanging and reporting incorrect information
===
When an iSCSI device goes offline, if you immediately run zpool status, it hangs for 3-4 minutes. When it finally completes, it gives incorrect information, reporting all the devices as online. If you immediately re-run zpool status, it completes rapidly and will now correctly show the offline devices.
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
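[Editor's note: the disconnect in 3. and the stale output in 5. are easier to pin down if the per-device states are snapshotted programmatically rather than eyeballed. A minimal sketch, not from the thread itself: a helper that extracts the STATE column from captured 'zpool status' output, so successive snapshots can be diffed against what iscsiadm reports. The sample text is abbreviated from output posted later in this thread; on a live system you would feed the function "$(zpool status iscsipool)" instead.]

```shell
# Sketch: extract per-device states from captured 'zpool status' output,
# so a script can compare ZFS's view against iscsiadm's over time.
# The sample below is abbreviated from output shown later in this thread.
device_states() {
  awk '$2 ~ /^(ONLINE|OFFLINE|DEGRADED|FAULTED|UNAVAIL)$/ { print $1, $2 }'
}

sample='        NAME                                   STATE     READ WRITE CKSUM
        iscsipool                              DEGRADED     0     0     0
          raidz1                               DEGRADED     0     0     0
            c2t600144F04933FF6C5056967AC800d0  ONLINE       0     0     0
            c2t600144F04934119E50569675FF00d0  UNAVAIL      3 5.80K     0'

# On a real system: zpool status iscsipool | device_states
printf '%s\n' "$sample" | device_states
```

Logging this once a second alongside `iscsiadm list target` timestamps would show exactly how far ZFS's view lags the initiator's.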
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
> 2. With iscsi, you can't reboot with sendtargets enabled; static
> discovery still seems to be the order of the day.

I'm seeing this problem with static discovery: http://bugs.opensolaris.org/view_bug.do?bug_id=6775008

> 4. iSCSI still has a 3 minute timeout, during which time your pool will
> hang, no matter how many redundant drives you have available.

This is CR 649, http://bugs.opensolaris.org/view_bug.do?bug_id=649, which is separate from the boot time timeout, though, and also one that Sun so far has been unable to fix!
--
Maurice Volaski, [EMAIL PROTECTED]
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Yeah, thanks Maurice, I just saw that one this afternoon. I guess you can't reboot with iscsi, full stop... o_0

And I've seen the iscsi bug before (I was just too lazy to look it up, lol); I've been complaining about that since February. In fact it's been a bad week for iscsi here: I've managed to crash the iscsi client twice in the last couple of days too (full kernel dump crashes), so I'll be filing a bug report on that tomorrow morning when I get back to the office.

Ross

On Wed, Dec 3, 2008 at 7:39 PM, Maurice Volaski [EMAIL PROTECTED] wrote:
>> 2. With iscsi, you can't reboot with sendtargets enabled; static
>> discovery still seems to be the order of the day.
>
> I'm seeing this problem with static discovery:
> http://bugs.opensolaris.org/view_bug.do?bug_id=6775008
>
>> 4. iSCSI still has a 3 minute timeout, during which time your pool
>> will hang, no matter how many redundant drives you have available.
>
> This is CR 649, http://bugs.opensolaris.org/view_bug.do?bug_id=649,
> which is separate from the boot time timeout, though, and also one that
> Sun so far has been unable to fix!
> --
> Maurice Volaski, [EMAIL PROTECTED]
> Computing Support, Rose F. Kennedy Center
> Albert Einstein College of Medicine of Yeshiva University
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
>>>>> "rs" == Ross [EMAIL PROTECTED] writes:

    rs I don't think it likes it if the iscsi targets aren't
    rs available during boot.

from my cheatsheet:

-8-
ok boot -m milestone=none
[boots.  enter root password for maintenance.]
bash-3.00# /sbin/mount -o remount,rw /
   [-- otherwise iscsiadm won't update /etc/iscsi/*]
bash-3.00# /sbin/mount /usr
bash-3.00# /sbin/mount /var
bash-3.00# /sbin/mount /tmp
bash-3.00# iscsiadm remove discovery-address 10.100.100.135
bash-3.00# iscsiadm remove discovery-address 10.100.100.138
bash-3.00# iscsiadm remove discovery-address 10.100.100.138
iscsiadm: unexpected OS error
iscsiadm: Unable to complete operation
   [-- good.  it's gone.]
bash-3.00# sync
bash-3.00# lockfs -fa
bash-3.00# reboot
-8-

    rs # time zpool status
    [...]
    rs real 3m51.774s

so, this hang may happen in fewer situations, but it is not fixed.

    r 6. After one drive goes offline, during the resilver process,
    r zpool status shows that information is being resilvered on the
    r good drives. Does anybody know why this happens?

I don't know why. I've seen that too, though. For me it's always been relatively short, ~1min. I wonder if there are three kinds of scrub-like things, not just two (resilvers and scrubs), and 'zpool status' is ``simplifying'' for us again?

    r 7. Although ZFS will automatically online a pool when iscsi
    r devices come online, CIFS shares are not automatically
    r remounted.

For me, even plain filesystems are not all remounted. ZFS tries to mount them in the wrong order, so it would mount /a/b/c, then try to mount /a/b and complain ``directory not empty''. I'm not sure why it mounts things in the right order at boot/import, but in haphazard order after one of these auto-onlines. Then NFS exporting didn't work either. To fix, I have to 'zfs umount /a/b/c', but then there is a b/c directory inside filesystem /a, so I have to 'rmdir /a/b/c' by hand, because the '... set mountpoint' koolaid creates the directories but doesn't remove them. Then 'zfs mount -a' and 'zfs share -a'.
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Hey folks, I've just followed up on this, testing iSCSI with a raided pool, and it still appears to be struggling when a device goes offline.

To recap the earlier exchange:

> I don't see how this could work except for mirrored pools. Would that
> carry enough market to be worthwhile?
>  -- richard

I have to admit, I've not tested this with a raided pool, but since all ZFS commands hung when my iSCSI device went offline, I assumed that you would get the same effect of the pool hanging if a raid-z2 pool is waiting for a response from a device. Mirrored pools do work particularly well with this, since it gives you the potential to have remote mirrors of your data, but if you had a raid-z2 pool you still wouldn't want that hanging if a single device failed.

> zpool commands hanging is CR 6667208, and has been fixed in b100.
> http://bugs.opensolaris.org/view_bug.do?bug_id=6667208

I said I would go and test the raid scenario on a current build, just to be sure, and Richard said:

> Please.
>  -- richard

So: I've just created a pool using three snv_103 iscsi targets, with a fourth install of snv_103 collating those targets into a raidz pool and sharing that out over CIFS. To test the server, while transferring files from a Windows workstation I powered down one of the three iSCSI targets. It took a few minutes to shut down, but once that happened the Windows copy halted with the error: "The specified network name is no longer available."

At this point the ZFS admin tools still work fine (which is a huge improvement, well done!), but zpool status still reports that all three devices are online. A minute later, I can open the share again and start another copy. Thirty seconds after that, zpool status finally reports that the iscsi device is offline. So it looks like we have the same problems: the 3 minute delay, zpool status reporting wrong information, and the CIFS service having problems too.

At this point I restarted the iSCSI target, but had problems bringing it back online.
It appears there's a bug in the initiator, but it's easily worked around:
http://www.opensolaris.org/jive/thread.jspa?messageID=312981#312981

What was great was that as soon as the iSCSI initiator reconnected, ZFS started resilvering. What might not be so great is the fact that all three devices are showing that they've been resilvered:

# zpool status
  pool: iscsipool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 0h2m with 0 errors on Tue Dec  2 11:04:10 2008
config:

        NAME                                   STATE     READ WRITE CKSUM
        iscsipool                              ONLINE       0     0     0
          raidz1                               ONLINE       0     0     0
            c2t600144F04933FF6C5056967AC800d0  ONLINE       0     0     0  179K resilvered
            c2t600144F04934FAB35056964D9500d0  ONLINE       5 9.88K     0  311M resilvered
            c2t600144F04934119E50569675FF00d0  ONLINE       0     0     0  179K resilvered

errors: No known data errors

It's proving a little hard to know exactly what's happening when, since I've only got a few seconds to log times, and there are delays with each step.
However, I ran another test using robocopy and was able to observe the behaviour a little more closely:

Test 2: Using robocopy for the transfer, and iostat plus zpool status on the server

10:46:30 - iSCSI server shutdown started
10:52:20 - all drives still online according to zpool status
10:53:30 - robocopy error: "The specified network name is no longer available"
         - zpool status shows all three drives as online
         - zpool iostat appears to have hung, taking much longer than the 30s specified to return a result
         - robocopy is now retrying the file, but appears to have hung
10:54:30 - robocopy, CIFS and iostat all start working again, pretty much simultaneously
         - zpool status now shows the drive as offline

I could probably do with using DTrace to get a better look at this, but I haven't learnt that yet. My guess as to what's happening:

- iSCSI target goes offline.
- ZFS will not be notified for 3 minutes, but I/O to that device is essentially hung.
- CIFS times out (I suspect this is on the client side with around a 30s timeout, but I can't find the timeout documented anywhere).
- zpool iostat is now waiting; I may be wrong, but this doesn't appear to have benefited from the changes to zpool status.
- After 3 minutes, the iSCSI drive goes offline. The pool carries on with the remaining two drives, CIFS carries on working, iostat carries on working. zpool status however is still out of date.
- zpool status eventually catches up, and reports that the drive has gone offline.
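[Editor's note: one way to log a timeline like the above without each probe blocking the whole script is to bound every status command with a deadline. A hedged sketch using GNU coreutils timeout(1); that utility is not part of Solaris of this vintage, so treat it as illustrative of the approach rather than a drop-in tool. 'sleep 3' stands in for a hung 'zpool status'.]

```shell
# Sketch: bound a potentially-hanging status command with a deadline and
# report HUNG instead of blocking, so a once-a-second logging loop keeps
# running through the 3-4 minute window described above.
probe() {  # probe <deadline-seconds> <command...>
  deadline=$1; shift
  if timeout "$deadline" "$@" >/dev/null 2>&1; then
    echo OK
  else
    echo HUNG
  fi
}

# 'sleep 3' stands in for a hung 'zpool status'; 'true' for a healthy one.
probe 1 sleep 3
probe 5 true
```

On the real rig the loop body would be something like `echo "$(date +%T) $(probe 30 zpool status iscsipool)"`, giving timestamps for exactly when the command starts and stops hanging.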
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Incidentally, while I've reported this again as an RFE, I still haven't seen a CR number for it. Could somebody from Sun check whether it's been filed, please?

thanks,
Ross
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Hi Richard,

Thanks, I'll give that a try. I think I just had a kernel dump while trying to boot this system back up, though; I don't think it likes it if the iscsi targets aren't available during boot. Again, that rings a bell, so I'll go see if that's another known bug.

Changing that setting on the fly didn't seem to help; if anything, things are worse this time around. I changed the timeout to 15 seconds, but didn't restart any services:

# echo iscsi_rx_max_window/D | mdb -k
iscsi_rx_max_window:
iscsi_rx_max_window:    180
# echo iscsi_rx_max_window/W0t15 | mdb -kw
iscsi_rx_max_window:    0xb4    =       0xf
# echo iscsi_rx_max_window/D | mdb -k
iscsi_rx_max_window:
iscsi_rx_max_window:    15

After making those changes and repeating the test, offlining an iscsi volume hung all the commands running on the pool. I had three ssh sessions open, running the following:

# zpool iostat -v iscsipool 10 100
# format < /dev/null
# time zpool status

They hung for what felt like a minute or so. After that, the CIFS copy timed out. After the CIFS copy timed out, I tried immediately restarting it. It took a few more seconds, but restarted no problem. Within a few seconds of that restarting, iostat recovered, and format returned its result too. Around 30 seconds later, zpool status reported two drives, paused again, then showed the status of the third:
# time zpool status
  pool: iscsipool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 0h0m with 0 errors on Tue Dec  2 16:39:21 2008
config:

        NAME                                   STATE     READ WRITE CKSUM
        iscsipool                              ONLINE       0     0     0
          raidz1                               ONLINE       0     0     0
            c2t600144F04933FF6C5056967AC800d0  ONLINE       0     0     0  15K resilvered
            c2t600144F04934FAB35056964D9500d0  ONLINE       0     0     0  15K resilvered
            c2t600144F04934119E50569675FF00d0  ONLINE       0   200     0  24K resilvered

errors: No known data errors

real    3m51.774s
user    0m0.015s
sys     0m0.100s

Repeating that a few seconds later gives:

# time zpool status
  pool: iscsipool
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist
        for the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scrub: resilver completed after 0h0m with 0 errors on Tue Dec  2 16:39:21 2008
config:

        NAME                                   STATE     READ WRITE CKSUM
        iscsipool                              DEGRADED     0     0     0
          raidz1                               DEGRADED     0     0     0
            c2t600144F04933FF6C5056967AC800d0  ONLINE       0     0     0  15K resilvered
            c2t600144F04934FAB35056964D9500d0  ONLINE       0     0     0  15K resilvered
            c2t600144F04934119E50569675FF00d0  UNAVAIL      3 5.80K     0  cannot open

errors: No known data errors

real    0m0.272s
user    0m0.029s
sys     0m0.169s

On Tue, Dec 2, 2008 at 3:58 PM, Richard Elling [EMAIL PROTECTED] wrote:
> ... iSCSI timeout is set to 180 seconds in the client code. The only
> way to change it is to recompile, or use mdb. Since you have this test
> rig set up, and I don't, do you want to experiment with this timeout?
> The variable is actually called iscsi_rx_max_window, so if you do
>   echo iscsi_rx_max_window/D | mdb -k
> you should see 180. Change it using something like:
>   echo iscsi_rx_max_window/W0t30 | mdb -kw
> to set it to 30 seconds.
>  -- richard
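[Editor's note: since 'mdb -kw' pokes a live kernel, it's worth double-checking the one-liners before running them as root. A small sketch, my own, that only constructs the commands Richard described rather than running them: /D prints the variable in decimal, and /W0t<n> writes the decimal value <n>. The function names are invented for illustration.]

```shell
# Sketch: build (but do not run) the mdb one-liners for reading and
# setting iscsi_rx_max_window. /D prints the variable as decimal;
# /W0t<n> writes the decimal value <n>. Actually running these needs
# root and 'mdb -kw' against a live kernel, so here we only emit them.
rx_window_read_cmd() {
  echo 'echo iscsi_rx_max_window/D | mdb -k'
}
rx_window_write_cmd() {  # rx_window_write_cmd <seconds>
  echo "echo iscsi_rx_max_window/W0t$1 | mdb -kw"
}

rx_window_read_cmd
rx_window_write_cmd 15
```

Piping the emitted string through `sh` (as root, eyes open) would then perform the actual change; keeping generation and execution separate makes it easy to review the write before it hits the kernel.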
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
>>>>> "rs" == Ross Smith [EMAIL PROTECTED] writes:

    rs 4. zpool status still reports out of date information.

I know people are going to skim this message and not hear this. They'll say ``well of course zpool status says ONLINE while the pool is hung. ZFS is patiently waiting. It doesn't know anything is broken yet.'' But you are NOT saying it's out of date because it doesn't say OFFLINE the instant you power down an iSCSI target. You're saying:

    rs - After 3 minutes, the iSCSI drive goes offline. The pool
    rs   carries on with the remaining two drives, CIFS carries on
    rs   working, iostat carries on working. zpool status however is
    rs   still out of date.
    rs - zpool status eventually catches up, and reports that the
    rs   drive has gone offline.

so, there is a ~30sec window when it's out of date. When you say ``goes offline'' in the first bullet, you're saying ``ZFS must have marked it offline internally, because the pool unfroze,'' but you found that even after it ``goes offline'', 'zpool status' still reports it ONLINE.

The question is, what the hell is 'zpool status' reporting? Not the status, apparently. It's supposed to be a diagnosis tool. Why should you have to second-guess it and infer the position of ZFS's various internal state machines through careful indirect observation (``oops, CIFS just came back,'' or ``oh, something must have changed because zpool iostat isn't hanging any more'')? Why not have a tool that TELLS you plainly what's going on? 'zpool status' isn't it. Is it trying to oversimplify things, to condescend to the sysadmin or hide ZFS's rough edges? Are there more states for devices that are being compressed down to ONLINE OFFLINE DEGRADED FAULTED? Is there some tool in zdb or mdb that is like 'zpool status -simonsez'? I already know sometimes it'll report everything as ONLINE but refuse 'zpool offline ... device' with 'no valid replicas', so I think, yes, there are ``secret states'' for devices? Or is it trying to do too many things with one output format?

    rs 5.
    rs    When iSCSI targets finally do come back online, ZFS is
    rs    resilvering all of them (again, this rings a bell, Miles
    rs    might have reported something similar).

My zpool status is so old it doesn't say ``xxkB resilvered'', so I've no indication which devices are the source vs. target of the resilver. What I found was, the auto-resilver isn't sufficient. If you wait for it to complete, then 'zpool scrub', you'll get thousands of CKSUM errors on the dirty device, so the resilver isn't covering all the dirtiness. Also, ZFS seems to forget about the need to resilver if you shut down the machine, bring back the missing target, and boot: it marks everything ONLINE and then resilvers as you hit the dirty data, counting CKSUM errors. This has likely been fixed between b71 and b101. It's easy to test: (a) shut down one iSCSI target, (b) write to the pool, (c) bring the iSCSI target back, (d) wait for the auto-resilver to finish, (e) 'zpool scrub', (f) look for CKSUM errors. I suspect you're more worried about your own problems, though; I'll try to retest it soon.
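[Editor's note: step (f) of the recipe above can be automated by scanning 'zpool status' output for nonzero CKSUM counts after the scrub. A sketch, mine rather than Miles's: the sample input is fabricated in the per-device format shown earlier in this thread, with one deliberately dirty device, so the parsing can be demonstrated without a pool.]

```shell
# Sketch of step (f): scan captured 'zpool status' output for devices
# with a nonzero CKSUM column (5th field on per-device lines). On a
# real system, feed it "$(zpool status <pool>)" after the scrub; the
# sample below is fabricated, with one dirty device, to show the idea.
cksum_errors() {
  awk '$2 ~ /^(ONLINE|DEGRADED|UNAVAIL)$/ && $5+0 > 0 { print $1, $5 }'
}

sample='            c2t600144F04933FF6C5056967AC800d0  ONLINE  0  0  0
            c2t600144F04934FAB35056964D9500d0  ONLINE  0  0  1742'

printf '%s\n' "$sample" | cksum_errors
```

An empty result after step (e) would mean the auto-resilver really did cover all the dirty data; any output names the device the resilver missed.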
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Hi Miles,

It's probably a bad sign that although that post came through as anonymous in my e-mail, I recognised your style before I got halfway through your post :)

I agree, the zpool status being out of date is weird. I'll dig out the bug number for that at some point, as I'm sure I've mentioned it before. It looks to me like there are two separate pieces of code that work out the status of the pool: there's the stuff ZFS uses internally to run the pool, and then there's a completely separate piece that does the reporting to the end user.

I agree that it could be a case of oversimplifying things. There's no denying the ease of admin is one of ZFS's strengths, but I think the whole zpool status thing needs looking at again. Neither the way the command freezes, nor the out-of-date information, makes any sense to me.

And yes, I'm aware of the problems you've reported with resilvering. That's on my list of things to test with this. I've already done a quick test of running a scrub after the resilver (which appeared ok at first glance), and tomorrow I'll be testing the reboot status too.
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
>>>>> "r" == Ross [EMAIL PROTECTED] writes:

    r style before I got half way through your post :) [...status
    r problems...] could be a case of oversimplifying things.

Yeah, I was a bit inappropriate, but my frustration comes from the (partly paranoid) imagining of how the idea ``we need to make it simple'' might have spooled out through a series of design meetings into a culturally-insidious, mind-blowing condescension toward the sysadmin. ``Simple'', to me, means that a 'status' tool does not read things off disks, and does not gather a bunch of scraps to fabricate a pretty (``simple''?) fantasy-world at invocation which is torn down again when it exits. The Linux status tools are pretty-printing wrappers around 'cat /proc/$THING/status'. That is SIMPLE! And, screaming monkeys though they often are, the college kids writing Linux are generally disciplined enough not to grab a bunch of locks and then go to sleep for minutes when delivering things from /proc. I love that. The other, broken, idea of ``simple'' is what I come to Unix to avoid.

And yes, this is a religious argument. Just because it spans decades of experience and includes ideas of style doesn't mean it should be dismissed as hocus-pocus. And I don't like all these binary config files either. Not even Mac OS X is pulling that baloney any more.

    r There's no denying the ease of admin is one of ZFS' strengths,

I deny it! It is not simple to start up 'format' and 'zpool iostat' and RoboCopy on another machine because you cannot trust the output of the status command. And getting visibility into something by starting a bunch of commands in different windows and watching which one unfreezes when is hilarious, not simple.

    r the problems you've reported with resilvering.

I think we were watching this bug:

  http://bugs.opensolaris.org/view_bug.do?bug_id=6675685

so that ought to be fixed in your test system but not in s10u6,
but it might not be completely fixed yet:

  http://bugs.opensolaris.org/view_bug.do?bug_id=6747698
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On 2-Dec-08, at 3:35 PM, Miles Nordin wrote:
> ... And yes, this is a religious argument. Just because it spans
> decades of experience and includes ideas of style doesn't mean it
> should be dismissed as hocus-pocus. And I don't like all these binary
> config files either. Not even Mac OS X is pulling that baloney any
> more.

OS X never used binary config files; it standardised on XML property lists for the new subsystems (plus a lot of good old-fashioned UNIX config). Perhaps you are thinking of Mac OS 9 and earlier (resource forks).

--Toby
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Ross Smith wrote:
> On Fri, Nov 28, 2008 at 5:05 AM, Richard Elling [EMAIL PROTECTED] wrote:
>> Ross wrote:
>>> Well, you're not alone in wanting to use ZFS and iSCSI like that,
>>> and in fact my change request suggested that this is exactly one of
>>> the things that could be addressed: The idea is really a two stage
>>> RFE, since just the first part would have benefits. The key is to
>>> improve ZFS availability, without affecting its flexibility,
>>> bringing it on par with traditional raid controllers.
>>>
>>> A. Track response times, allowing for lop-sided mirrors, and better
>>> failure detection.
>>
>> I've never seen a study which shows, categorically, that disk or
>> network failures are preceded by significant latency changes. How do
>> we get better failure detection from such measurements?
>
> Not preceded by as such, but a disk or network failure will certainly
> cause significant latency changes. If the hardware is down, there's
> going to be a sudden, and very large, change in latency. Sure, FMA
> will catch most cases, but we've already shown that there are some
> cases where it doesn't work too well (and I would argue that's always
> going to be possible when you are relying on so many different types
> of driver). This is there to ensure that ZFS can handle *all* cases.

I think that there is some confusion about FMA. The value of FMA is diagnosis. If there was no FMA, then driver timeouts would still exist. Where FMA is useful is in diagnosing the problem such that we know the fault is in the SAN and not the RAID array, for example. From the device driver level, all sd knows is that an I/O request to a device timed out. Similarly, all ZFS could know is what sd tells it.

>>> Many people have requested this since it would facilitate remote
>>> live mirrors.
>>
>> At a minimum, something like VxVM's preferred plex should be
>> reasonably easy to implement.
>>
>>> B. Use response times to timeout devices, dropping them to an
>>> interim failure mode while waiting for the official result from the
>>> driver.
>>> This would prevent redundant pools hanging when waiting for a
>>> single device.
>>
>> I don't see how this could work except for mirrored pools. Would
>> that carry enough market to be worthwhile?
>>  -- richard
>
> I have to admit, I've not tested this with a raided pool, but since
> all ZFS commands hung when my iSCSI device went offline, I assumed
> that you would get the same effect of the pool hanging if a raid-z2
> pool is waiting for a response from a device. Mirrored pools do work
> particularly well with this, since it gives you the potential to have
> remote mirrors of your data, but if you had a raid-z2 pool you still
> wouldn't want that hanging if a single device failed.

zpool commands hanging is CR 6667208, and has been fixed in b100.
http://bugs.opensolaris.org/view_bug.do?bug_id=6667208

> I will go and test the raid scenario though on a current build, just
> to be sure.

Please.
 -- richard
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, 27 Nov 2008 04:33:54 -0800 (PST), Ross [EMAIL PROTECTED] wrote:
> Hmm... I logged this CR ages ago, but now I've come to find it in the
> bug tracker I can't see it anywhere. I actually logged three CRs back
> to back; the first appears to have been created ok, but two have just
> disappeared. The one I created ok is:
> http://bugs.opensolaris.org/view_bug.do?bug_id=6766364
>
> There should be two other CRs created within a few minutes of that,
> one for disabling caching on CIFS shares, and one regarding this ZFS
> availability discussion. Could somebody at Sun let me know what's
> happened to these please.

Hi Ross,
I can't find the ZFS one you mention. The CIFS one is http://bugs.opensolaris.org/view_bug.do?bug_id=6766126. It's been marked as 'incomplete', so you should contact the R.E. - Alan M. Wright (at sun dot com, etc) - to find out what further info is required.

hth,
James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp   http://www.jmcp.homeunix.com/blog
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Thanks James, I've e-mailed Alan and submitted this one again.
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Hmm... I logged this CR ages ago, but now I've come to find it in the bug tracker I can't see it anywhere. I actually logged three CRs back to back; the first appears to have been created ok, but two have just disappeared. The one I created ok is:
http://bugs.opensolaris.org/view_bug.do?bug_id=6766364

There should be two other CRs created within a few minutes of that, one for disabling caching on CIFS shares, and one regarding this ZFS availability discussion. Could somebody at Sun let me know what's happened to these please.
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Hello,

Thank you for this very interesting thread! I want to confirm that synchronous distributed storage is my main goal when using ZFS. The target architecture is 1 local drive and 2 (or more) remote iSCSI targets, with ZFS being the iSCSI initiator. The system is sized so that the local disk can handle all the needed performance with a good margin, as can each of the iSCSI targets through large enough Ethernet fibres. I need any network problem not to slow down reads from the local disk, and writes to be stopped only if no remote targets are available after a timeout.

I also made a comment on that subject in:
http://blogs.sun.com/roch/entry/using_zfs_as_a_network

To myxiplx: we call a "sleeping failure" a failure of one part that is hidden by redundancy but not detected by monitoring. These are the most dangerous...

Would anybody be interested in supporting an open-source project seed called MiSCSI? This is for multicast iSCSI, so that only one write from the initiator is propagated by the network to all subscribed targets, with dynamic subscribing and resilvering being delegated to the remote targets. I would even prefer this behaviour to already exist in ZFS :-) Please leave me a comment if interested; I may send a draft for an RFP...

Best regards!
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Well, you're not alone in wanting to use ZFS and iSCSI like that, and in fact my change request suggested that this is exactly one of the things that could be addressed.

The idea is really a two stage RFE, since just the first part would have benefits. The key is to improve ZFS availability without affecting its flexibility, bringing it on par with traditional raid controllers:

A. Track response times, allowing for lop-sided mirrors and better failure detection. Many people have requested this since it would facilitate remote live mirrors.

B. Use response times to time out devices, dropping them to an interim failure mode while waiting for the official result from the driver. This would prevent redundant pools hanging while waiting for a single device.

Unfortunately, if your links tend to drop, you really need both parts. However, if this does get added to ZFS, all you would then need is standard monitoring on the ZFS pool. That would notify you when any device fails and the pool goes to a degraded state, making it easy to spot when either the remote mirrors or the local storage are having problems. I'd have thought it would make monitoring much simpler. And if this were possible, I would hope that you could configure iSCSI devices to automatically reconnect and resilver too, so the system would be self-repairing once faults are corrected, but I haven't gone so far as to test that yet.
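[Editor's note: the mechanism in A/B could be prototyped outside ZFS to see how the policy feels. A toy sketch, entirely invented for illustration: it reads (device, latency-ms) samples, flags a device as SUSPECTED (a made-up interim state, not a real ZFS state) the first time its latency exceeds a 500 ms threshold, and prints per-device averages. A real implementation would live in the kernel I/O path, not in awk.]

```shell
# Toy sketch of the RFE: track per-device response times and drop a
# device to an interim SUSPECTED state when its latency blows past a
# threshold, instead of waiting minutes for the driver's verdict.
# Input lines are "device latency_ms"; the 500 ms limit is invented
# for illustration, not a value taken from ZFS.
flag_slow() {
  awk -v limit=500 '
    { n[$1]++; sum[$1] += $2 }                       # accumulate samples
    $2 > limit && !flagged[$1] {                     # first breach only
      flagged[$1] = 1; print $1, "SUSPECTED"
    }
    END { for (d in n) printf "%s avg %.1f ms\n", d, sum[d] / n[d] }'
}

printf '%s\n' 'disk1 12' 'disk2 14' 'disk1 9' 'disk2 180000' | flag_slow
```

Here disk2's 180-second sample (the iSCSI 3-minute hang, roughly) gets it flagged immediately, while disk1 sails through; a smarter policy would compare against a per-device moving average rather than a fixed limit, which is essentially what part A asks for.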
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Well, you're not alone in wanting to use ZFS and iSCSI like that, and in fact my change request suggested that this is exactly one of the things that could be addressed: Thank you! Yes, this was also to tell you that you are not alone :-) I agree completely with you on your technical points!
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Ross wrote: Well, you're not alone in wanting to use ZFS and iSCSI like that, and in fact my change request suggested that this is exactly one of the things that could be addressed: The idea is really a two-stage RFE, since just the first part would have benefits. The key is to improve ZFS availability without affecting its flexibility, bringing it on par with traditional RAID controllers. A. Track response times, allowing for lopsided mirrors and better failure detection. I've never seen a study which shows, categorically, that disk or network failures are preceded by significant latency changes. How do we get better failure detection from such measurements? Many people have requested this since it would facilitate remote live mirrors. At a minimum, something like VxVM's preferred plex should be reasonably easy to implement. B. Use response times to time out devices, dropping them to an interim failure mode while waiting for the official result from the driver. This would prevent redundant pools hanging when waiting for a single device. I don't see how this could work except for mirrored pools. Would that carry enough market to be worthwhile? -- richard Unfortunately, if your links tend to drop, you really need both parts. However, if this does get added to ZFS, all you would then need is standard monitoring on the ZFS pool. That would notify you when any device fails and the pool goes to a degraded state, making it easy to spot when either the remote mirrors or the local storage are having problems. I'd have thought it would make monitoring much simpler. And if this were possible, I would hope that you could configure iSCSI devices to automatically reconnect and resilver too, so the system would be self-repairing once faults are corrected, but I haven't gone so far as to test that yet.
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Fri, Nov 28, 2008 at 5:05 AM, Richard Elling [EMAIL PROTECTED] wrote: Ross wrote: Well, you're not alone in wanting to use ZFS and iSCSI like that, and in fact my change request suggested that this is exactly one of the things that could be addressed: The idea is really a two-stage RFE, since just the first part would have benefits. The key is to improve ZFS availability without affecting its flexibility, bringing it on par with traditional RAID controllers. A. Track response times, allowing for lopsided mirrors and better failure detection. I've never seen a study which shows, categorically, that disk or network failures are preceded by significant latency changes. How do we get better failure detection from such measurements? Not preceded by as such, but a disk or network failure will certainly cause significant latency changes. If the hardware is down, there's going to be a sudden and very large change in latency. Sure, FMA will catch most cases, but we've already shown that there are some cases where it doesn't work too well (and I would argue that's always going to be possible when you are relying on so many different types of driver). This is there to ensure that ZFS can handle *all* cases. Many people have requested this since it would facilitate remote live mirrors. At a minimum, something like VxVM's preferred plex should be reasonably easy to implement. B. Use response times to time out devices, dropping them to an interim failure mode while waiting for the official result from the driver. This would prevent redundant pools hanging when waiting for a single device. I don't see how this could work except for mirrored pools. Would that carry enough market to be worthwhile? -- richard I have to admit, I've not tested this with a raided pool, but since all ZFS commands hung when my iSCSI device went offline, I assumed that you would get the same effect of the pool hanging if a raid-z2 pool is waiting for a response from a device.
Mirrored pools do work particularly well with this since it gives you the potential to have remote mirrors of your data, but if you had a raid-z2 pool, you still wouldn't want that hanging if a single device failed. I will go and test the raid scenario though on a current build, just to be sure.
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Hey folks, Well, there haven't been any more comments knocking holes in this idea, so I'm wondering now if I should log this as an RFE? Is this something others would find useful? Ross
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Ross wrote: Hey folks, Well, there haven't been any more comments knocking holes in this idea, so I'm wondering now if I should log this as an RFE? go for it! Is this something others would find useful? Yes. But remember that this has a very limited scope. Basically it will apply to mirrors, not raidz. Some people may find that to be uninteresting. Implementing something simple, like a preferred side, would be an easy first step (à la VxVM's preferred plex). -- richard
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Thinking about it, we could make use of this too. The ability to add a remote iSCSI mirror to any pool without sacrificing local performance could be a huge benefit. From: [EMAIL PROTECTED] To: [EMAIL PROTECTED] CC: [EMAIL PROTECTED]; zfs-discuss@opensolaris.org Subject: Re: Availability: ZFS needs to handle disk removal / driver failure better Date: Fri, 29 Aug 2008 09:15:41 +1200 Eric Schrock writes: A better option would be to not use this to perform FMA diagnosis, but instead work it into the mirror child selection code. This has already been alluded to before, but it would be cool to keep track of latency over time, and use this to both a) prefer one drive over another when selecting the child and b) proactively timeout/ignore results from one child and select the other if it's taking longer than some historical standard deviation. This keeps away from diagnosing drives as faulty, but does allow ZFS to make better choices and maintain response times. It shouldn't be hard to keep track of the average and/or standard deviation and use it for selection; proactively timing out the slow I/Os is much trickier. This would be a good solution to the remote iSCSI mirror configuration. I've been working through this situation with a client (we have been comparing ZFS with Cleversafe) and we'd love to be able to get the read performance of the local drives from such a pool. As others have mentioned, things get more difficult with writes. If I issue a write to both halves of a mirror, should I return when the first one completes, or when both complete? One possibility is to expose this as a tunable, but any such best-effort RAS is a little dicey because you have very little visibility into the state of the pool in this scenario - "is my data protected?" becomes a very difficult question to answer. One solution (again, to be used with a remote mirror) is the three-way mirror. If two devices are local and one remote, data is safe once the two local writes return.
I guess the issue then changes from "is my data safe?" to "how safe is my data?". I would be reluctant to deploy a remote mirror device without local redundancy, so this probably won't be an uncommon setup. There would have to be an acceptable window of risk when local data isn't replicated. Ian
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, Aug 28, 2008 at 11:21 PM, Ian Collins [EMAIL PROTECTED] wrote: Miles Nordin writes: suggested that unlike the SVM feature it should be automatic, because by so being it becomes useful as an availability tool rather than just a performance optimisation. So on a server with a read workload, how would you know if the remote volume was working? Even reads induce writes (last access time, if nothing else). My question: If a pool becomes non-redundant (e.g. due to a timeout, hotplug removal, bad data returned from a device, or for whatever reason), do we want the affected pool/vdev/system to hang? Generally speaking, I would say that this is what currently happens with other solutions. Conversely: Can the current situation be improved by allowing a device to be taken out of the pool for writes - e.g. be placed in read-only mode? I would assume it is possible to modify the CoW system / functions which allocate blocks for writes to ignore certain devices, at least temporarily. This would also lay the groundwork for allowing devices to be removed from a pool - e.g.: Step 1: Make the device read-only. Step 2: Touch every allocated block on that device (causing it to be copied to some other disk). Step 3: Remove it from the pool for reads as well, and finally remove it from the pool permanently. _hartz
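As a thought experiment, the three-step removal sequence can be modelled with a toy pool in Python (a block-to-device map standing in for real allocation; none of this is ZFS code):

```python
# Toy model of the proposed removal sequence. A real implementation would
# live in the ZFS allocator; here a dict maps block ids to (device, data).

class ToyPool:
    def __init__(self, devices):
        self.devices = list(devices)
        self.read_only = set()
        self.blocks = {}                 # block_id -> (device, data)

    def write(self, block_id, data):
        # CoW-style allocation that skips read-only devices (step 1's effect).
        for dev in self.devices:
            if dev not in self.read_only:
                self.blocks[block_id] = (dev, data)
                return dev
        raise RuntimeError("no writable device left")

    def evacuate(self, dev):
        self.read_only.add(dev)                    # step 1: reads only
        for block_id, (d, data) in list(self.blocks.items()):
            if d == dev:                           # step 2: touch every block,
                self.write(block_id, data)         # so CoW moves it elsewhere
        self.devices.remove(dev)                   # step 3: detach for good
        self.read_only.discard(dev)
```

The interesting property is that steps 1 and 2 are each independently useful: step 1 alone gives the "stop writing to a slow/suspect device" behaviour discussed in this thread, and step 2 on top of it gives device evacuation.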
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Ross Smith wrote: Triple mirroring you say? That'd be me then :D The reason I really want to get ZFS timeouts sorted is that our long-term goal is to mirror that over two servers too, giving us a pool mirrored across two servers, each of which is actually a ZFS iSCSI volume hosted on triply mirrored disks. Oh, and we'll have two sets of online off-site backups running raid-z2, plus a set of off-line backups too. All in all I'm pretty happy with the integrity of the data, wouldn't want to use anything other than ZFS for that now. I'd just like to get the availability working a bit better, without having to go back to buying RAID controllers. We have big plans for that too; once we get the iSCSI / iSER timeout issue sorted, our long-term availability goals are to have the setup I mentioned above hosted out from a pair of clustered Solaris NFS / CIFS servers. Failover time on the cluster is currently in the order of 5-10 seconds; if I can get the detection of a bad iSCSI link down under 2 seconds, we'll essentially have a worst-case scenario of 15 seconds downtime. I don't think this is possible for a stable system. 2-second failure detection for IP networks is troublesome for a wide variety of reasons. Even with Solaris Clusters, we can show consistent failover times for NFS services on the order of a minute (2-3 client retry intervals, including backoff). But getting to consistent sub-minute failover for a service like NFS might be a bridge too far, given the current technology and the amount of customization required to "make it work™". Downtime that low means it's effectively transparent for our users as all of our applications can cope with that seamlessly, and I'd really love to be able to do that this calendar year. I think most people (traders are a notable exception) and applications can deal with larger recovery times, as long as human intervention is not required.
-- richard
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Wow, some great comments on here now, even a few people agreeing with me, which is nice :D I'll happily admit I don't have the in-depth understanding of storage many of you guys have, but since the idea doesn't seem pie-in-the-sky crazy, I'm going to try to write up all my current thoughts on how this could work after reading through all the replies. 1. Track disk response times - ZFS should track the average response time of each disk. - This should be used internally for performance tweaking, so faster disks are favoured for reads. This works particularly well for lopsided mirrors. - I'd like to see this information (and the number of timeouts) in the output of zpool status, so administrators can see if any one device is performing badly. 2. New parameters - ZFS should gain two new parameters. - A timeout value for the pool. - An option to enable that timeout for writes too (off by default). - Still to be decided is whether that timeout is set manually, or automatically based on the information gathered in 1. - Do we need a pool timeout (based on the timeout of the slowest device in the pool), or will individual device timeouts work better? - I've decided that having this off by default for writes is probably better for ZFS. It addresses some people's concerns about writing to a degraded pool, and puts data integrity ahead of availability, which seems to fit better with ZFS' goals. I'd still like it myself; for writes, I can live with a pool running degraded for 2 minutes while the problem is diagnosed. - With that said, could the write timeout default to on when you have a slog device? After all, the data is safely committed to the slog, and should remain there until it's written to all devices. Bob, you seemed the most concerned about writes; would that be enough redundancy for you to be happy to have this on by default?
If not, I'd still be OK having it off by default; we could maybe just include it in the Evil Tuning Guide, suggesting that it could be turned on by anybody who has a separate slog device. 3. How it would work - If a read times out for any device, ZFS should immediately issue reads to all other devices holding that data. The first response back will be used. - Timeouts should be logged so the information can be used by administrators or FMA to help diagnose failing drives, but they should not count as a device failure on their own. - Some thought is needed as to how this algorithm works on busy pools. When reads are queuing up, we need to avoid false positives and avoid adding extra load on the pool. Would it be a possibility that instead of checking the response time for an individual request, this timeout is used to check whether no responses at all have been received from a device for that length of time? That still sounds reasonable for finding stuck devices, and should still work reliably on a busy pool. - For reads, the pool does not need to go degraded; the device is simply flagged as WAITING. - When enabled for writes, these will be going to all devices, so there are no alternate devices to try. This means any write timeout will be used to put the pool into a degraded mode. This should be considered a temporary state with the drive in WAITING status, as while the pool itself is degraded (due to missing the writes for that drive), the drive is not yet offline. At this point the system is simply keeping itself running while waiting for a proper error response from either the drive or from FMA. If the drive eventually returns the missing response, it can be resilvered with any data it missed. If the drive doesn't return a response, FMA should eventually fault it, and the drive can be taken offline and replaced with a hot spare. At all times the administrator can see what is going on using zpool status, with the appropriate pool and drive status visible.
- Please bear in mind that although I'm using the word 'degraded' above, this is not necessarily the case for dual-parity pools; I just don't know the proper term to use for a dual-parity raid set where a single drive has failed. - If this is just a one-off glitch and the device comes back online, the resilver shouldn't take long as ZFS just needs to send the data that was missed (which will still be stored in the ZIL). - If many devices time out at once due to a bad controller, cable pulled, power failure, etc., all the affected devices will be flagged as WAITING, and if too many have gone for the pool to stay operational, ZFS should switch the entire pool to the 'wait' state while it waits for FMA, etc. to return a proper response, after which it should react according to the failmode property for the pool. 4. Food for thought - While I like Nico's idea for lopsided mirrors, I'm not sure any tweaking is needed. I was thinking about whether these timeouts could improve performance for such a mirror, but I think a better option there is simply to use
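The read path described in point 3 above — time out, fan the read out to every other device holding the data, take the first answer — can be sketched with ordinary concurrency primitives. This is an illustration only (Python; the device objects and the WAITING flag are stand-ins, not ZFS internals):

```python
# Hedged read for point 3: if the preferred device misses its timeout, flag
# it WAITING (an interim state, not a fault) and race the remaining
# redundant devices, returning whichever answers first.
import concurrent.futures

def hedged_read(devices, block, timeout):
    with concurrent.futures.ThreadPoolExecutor(len(devices)) as pool:
        primary = pool.submit(devices[0].read, block)
        try:
            return primary.result(timeout=timeout)
        except concurrent.futures.TimeoutError:
            devices[0].state = "WAITING"   # await driver/FMA verdict
            backups = [pool.submit(d.read, block) for d in devices[1:]]
            done, _ = concurrent.futures.wait(
                backups, return_when=concurrent.futures.FIRST_COMPLETED)
            return next(iter(done)).result()
```

One caveat this sketch makes visible: the timed-out read is not cancelled, it is merely ignored, so a real implementation needs cancellable I/O (or something like B_FAILFAST) underneath to avoid a backlog of stuck requests.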
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Sat, 30 Aug 2008, Ross wrote: while the problem is diagnosed. - With that said, could the write timeout default to on when you have a slog device? After all, the data is safely committed to the slog, and should remain there until it's written to all devices. Bob, you seemed the most concerned about writes, would that be enough redundancy for you to be happy to have this on by default? If not, I'd still be ok having it off by default, we could maybe just include it in the evil tuning guide suggesting that this could be turned on by anybody who has a separate slog device. It is my impression that the slog device is only used for synchronous writes. Depending on the system, this could be just a small fraction of the writes. In my opinion, ZFS's primary goal is to avoid data loss, or consumption of wrong data. Availability is a lesser goal. If someone really needs maximum availability then they can go to triple mirroring or some other maximally redundant scheme. ZFS should do its best to continue moving forward as long as some level of redundancy exists. There could be an option to allow moving forward with no redundancy at all. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Triple mirroring you say? That'd be me then :D The reason I really want to get ZFS timeouts sorted is that our long term goal is to mirror that over two servers too, giving us a pool mirrored across two servers, each of which is actually a zfs iscsi volume hosted on triply mirrored disks. Oh, and we'll have two sets of online off-site backups running raid-z2, plus a set of off-line backups too. All in all I'm pretty happy with the integrity of the data, wouldn't want to use anything other than ZFS for that now. I'd just like to get the availability working a bit better, without having to go back to buying raid controllers. We have big plans for that too; once we get the iSCSI / iSER timeout issue sorted our long term availability goals are to have the setup I mentioned above hosted out from a pair of clustered Solaris NFS / CIFS servers. Failover time on the cluster is currently in the order of 5-10 seconds, if I can get the detection of a bad iSCSI link down under 2 seconds we'll essentially have a worst case scenario of 15 seconds downtime. Downtime that low means it's effectively transparent for our users as all of our applications can cope with that seamlessly, and I'd really love to be able to do that this calendar year. Anyway, getting back on topic, it's a good point about moving forward while redundancy exists. I think the flag for specifying the write behavior should have that as the default, with the optional setting being to allow the pool to continue accepting writes while the pool is in a non redundant state. Ross
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Eric Schrock writes: A better option would be to not use this to perform FMA diagnosis, but instead work it into the mirror child selection code. This has already been alluded to before, but it would be cool to keep track of latency over time, and use this to both a) prefer one drive over another when selecting the child and b) proactively timeout/ignore results from one child and select the other if it's taking longer than some historical standard deviation. This keeps away from diagnosing drives as faulty, but does allow ZFS to make better choices and maintain response times. It shouldn't be hard to keep track of the average and/or standard deviation and use it for selection; proactively timing out the slow I/Os is much trickier. This would be a good solution to the remote iSCSI mirror configuration. I've been working through this situation with a client (we have been comparing ZFS with Cleversafe) and we'd love to be able to get the read performance of the local drives from such a pool. As others have mentioned, things get more difficult with writes. If I issue a write to both halves of a mirror, should I return when the first one completes, or when both complete? One possibility is to expose this as a tunable, but any such best-effort RAS is a little dicey because you have very little visibility into the state of the pool in this scenario - "is my data protected?" becomes a very difficult question to answer. One solution (again, to be used with a remote mirror) is the three-way mirror. If two devices are local and one remote, data is safe once the two local writes return. I guess the issue then changes from "is my data safe?" to "how safe is my data?". I would be reluctant to deploy a remote mirror device without local redundancy, so this probably won't be an uncommon setup. There would have to be an acceptable window of risk when local data isn't replicated.
Ian
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Miles Nordin writes: bf == Bob Friesenhahn [EMAIL PROTECTED] writes: bf You are saying that I can't split my mirrors between a local bf disk in Dallas and a remote disk in New York accessed via bf iSCSI? nope, you've misread. I'm saying reads should go to the local disk only, and writes should go to both. See SVM's 'metaparam -r'. I suggested that unlike the SVM feature it should be automatic, because by so being it becomes useful as an availability tool rather than just performance optimisation. So on a server with a read workload, how would you know if the remote volume was working? Ian
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, Aug 28, 2008 at 11:29:21AM -0500, Bob Friesenhahn wrote: Which of these do you prefer? o System waits substantial time for devices to (possibly) recover in order to ensure that subsequently written data has the least chance of being lost. o System immediately ignores slow devices and switches to non-redundant non-fail-safe non-fault-tolerant may-lose-your-data mode. When system is under intense load, it automatically switches to the may-lose-your-data mode. Given how long a resilver might take, waiting some time for a device to come back makes sense. Also, if a cable was taken out, or a drive tray powered off, then you'll see lots of drives timing out, and then the better thing to do is to wait (heuristic: not enough spares to recover). Nico --
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, Aug 28, 2008 at 01:05:54PM -0700, Eric Schrock wrote: As others have mentioned, things get more difficult with writes. If I issue a write to both halves of a mirror, should I return when the first one completes, or when both complete? One possibility is to expose this as a tunable, but any such best effort RAS is a little dicey because you have very little visibility into the state of the pool in this scenario - is my data protected? becomes a very difficult question to answer. Depending on the amount of redundancy left, one might want the writes to continue. E.g., a 3-way mirror with one vdev timing out or going extra slow, or Richard's lopsided mirror example. Best-effort RAS might make a useful property for mirrors and RAIDZ-2. If because of some slow vdev you've got less redundancy for recent writes, but still have enough (for some value of enough), and still have full redundancy for older writes, well, that's not so bad. Something like: % # require successful writes to at least two mirrors and wait no more % # than 15 seconds for the 3rd. % zpool create mypool mirror ... mirror ... mirror ... % zpool set minimum_redundancy=1 mypool % zpool set vdev_write_wait=15s mypool and for known-to-be-lopsided mirrors: % # require successful writes to at least two mirrors and don't wait for % # the slow vdevs % zpool create mypool mirror ... mirror ... mirror -slow ... % zpool set minimum_redundancy=1 mypool % zpool set vdev_write_wait=0s mypool ? Nico --
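Roughly, the two proposed properties (`minimum_redundancy` and `vdev_write_wait` are hypothetical names from the post above, not real zpool properties) would give write semantics like this Python sketch: block until enough copies exist to satisfy the required redundancy, then wait a bounded time for the stragglers.

```python
# Sketch of the proposed write policy: a write "succeeds" once
# minimum_redundancy + 1 mirror copies are on stable storage; the slower
# vdevs then get a bounded grace period. Illustrative only.
import concurrent.futures

def mirrored_write(mirrors, data, minimum_redundancy, vdev_write_wait):
    need = minimum_redundancy + 1        # copies required before returning
    with concurrent.futures.ThreadPoolExecutor(len(mirrors)) as pool:
        futures = [pool.submit(m.write, data) for m in mirrors]
        done = 0
        # Block until enough copies exist for the required redundancy.
        for f in concurrent.futures.as_completed(futures):
            f.result()                   # propagate write errors
            done += 1
            if done == need:
                break
        # Bounded wait for the remaining, slower vdevs.
        finished, laggards = concurrent.futures.wait(
            futures, timeout=vdev_write_wait)
        return len(finished), len(laggards)
```

A non-empty `laggards` count is exactly the "less redundancy for recent writes" window Nico describes, so it is also the natural thing to surface in zpool status.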
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Nicolas Williams wrote: On Thu, Aug 28, 2008 at 11:29:21AM -0500, Bob Friesenhahn wrote: Which of these do you prefer? o System waits substantial time for devices to (possibly) recover in order to ensure that subsequently written data has the least chance of being lost. o System immediately ignores slow devices and switches to non-redundant non-fail-safe non-fault-tolerant may-lose-your-data mode. When system is under intense load, it automatically switches to the may-lose-your-data mode. Given how long a resilver might take, waiting some time for a device to come back makes sense. Also, if a cable was taken out, or drive tray powered off, then you'll see lots of drives timing out, and then the better thing to do is to wait (heuristic: not enough spares to recover). argv! I didn't even consider switches. Ethernet switches often use spanning-tree algorithms to converge on the topology. I'm not sure what SAN switches use. We have the following problem with highly available clusters which use switches in the interconnect: + Solaris Cluster interconnect timeout defaults to 10 seconds + STP can take 30 seconds to converge So, if you use Ethernet switches in the interconnect, you need to disable STP on the ports used for interconnects or risk unnecessary cluster reconfigurations. Normally, this isn't a problem, as the people who tend to build HA clusters also tend to read the docs which point this out. Still, a few slip through every few months. As usual, Solaris Cluster gets blamed, though it really is a systems engineering problem. Can we expect a similar attention to detail for ZFS implementers? I'm afraid not :-(. I'm not confident we can be successful with sub-minute reconfiguration, so B_FAILFAST may be the best we could do for the general case. That isn't so bad; in fact we use failfasts rather extensively for Solaris Clusters, too. -- richard
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
es == Eric Schrock [EMAIL PROTECTED] writes: es The main problem with exposing tunables like this is that they es have a direct correlation to service actions, and es mis-diagnosing failures costs everybody (admin, companies, es Sun, etc) lots of time and money. Once you expose such a es tunable, it will be impossible to trust any FMA diagnosis, Yeah, I tend to agree that the constants shouldn't be tunable, because I hoped Sun would become a disciplined collection-point for experience to set the constants, discipline meaning the constants are only adjusted in response to bad diagnosis not ``preference,'' and in a direction that improves diagnosis for everyone, not for ``the site''. I'm not yet won over to the idea that statistical FMA diagnosis constants shouldn't exist. I think drives can't diagnose themselves for shit, and I think drivers these days are diagnosees, not diagnosers. But clearly a confusingly-bad diagnosis is much worse than a diagnosis that's bad in a simple way. es If I issue a write to both halves of a mirror, should es I return when the first one completes, or when both complete? well, if it's not a synchronous write, you return before you've written either half of the mirror, so it's only an issue for O_SYNC/ZIL writes, true? BTW, what does ZFS do right now for synchronous writes to mirrors: wait for all, wait for two, or wait for one? es any such best effort RAS is a little dicey because you have es very little visibility into the state of the pool in this es scenario - is my data protected? becomes a very difficult es question to answer. I think it's already difficult. For example, a pool will say ONLINE while it's resilvering, won't it? I might be wrong. Take a pool that can only tolerate one failure. Is the difference between replacing an ONLINE device (still redundant) and replacing an OFFLINE device (not redundant until resilvered) captured?
Likewise, should a pool with a spare in use really be marked DEGRADED both before the spare resilvers and after? The answers to the questions aren't important so much as that you have to think about the answers---what should they be, what are they now---which means ``is my data protected?'' is already a difficult question to answer. Also there were recently fixed bugs with DTL. The status of each device's DTL, even the existence and purpose of the DTL, isn't well-exposed to the admin, and is relevant to answering the ``is my data protected?'' question---indirect means of inspecting it like tracking the status of resilvering seem too wallpapered given that the bug escaped notice for so long. I agree with the problem 100% and don't wish to worsen it, just disagree that it's a new one. re 3 orders of magnitude range for magnetic disk I/Os, 4 orders re of magnitude for power managed disks. I would argue for power management a fixed timeout. The time to spin up doesn't have anything to do with the io/s you got before the disk spun down. There's no reason to disguise the constant for which we secretly wish inside some fancy math for deriving it just because writing down constants feels bad. unless you _know_ the disk is spinning up through some in-band means, and want to compare its spinup time to recorded measurements of past spinups. This is a good case for pointing out there are two sets of rules: * 'metaparam -r' rules + not invoked at all if there's no redundancy. + very complicated - involve sets of disks, not one disk. comparison of statistic among disks within a vdev (definitely), and comparison of individual disks to themselves over time (possibly). - complicated output: rules return a set of disks per vdev, not a yay-or-nay diagnosis per disk. And there are two kinds of output decision: o for n-way mirrors, select anywhere from 1 to n disks. 
for example, a three-way mirror with two fast local mirrors, one slow remote iSCSI mirror, should split reads among the two local disks. for raidz and raidz2 they can eliminate 0, 1 (or 2) disks from the read-us set. It's possible to issue all the reads and take the first sufficient set to return as Anton suggested, but I imagine 4-device raidz2 vdevs will be common which could some day perform as well as a 2-device mirror. o also, decide when to stop waiting on an existing read and re-issue it. so the decision is not only about future reads, but has to cancel already-issued reads, possibly replacing the B_FAILFAST mechanism so there will be a second uncancellable round of reads once the first round exhausts all redundancy. o that second decision needs to be made thousands of times per second without a lot of CPU overhead + small consequence if the rules deliver false-positives, just reduced performance
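The per-vdev ``read set'' rule described above could be sketched roughly as follows. This is purely illustrative: the function shape, the 10x exclusion factor, and the latency inputs are all invented, not taken from any real ZFS code.

```python
# Hypothetical sketch of the per-vdev read-set rule: compare a latency
# statistic among the children of a vdev and return the subset that
# reads should be issued to. For mirrors, keep every child close to the
# fastest; for raidz/raidz2, drop at most `parity` slow children.

def read_set(latencies_ms, kind, parity=1, factor=10.0):
    """Return indices of children worth issuing reads to.

    latencies_ms -- recent average latency per child disk (assumed input)
    kind         -- "mirror" or "raidz"
    parity       -- for raidz: how many children may be eliminated
    factor       -- a child slower than factor * best is excluded
    """
    best = min(latencies_ms)
    slowest_first = sorted(range(len(latencies_ms)),
                           key=lambda i: latencies_ms[i], reverse=True)
    if kind == "mirror":
        # keep every child within `factor` of the fastest; always keep >= 1
        keep = [i for i, l in enumerate(latencies_ms) if l <= factor * best]
        return keep or [latencies_ms.index(best)]
    else:
        # raidz/raidz2: drop at most `parity` of the slowest children,
        # and only those that are genuinely far behind the best
        drop = {i for i in slowest_first[:parity]
                if latencies_ms[i] > factor * best}
        return [i for i in range(len(latencies_ms)) if i not in drop]

# three-way mirror: two fast local disks, one slow remote iSCSI disk
print(read_set([5.0, 6.0, 80.0], "mirror"))   # -> [0, 1]
```

Note the mirror case returns a set per vdev rather than a yay-or-nay per disk, matching the ``complicated output'' point above.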
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
re == Richard Elling [EMAIL PROTECTED] writes: re if you use Ethernet switches in the interconnect, you need to re disable STP on the ports used for interconnects or risk re unnecessary cluster reconfigurations. RSTP/802.1w plus setting the ports connected to Solaris as ``edge'' is good enough, less risky for the WAN, and pretty ubiquitously supported with non-EOL switches. The network guys will know this (assuming you have network guys) and do something like this: sw: can you disable STP for me? net: No? sw: jumping up and down screaming net: um,...i mean, Why? sw: [] net: oh, that. Ok, try it now. sw: thanks for disabling STP for me. net: i uh,.. whatever. No problem! re Can we expect a similar attention to detail for ZFS re implementers? I'm afraid not :-(. well, you weren't really ``expecting'' it of the sun cluster implementers. You just ran into it by surprise in the form of an Issue. so, can you expect ZFS implementers to accept that running ZFS, iSCSI, FC-SW might teach them something about their LAN/SAN they didn't already know? So far they seem receptive to arcane advice like ``make this config change in your SAN controller to let it use the NVRAM cache more aggressively, and stop using EMC PowerPath unless blah.'' so, Yes? I think you can also expect them to wait longer than 40 seconds before declaring a system is frozen and rebooting it, though. ``Let's `patiently wait' forever because we think, based on our uncertainty, that FSPF might take several hours to converge'' is the alternative that strikes me as unreasonable. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Miles Nordin wrote: re == Richard Elling [EMAIL PROTECTED] writes: re if you use Ethernet switches in the interconnect, you need to re disable STP on the ports used for interconnects or risk re unnecessary cluster reconfigurations. RSTP/802.1w plus setting the ports connected to Solaris as ``edge'' is good enough, less risky for the WAN, and pretty ubiquitously supported with non-EOL switches. The network guys will know this (assuming you have network guys) and do something like this: sw: can you disable STP for me? net: No? sw: jumping up and down screaming net: um,...i mean, Why? sw: [] net: oh, that. Ok, try it now. sw: thanks for disabling STP for me. net: i uh,.. whatever. No problem! Precisely, this is not a problem that is usually solved unilaterally. re Can we expect a similar attention to detail for ZFS re implementers? I'm afraid not :-(. well, you weren't really ``expecting'' it of the sun cluster implementers. You just ran into it by surprise in the form of an Issue. Rather, cluster implementers tend to RTFM. I know few ZFSers who have RTFM, and do not expect many to do so... such is life. so, can you expect ZFS implementers to accept that running ZFS, iSCSI, FC-SW might teach them something about their LAN/SAN they didn't already know? No, I expect them to see a problem caused by network reconfiguration and blame ZFS. Indeed, this is what occasionally happens with Solaris Cluster -- but only occasionally, and it is solved via RTFM. So far they seem receptive to arcane advice like ``make this config change in your SAN controller to let it use the NVRAM cache more aggressively, and stop using EMC PowerPath unless blah.'' so, Yes? I have no idea what you are trying to say here. I think you can also expect them to wait longer than 40 seconds before declaring a system is frozen and rebooting it, though. Current [s]sd driver timeouts are 60 seconds with 3-5 retries by default. 
We've had those timeouts for many, many years now and do provide highly available services on such systems. The B_FAILFAST change did improve the availability of systems and similar tricks have improved service availability for Solaris Clusters. Refer to Eric's post for more details of this minefield. NB some bugids one should research before filing new bugs here are: CR 4713686: sd/ssd driver should have an additional target specific timeout http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=4713686 CR 4500536 introduces B_FAILFAST http://bugs.opensolaris.org/view_bug.do?bug_id=4500536 ``Let's `patiently wait' forever because we think, based on our uncertainty, that FSPF might take several hours to converge'' is the alternative that strikes me as unreasonable. AFAICT, nobody is making such a proposal. Did I miss a post? -- richard
[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Since somebody else has just posted about their entire system locking up when pulling a drive, I thought I'd raise this for discussion. I think Ralf made a very good point in the other thread. ZFS can guarantee data integrity, what it can't do is guarantee data availability. The problem is, the way ZFS is marketed people expect it to be able to do just that. This turned into a longer thread than expected, so I'll start with what I'm asking for, and then attempt to explain my thinking. I'm essentially asking for two features to improve the availability of ZFS pools: - Isolation of storage drivers so that buggy drivers do not bring down the OS. - ZFS timeouts to improve pool availability when no timely response is received from storage drivers. And my reasons for asking for these are that there are now many, many posts on here about people experiencing either total system lockup or ZFS lockup after removing a hot swap drive, and indeed while some of them are using consumer hardware, others have reported problems with server grade kit that definitely should be able to handle these errors: Aug 2008: AMD SB600 - System hang - http://www.opensolaris.org/jive/thread.jspa?threadID=70349 Aug 2008: Supermicro SAT2-MV8 - System hang - http://www.opensolaris.org/jive/thread.jspa?messageID=271218 May 2008: Sun hardware - ZFS hang - http://opensolaris.org/jive/thread.jspa?messageID=240481 Feb 2008: iSCSI - ZFS hang - http://www.opensolaris.org/jive/thread.jspa?messageID=206985 Oct 2007: Supermicro SAT2-MV8 - system hang - http://www.opensolaris.org/jive/thread.jspa?messageID=166037 Sept 2007: Fibre channel - http://opensolaris.org/jive/thread.jspa?messageID=151719 ... etc Now while the root cause of each of these may be slightly different, I feel it would still be good to address this if possible as it's going to affect the perception of ZFS as a reliable system. 
The common factor in all of these is that either the solaris driver hangs and locks the OS, or ZFS hangs and locks the pool. Most of these are for hardware that should handle these failures fine (mine occurred for hardware that definitely works fine under windows), so I'm wondering: Is there anything that can be done to prevent either type of lockup in these situations? Firstly, for the OS, if a storage component (hardware or driver) fails for a non essential part of the system, the entire OS should not hang. I appreciate there isn't a lot you can do if the OS is using the same driver as its storage, but certainly in some of the cases above, the OS and the data are using different drivers, and I expect more examples of that could be found with a bit of work. Is there any way storage drivers could be isolated such that the OS (and hence ZFS) can report a problem with that particular driver without hanging the entire system? Please note: I know work is being done on FMA to handle all kinds of bugs, I'm not talking about that. It seems to me that FMA involves proper detection and reporting of bugs, which involves knowing in advance what the problems are and how to report them. What I'm looking for is something much simpler, something that's able to keep the OS running when it encounters unexpected or unhandled behaviour from storage drivers or hardware. It seems to me that one of the benefits of ZFS is working against it here. It's such a flexible system it's being used for many, many types of devices, and that means there are a whole host of drivers being used, and a lot of scope for bugs in those drivers. I know that ultimately any driver issues will need to be sorted individually, but what I'm wondering is whether there's any possibility of putting some error checking code at a layer above the drivers in such a way it's able to trap major problems without hanging the OS? 
ie: update ZFS/Solaris so they can handle storage layer bugs gracefully without downing the entire system. My second suggestion is to ask if ZFS can be made to handle unexpected events more gracefully. In the past I've suggested that ZFS have a separate timeout so that a redundant pool can continue working even if one device is not responding, and I really think that would be worthwhile. My idea is to have a WAITING status flag for drives, so that if one isn't responding quickly, ZFS can flag it as WAITING, and attempt to read or write the same data from elsewhere in the pool. That would work alongside the existing failure modes, and would allow ZFS to handle hung drivers much more smoothly, preventing redundant pools hanging when a single drive fails. The ZFS update I feel is particularly appropriate. ZFS already uses checksumming since it doesn't trust drivers or hardware to always return the correct data. But ZFS then trusts those same drivers and hardware absolutely when it comes to the availability of the pool. I believe ZFS should apply the same tough standards to pool availability as it does to data integrity. A bad checksum makes ZFS read the data from elsewhere, why shouldn't a timeout do the same thing?
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, 28 Aug 2008, Ross wrote: I believe ZFS should apply the same tough standards to pool availability as it does to data integrity. A bad checksum makes ZFS read the data from elsewhere, why shouldn't a timeout do the same thing? A problem is that for some devices, a five minute timeout is ok. For others, there must be a problem if the device does not respond in a second or two. If the system or device is simply overwhelmed with work, then you would not want the system to go haywire and make the problems much worse. Which of these do you prefer? o System waits substantial time for devices to (possibly) recover in order to ensure that subsequently written data has the least chance of being lost. o System immediately ignores slow devices and switches to non-redundant non-fail-safe non-fault-tolerant may-lose-your-data mode. When system is under intense load, it automatically switches to the may-lose-your-data mode. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Ross, thanks for the feedback. A couple points here - A lot of work went into improving the error handling around build 77 of Nevada. There are still problems today, but a number of the complaints we've seen are on s10 software or older nevada builds that didn't have these fixes. Anything from the pre-2008 (or pre-s10u5) timeframe should be taken with a grain of salt. There is a fix in the immediate future to prevent I/O timeouts from hanging other parts of the system - namely administrative commands and other pool activity. So I/O to that particular pool will hang, but you'll still be able to run your favorite ZFS commands, and it won't impact the ability of other pools to run. We have some good ideas on how to improve the retry logic. There is a flag in Solaris, B_FAILFAST, that tells the drive to not try too hard getting the data. However, it can return failure when trying harder would produce the correct results. Currently, we try the first I/O with B_FAILFAST, and if that fails immediately retry without the flag. The idea is to elevate the retry logic to a higher level, so when a read from a side of a mirror fails with B_FAILFAST, instead of immediately retrying the same device without the failfast flag, we push the error higher up the stack, and issue another B_FAILFAST I/O to the other half of the mirror. Only if both fail with failfast do we try a more thorough request (though with ditto blocks we may try another vdev altogether). This should improve I/O error latency for a subset of failure scenarios, and biasing reads away from degraded (but not faulty) devices should also improve response time. The tricky part is incorporating this into the FMA diagnosis engine, as devices may fail B_FAILFAST requests for a variety of non-fatal reasons. Finally, imposing additional timeouts in ZFS is a bad idea. ZFS is designed to be a generic storage consumer. 
It can be layered on top of directly attached disks, SSDs, SAN devices, iSCSI targets, files, and basically anything else. As such, it doesn't have the necessary context to know what constitutes a reasonable timeout. This is explicitly delegated to the underlying storage subsystem. If a storage subsystem is timing out for excessive periods of time when B_FAILFAST is set, then that's a bug in the storage subsystem, and working around it in ZFS with yet another set of tunables is not practical. It will be interesting to see if this is an issue after the retry logic is modified as described above. Hope that helps, - Eric On Thu, Aug 28, 2008 at 01:08:26AM -0700, Ross wrote: Since somebody else has just posted about their entire system locking up when pulling a drive, I thought I'd raise this for discussion. I think Ralf made a very good point in the other thread. ZFS can guarantee data integrity, what it can't do is guarantee data availability. The problem is, the way ZFS is marketed people expect it to be able to do just that. This turned into a longer thread than expected, so I'll start with what I'm asking for, and then attempt to explain my thinking. I'm essentially asking for two features to improve the availability of ZFS pools: - Isolation of storage drivers so that buggy drivers do not bring down the OS. - ZFS timeouts to improve pool availability when no timely response is received from storage drivers. 
And my reasons for asking for these are that there are now many, many posts on here about people experiencing either total system lockup or ZFS lockup after removing a hot swap drive, and indeed while some of them are using consumer hardware, others have reported problems with server grade kit that definitely should be able to handle these errors: Aug 2008: AMD SB600 - System hang - http://www.opensolaris.org/jive/thread.jspa?threadID=70349 Aug 2008: Supermicro SAT2-MV8 - System hang - http://www.opensolaris.org/jive/thread.jspa?messageID=271218 May 2008: Sun hardware - ZFS hang - http://opensolaris.org/jive/thread.jspa?messageID=240481 Feb 2008: iSCSI - ZFS hang - http://www.opensolaris.org/jive/thread.jspa?messageID=206985 Oct 2007: Supermicro SAT2-MV8 - system hang - http://www.opensolaris.org/jive/thread.jspa?messageID=166037 Sept 2007: Fibre channel - http://opensolaris.org/jive/thread.jspa?messageID=151719 ... etc Now while the root cause of each of these may be slightly different, I feel it would still be good to address this if possible as it's going to affect the perception of ZFS as a reliable system. The common factor in all of these is that either the solaris driver hangs and locks the OS, or ZFS hangs and locks the pool. Most of these are for hardware that should handle these failures fine (mine occurred for hardware that definitely works fine under windows), so I'm wondering: Is there anything that can be done to prevent either type of lockup in these situations? Firstly, for the OS, if a storage component (hardware or driver)
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
es == Eric Schrock [EMAIL PROTECTED] writes: es Finally, imposing additional timeouts in ZFS is a bad idea. es [...] As such, it doesn't have the necessary context to know es what constitutes a reasonable timeout. you're right in terms of fixed timeouts, but there's no reason it can't compare the performance of redundant data sources, and if one vdev performs an order of magnitude slower than another set of vdevs with sufficient redundancy, stop issuing reads except scrubs/healing to the underperformer (issue writes only), and pass an event to FMA. ZFS can also compare the performance of a drive to itself over time, and if the performance suddenly decreases, do the same. The former case eliminates the need for the mirror policies in SVM, which Ian requested a few hours ago for the situation that half the mirror is a slow iSCSI target for geographic redundancy and half is faster/local. Some care would have to be taken for targets shared by ZFS and some other initiator, but I'm not sure the care would really be that difficult to take, or that the oscillations induced by failing to take it would really be particularly harmful compared to unsupervised contention for a device. The latter notices quickly drives that have been pulled, or for Richard's ``overwhelmingly dominant'' case, for drives which are stalled for 30 seconds pending their report of an unrecovered read. Developing meaningful performance statistics for drives and a tool for displaying them would be useful for itself, not just for stopping freezes and preventing a failing drive from degrading performance a thousandfold. Issuing reads to redundant devices is cheap compared to freezing. The policy with which it's done is highly tunable and should be fun to tune and watch, and the consequence if the policy makes the wrong choice isn't incredibly dire. This B_FAILFAST architecture captures the situation really poorly. 
First, it's not implementable in any serious way with near-line drives, or really with any drives with which you're not intimately familiar and in control of firmware/release-engineering, and perhaps not with any drives period. I suspect in practice it's more a controller-level feature, about whether or not you'd like to distrust the device's error report and start resetting busses and channels and mucking everything up trying to recover from some kind of ``weirdness''. It's not an answer to the known problem of drives stalling for 30 seconds when they start to fail. First and a half, when it's not implemented, the system degrades to doubling your timeout pointlessly. A driver-level block cache of UNC's would probably have more value toward this speed/read-aggressiveness tradeoff than the whole B_FAILFAST architecture---just cache known unrecoverable read sectors, and refuse to issue further I/O for them until a timeout of 3 - 10 minutes passes. I bet this would speed up most failures tremendously, and without burdening upper layers with retry logic. Second, B_FAILFAST entertains the fantasy that I/O's are independent, while what happens in practice is that the drive hits a UNC on one I/O, and won't entertain any further I/O's no matter what flags the request has on it or how many times you ``reset'' things. Maybe you could try to rescue B_FAILFAST by putting clever statistics into the driver to compare the drive's performance to recent past as I suggested ZFS do, and admit no B_FAILFAST requests to queues of drives that have suddenly slowed down, just fail them immediately without even trying. 
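The driver-level cache of UNCs suggested above could be as simple as the following sketch. The interface, the 300-second default, and the time-injection style are all assumptions made for illustration; nothing here comes from an actual Solaris driver.

```python
# Minimal sketch of a driver-level UNC (unrecoverable read) cache:
# remember sectors that returned an unrecoverable-read error and refuse
# to re-issue I/O for them until a hold-down window (the text suggests
# 3 - 10 minutes) has passed, so upper layers fail fast to redundancy.

class UncCache:
    def __init__(self, hold_s=300.0):
        self.hold_s = hold_s
        self._bad = {}            # sector -> time the UNC was recorded

    def record_unc(self, sector, now):
        self._bad[sector] = now

    def should_issue(self, sector, now):
        """False while the sector is inside its hold-down window."""
        t = self._bad.get(sector)
        if t is None:
            return True
        if now - t >= self.hold_s:
            del self._bad[sector]   # window expired: allow one retry
            return True
        return False
```

The point of the sketch is that the cache needs no retry logic in upper layers at all: a read of a known-bad sector is refused immediately, and the caller falls back to a redundant copy.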
I submit this queueing and statistic collection is actually _better_ managed by ZFS than the driver because ZFS can compare a whole floating-point statistic across a whole vdev, while even a driver which is fancier than we ever dreamed, is still playing poker with only 1 bit of input ``I'll call,'' or ``I'll fold.'' ZFS can see all the cards and get better results while being stupider and requiring less clever poker-guessing than would be required by a hypothetical driver B_FAILFAST implementation that actually worked.
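The ``floating-point statistic across a whole vdev'' could be as simple as an exponentially weighted moving average of per-child latency with an order-of-magnitude outlier test. The smoothing factor, the 10x threshold, and the function names below are all invented for illustration.

```python
# Sketch: track an EWMA of latency per vdev child, and flag any child
# whose average is an order of magnitude above its siblings' median.
# A flagged child would stop receiving reads and generate an FMA event.

def ewma_update(avg, sample, alpha=0.2):
    """Fold one latency sample into the running average."""
    return sample if avg is None else alpha * sample + (1 - alpha) * avg

def outliers(avgs, ratio=10.0):
    """Indices of children more than `ratio` times the median latency."""
    vals = sorted(avgs)
    median = vals[len(vals) // 2]
    return [i for i, a in enumerate(avgs) if a > ratio * median]

# child 2 has started taking ~30 s per I/O while siblings take ~10 ms
print(outliers([0.010, 0.012, 30.0]))   # -> [2]
```

Note the decision compares siblings to each other, not to any fixed timeout, which is exactly what a single-disk driver cannot do.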
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, Aug 28, 2008 at 02:17:08PM -0400, Miles Nordin wrote: you're right in terms of fixed timeouts, but there's no reason it can't compare the performance of redundant data sources, and if one vdev performs an order of magnitude slower than another set of vdevs with sufficient redundancy, stop issuing reads except scrubs/healing to the underperformer (issue writes only), and pass an event to FMA. Yep, latency would be a useful metric to add to mirroring choices. The current logic is rather naive (round-robin) and could easily be enhanced. Making diagnoses based on this is much trickier, particularly at the ZFS level. A better option would be to leverage the SCSI FMA work going on to do a more intimate diagnosis at the scsa level. Also, the problem you are trying to solve - timing out the first I/O to take a long time - is not captured well by the type of hysteresis you would need to perform in order to do this diagnosis. It certainly can be done, but is much better suited to diagnosing a failing drive over time, not aborting a transaction in response to immediate failure. This B_FAILFAST architecture captures the situation really poorly. I don't think you understand how this works. Imagine two I/Os, just with different sd timeouts and retry logic - that's B_FAILFAST. It's quite simple, and independent of any hardware implementation. - Eric -- Eric Schrock, Fishworks http://blogs.sun.com/eschrock
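The elevated retry ordering Eric described earlier in the thread (failfast to one mirror half, then failfast to the other, and only then a thorough retry) could be sketched like this. The function names stand in for driver calls and are assumptions, not real interfaces.

```python
# Sketch of the proposed elevated B_FAILFAST retry logic for a mirror:
# round 1 issues a fast, no-retry read to each half in turn; a slow,
# thorough retry happens only after every failfast attempt has failed.

def mirror_read(halves, failfast_read, thorough_read):
    """failfast_read/thorough_read return data on success, None on error."""
    # Round 1: failfast against every half before trying hard anywhere
    for h in halves:
        data = failfast_read(h)
        if data is not None:
            return data
    # Round 2: all failfast attempts failed; now try the thorough path
    for h in halves:
        data = thorough_read(h)
        if data is not None:
            return data
    return None   # total failure: surface the error to the caller
```

Contrast this with the behaviour described as current, where a single half is retried thoroughly before the other half is ever consulted.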
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, 28 Aug 2008, Miles Nordin wrote: you're right in terms of fixed timeouts, but there's no reason it can't compare the performance of redundant data sources, and if one vdev performs an order of magnitude slower than another set of vdevs with sufficient redundancy, stop issuing reads except scrubs/healing to the underperformer (issue writes only), and pass an event to FMA. You are saying that I can't split my mirrors between a local disk in Dallas and a remote disk in New York accessed via iSCSI? Why don't you want me to be able to do that? ZFS already backs off from writing to slow vdevs. ZFS can also compare the performance of a drive to itself over time, and if the performance suddenly decreases, do the same. While this may be useful for reads, I would hate to disable redundancy just because a device is currently slow. Bob
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
Hi guys, Bob, my thought was to have this timeout as something that can be optionally set by the administrator on a per pool basis. I'll admit I was mainly thinking about reads and hadn't considered the write scenario, but even having thought about that it's still a feature I'd like. After all, this would be a timeout set by the administrator based on the longest delay they can afford for that storage pool. Personally, if a SATA disk wasn't responding to any requests after 2 seconds I really don't care if an error has been detected, as far as I'm concerned that disk is faulty. I'd be quite happy for the array to drop to a degraded mode based on that and for writes to carry on with the rest of the array. Eric, thanks for the extra details, they're very much appreciated. It's good to hear you're working on this, and I love the idea of doing a B_FAILFAST read on both halves of the mirror. I do have a question though. From what you're saying, the response time can't be consistent across all hardware, so you're once again at the mercy of the storage drivers. Do you know how long B_FAILFAST takes to return a response on iSCSI? If that's over 1-2 seconds I would still consider that too slow I'm afraid. I understand that Sun in general don't want to add fault management to ZFS, but I don't see how this particular timeout does anything other than help ZFS when it's dealing with such a diverse range of media. I agree that ZFS can't know itself what should be a valid timeout, but that's exactly why this needs to be an optional administrator set parameter. The administrator of a storage array who wants to set this certainly knows what a valid timeout is for them, and these timeouts are likely to be several orders of magnitude larger than the standard response times. 
I would configure very different values for my SATA drives as for my iSCSI connections, but in each case I would be happier knowing that ZFS has more of a chance of catching bad drivers or unexpected scenarios. I very much doubt hardware raid controllers would wait 3 minutes for a drive to return a response, they will have their own internal timeouts to know when a drive has failed, and while ZFS is dealing with very different hardware I can't help but feel it should have that same approach to management of its drives. However, that said, I'll be more than willing to test the new B_FAILFAST logic on iSCSI once it's released. Just let me know when it's out. Ross Date: Thu, 28 Aug 2008 11:29:21 -0500 From: [EMAIL PROTECTED] To: [EMAIL PROTECTED] CC: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better On Thu, 28 Aug 2008, Ross wrote: I believe ZFS should apply the same tough standards to pool availability as it does to data integrity. A bad checksum makes ZFS read the data from elsewhere, why shouldn't a timeout do the same thing? A problem is that for some devices, a five minute timeout is ok. For others, there must be a problem if the device does not respond in a second or two. If the system or device is simply overwhelmed with work, then you would not want the system to go haywire and make the problems much worse. Which of these do you prefer? o System waits substantial time for devices to (possibly) recover in order to ensure that subsequently written data has the least chance of being lost. o System immediately ignores slow devices and switches to non-redundant non-fail-safe non-fault-tolerant may-lose-your-data mode. When system is under intense load, it automatically switches to the may-lose-your-data mode. 
Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
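Ross's admin-set per-pool timeout, together with the WAITING flag he proposed earlier in the thread, might look roughly like this toy sketch. The state names, the 2-second default, and the whole interface are invented assumptions, not anything from ZFS.

```python
# Toy model of the proposed WAITING state: if a child does not answer a
# read within an admin-set per-pool limit, flag it WAITING and satisfy
# the read from a redundant copy, without yet faulting the device.

ONLINE, WAITING, FAULTED = "ONLINE", "WAITING", "FAULTED"

class Child:
    def __init__(self, name):
        self.name = name
        self.state = ONLINE

def read_block(children, issue, limit_s=2.0):
    """Try children in order; `issue(child)` returns (ok, seconds_taken)."""
    for c in children:
        if c.state == FAULTED:
            continue
        ok, took = issue(c)
        if ok and took <= limit_s:
            if c.state == WAITING:     # device answered in time: recover it
                c.state = ONLINE
            return c.name
        # slow or failed: flag WAITING and fall through to the next copy
        c.state = WAITING
    return None
```

Deciding when a WAITING device should escalate to FAULTED (permanently, via FMA and a hot spare) is exactly the hard part debated in the thread and is deliberately left out of the sketch.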
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
es == Eric Schrock [EMAIL PROTECTED] writes: es I don't think you understand how this works. Imagine two es I/Os, just with different sd timeouts and retry logic - that's es B_FAILFAST. It's quite simple, and independent of any es hardware implementation. AIUI the main timeout to which we should be subject, at least for nearline drives, is about 30 seconds long and is decided by the drive's firmware, not the driver, and can't be negotiated in any way that's independent of the hardware implementation, although sometimes there are dependent ways to negotiate it. The driver could also decide through ``retry logic'' to time out the command sooner, before the drive completes it, but this won't do much good because the drive won't accept a second command until ITS timeout expires. Which leads to the second problem: we're talking about timeouts for individual I/O's, not marking whole devices. A ``fast'' timeout of even 1 second could cause a 100- or 1000-fold decrease in performance, which could end up being equivalent to a freeze depending on the type of load on the filesystem.
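The 100- to 1000-fold figure above can be sanity-checked with simple arithmetic under assumed service times; the 10 ms healthy latency and 30 s firmware stall are illustrative numbers, not measurements.

```python
# If a healthy random read takes ~10 ms but a drive stuck retrying an
# unrecovered sector holds each command for ~30 s (its firmware timeout),
# then every I/O that still lands on that drive is slowed by:

healthy_ms = 10.0          # assumed per-I/O service time on a good drive
stalled_s = 30.0           # assumed firmware retry stall per command

slowdown = (stalled_s * 1000.0) / healthy_ms
print(slowdown)            # -> 3000.0, i.e. a ~3000x per-I/O slowdown
```

Even a 1-second ``fast'' timeout only trims that to ~100x, since the drive still refuses new commands until its own timeout expires.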
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
bf == Bob Friesenhahn [EMAIL PROTECTED] writes: bf If the system or device is simply overwelmed with work, then bf you would not want the system to go haywire and make the bf problems much worse. None of the decisions I described its making based on performance statistics are ``haywire''---I said it should funnel reads to the faster side of the mirror, and do this really quickly and unconservatively. What's your issue with that? bf You are saying that I can't split my mirrors between a local bf disk in Dallas and a remote disk in New York accessed via bf iSCSI? nope, you've misread. I'm saying reads should go to the local disk only, and writes should go to both. See SVM's 'metaparam -r'. I suggested that unlike the SVM feature it should be automatic, because by so being it becomes useful as an availability tool rather than just performance optimisation. The performance-statistic logic should influence read scheduling immediately, and generate events which are fed to FMA, then FMA can mark devices faulty. There's no need for both to make the same decision at the same time. If the events aren't useful for diagnosis, ZFS could not bother generating them, or fmd could ignore them in its diagnosis. I suspect they *would* be useful, though. I'm imagining the read rescheduling would happen very quickly, quicker than one would want a round-trip from FMA, in much less than a second. That's why it would have to compare devices to others in the same vdev, and to themselves over time, rather than use fixed timeouts or punt to haphazard driver and firmware logic. bf o System waits substantial time for devices to (possibly) bf recover in order to ensure that subsequently written data has bf the least chance of being lost. There's no need for the filesystem to *wait* for data to be written, unless you are calling fsync. and maybe not even then if there's a slog. I said clearly that you read only one half of the mirror, but write to both. 
But you're right that the trick probably won't work perfectly---eventually dead devices need to be faulted. The idea is that normal write caching will buy you orders of magnitude longer time in which to make a better decision before anyone notices. Experience here is that ``waits substantial time'' usually means ``freezes for hours and gets rebooted''. There's no need to be abstract: we know what happens when a drive starts taking 1000x - 2000x longer than usual to respond to commands, and we know that this is THE common online failure mode for drives. That's what started the thread. so, think about this: hanging for an hour trying to write to a broken device may block other writes to devices which are still working, until the patiently-waiting data is eventually lost in the reboot. bf o System immediately ignores slow devices and switches to bf non-redundant non-fail-safe non-fault-tolerant bf may-lose-your-data mode. When system is under intense load, bf it automatically switches to the may-lose-your-data mode. nobody's proposing a system which silently rocks back and forth between faulted and online. That's not what we have now, and no such system would naturally arise. If FMA marked a drive faulty based on performance statistics, that drive would get retired permanently and hot-spare-replaced. Obviously false positives are bad, just as obviously as freezes/reboots are bad. It's not my idea to use FMA in this way. This is how FMA was pitched, and the excuse for leaving good exception handling out of ZFS for two years. so, where's the beef?
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, Aug 28, 2008 at 08:34:24PM +0100, Ross Smith wrote:

> Personally, if a SATA disk wasn't responding to any requests after 2
> seconds I really don't care if an error has been detected, as far as
> I'm concerned that disk is faulty.

Unless you have power management enabled, or there's a bad region of the disk, or the bus was reset, or...

> I do have a question though. From what you're saying, the response
> time can't be consistent across all hardware, so you're once again at
> the mercy of the storage drivers. Do you know how long B_FAILFAST
> takes to return a response on iSCSI? If that's over 1-2 seconds I
> would still consider that too slow I'm afraid.

Its main function is how it deals with retryable errors. If the drive responds with a retryable error, or any error at all, it won't attempt to retry again. If you have a device that is taking arbitrarily long to respond to successful commands (or to notice that a command won't succeed), it won't help you.

> I understand that Sun in general don't want to add fault management
> to ZFS, but I don't see how this particular timeout does anything
> other than help ZFS when it's dealing with such a diverse range of
> media. I agree that ZFS can't know itself what a valid timeout should
> be, but that's exactly why this needs to be an optional,
> administrator-set parameter. The administrator of a storage array who
> wants to set this certainly knows what a valid timeout is for them,
> and these timeouts are likely to be several orders of magnitude
> larger than the standard response times. I would configure very
> different values for my SATA drives than for my iSCSI connections,
> but in each case I would be happier knowing that ZFS has more of a
> chance of catching bad drivers or unexpected scenarios.

The main problem with exposing tunables like this is that they have a direct correlation to service actions, and mis-diagnosing failures costs everybody (admins, companies, Sun, etc.) lots of time and money.
Once you expose such a tunable, it will be impossible to trust any FMA diagnosis, because you won't be able to know whether it was caused by a mistaken tunable.

A better option would be not to use this for FMA diagnosis, but instead to work it into the mirror child selection code. This has already been alluded to before, but it would be cool to keep track of latency over time, and use this to both a) prefer one drive over another when selecting the child and b) proactively timeout/ignore results from one child and select the other if it's taking longer than some historical standard deviation. This keeps away from diagnosing drives as faulty, but does allow ZFS to make better choices and maintain response times. It shouldn't be hard to keep track of the average and/or standard deviation and use it for selection; proactively timing out the slow I/Os is much trickier.

As others have mentioned, things get more difficult with writes. If I issue a write to both halves of a mirror, should I return when the first one completes, or when both complete? One possibility is to expose this as a tunable, but any such best-effort RAS is a little dicey, because you have very little visibility into the state of the pool in this scenario - "is my data protected?" becomes a very difficult question to answer.

- Eric

--
Eric Schrock, Fishworks    http://blogs.sun.com/eschrock
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, 28 Aug 2008, Miles Nordin wrote:

> None of the decisions I described it making based on performance
> statistics are ``haywire''---I said it should funnel reads to the
> faster side of the mirror, and do this really quickly and
> unconservatively. What's your issue with that?

From what I understand, this is partially happening now based on average service time. If I/O is backed up for a device, then the other device is preferred. However, it is good to keep in mind that if data is never read, then it is never validated and corrected. It is good for ZFS to read data sometimes.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, 2008-08-28 at 13:05 -0700, Eric Schrock wrote:

> A better option would be not to use this for FMA diagnosis, but
> instead to work it into the mirror child selection code. This has
> already been alluded to before, but it would be cool to keep track of
> latency over time, and use this to both a) prefer one drive over
> another when selecting the child and b) proactively timeout/ignore
> results from one child and select the other if it's taking longer
> than some historical standard deviation. This keeps away from
> diagnosing drives as faulty, but does allow ZFS to make better
> choices and maintain response times. It shouldn't be hard to keep
> track of the average and/or standard deviation and use it for
> selection; proactively timing out the slow I/Os is much trickier.

TCP has to solve essentially the same problem: decide when a response is overdue based only on the timing of recent successful exchanges, in a context where it's difficult to make assumptions about the reasonable expected behavior of the underlying network. It tracks both the smoothed round-trip time and the variance, and declares a response overdue after (SRTT + K * variance). I think you'd probably do well to start with something similar to what's described in http://www.ietf.org/rfc/rfc2988.txt and then tweak based on experience.

- Bill
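For reference, the RFC 2988 estimator Bill points at is small; transplanted from TCP round-trip times to I/O response times (units here are milliseconds rather than the RFC's seconds), it looks like this sketch:

```c
/*
 * RFC 2988-style retransmission-timeout estimator, applied to I/O
 * latencies. Constants are the RFC's: alpha = 1/8, beta = 1/4, K = 4.
 * A response is "overdue" once it exceeds SRTT + K * RTTVAR.
 */
#include <assert.h>
#include <math.h>

#define RTO_K     4.0
#define RTO_ALPHA (1.0 / 8.0)   /* gain for the smoothed RTT */
#define RTO_BETA  (1.0 / 4.0)   /* gain for the variance estimate */

struct rto_state {
    int    have_sample;
    double srtt;     /* smoothed response time */
    double rttvar;   /* smoothed mean deviation */
};

/* Fold in one measured response time R, per RFC 2988 section 2. */
void
rto_sample(struct rto_state *s, double r)
{
    if (!s->have_sample) {
        /* first measurement: SRTT = R, RTTVAR = R / 2 */
        s->srtt = r;
        s->rttvar = r / 2.0;
        s->have_sample = 1;
    } else {
        /* update RTTVAR first, using the old SRTT, as the RFC specifies */
        s->rttvar = (1.0 - RTO_BETA) * s->rttvar
                  + RTO_BETA * fabs(s->srtt - r);
        s->srtt = (1.0 - RTO_ALPHA) * s->srtt + RTO_ALPHA * r;
    }
}

/* Deadline after which an outstanding response counts as overdue. */
double
rto_deadline(const struct rto_state *s)
{
    return s->srtt + RTO_K * s->rttvar;
}
```

Because the deadline adapts per device, a local SATA disk and a remote iSCSI target would each converge on their own threshold, which is exactly the property the fixed-timeout proposals in this thread lack. (The RFC also specifies clamping the timeout to a floor of 1 second and backing off exponentially on retry; whether either makes sense for disk I/O is a separate tuning question.)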