Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-03 Thread Ross
Ok, I've done some more testing today and I almost don't know where to start.

I'll begin with the good news for Miles :)
- Rebooting doesn't appear to cause ZFS to lose the resilver status (but see 
1. below)
- Resilvering appears to work fine; once it completed, I never saw any checksum 
errors when scrubbing the pool.
- Reconnecting iscsi drives causes ZFS to automatically online the pool and 
begin resilvering.

And now the bad news:
1.  While rebooting doesn't seem to cause the resilver to lose its status, 
something's causing it problems.  I saw it restart several times.
2.  With iSCSI, you can't reboot with sendtargets discovery enabled; static 
discovery still seems to be the order of the day (see the example after this list).
3.  There appears to be a disconnect between what iscsiadm knows and what ZFS 
knows about the status of the devices.  

And I have confirmation of some of my earlier findings too:
4.  iSCSI still has a 3 minute timeout, during which time your pool will hang, 
no matter how many redundant drives you have available.
5.  zpool status can still hang when a device goes offline, and when it 
finally recovers, it will then report out-of-date information.  This could be 
Bug 6667199, but I've not seen anybody reporting the incorrect-information part 
of this.
6.  After one drive goes offline, during the resilver process, zpool status 
shows that information is being resilvered on the good drives.  Does anybody 
know why this happens?
7.  Although ZFS will automatically online a pool when iscsi devices come 
online, CIFS shares are not automatically remounted.
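
As an aside on point 2, dropping back to static discovery before a reboot is 
roughly the following (the target name and address here are just placeholders 
for your own):

# iscsiadm add static-config iqn.1986-03.com.sun:02:mytarget,192.168.0.10:3260
# iscsiadm modify discovery --static enable
# iscsiadm modify discovery --sendtargets disable

I'm not claiming that's the only way to do it, just what I've been working with.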

I also have a few extra notes about a couple of those:

1 - resilver losing status
==========================
Regarding the resilver restarting, I've seen it reported that zpool status 
can cause this when run as admin, but I'm not convinced that's the cause.  Same 
for the rebooting problem.  I was able to run zpool status dozens of times as 
an admin, but only two or three times did I see the resilver restart.

Also, after rebooting, I could see that the resilver was showing that it was 
66% complete, but then a second later it restarted.

Now, none of this is conclusive.  I really need to test with a much larger 
dataset to get an idea of what's really going on, but there's definitely 
something weird happening here.

3 - disconnect between iscsiadm and ZFS
=======================================
I repeated my test of offlining an iscsi target, this time checking iscsiadm to 
see when it disconnected. 

What I did was wait until iscsiadm reported 0 connections to the target, and 
then started a CIFS file copy and ran zpool status.

Zpool status hung as expected, and a minute or so later, the CIFS copy failed.  
It seems that although iscsiadm was aware that the target was offline, ZFS did 
not yet know about it.  As expected, a minute or so later, zpool status 
completed (returning incorrect results), and I could then run the CIFS copy 
fine.
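
For reference, the way I was watching the initiator's view of things was simply 
iscsiadm's target listing (output trimmed, and the target name / ISID here are 
placeholders):

# iscsiadm list target
Target: iqn.1986-03.com.sun:02:mytarget
        Alias: -
        TPGT: 1
        ISID: 4000002a0000
        Connections: 0

Once that showed 0 connections I kicked off the CIFS copy and zpool status.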

5 - zpool status hanging and reporting incorrect information
=============================================================
When an iSCSI device goes offline, if you immediately run zpool status, it 
hangs for 3-4 minutes.  Also, when it finally completes, it gives incorrect 
information, reporting all the devices as online.

If you immediately re-run zpool status, it completes rapidly and will now 
correctly show the offline devices.
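
If anyone wants to reproduce that window, nothing fancier than a loop like this 
in a spare terminal while you offline the target will show both the hang and the 
stale output:

# while true; do date; time zpool status -x; sleep 10; done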


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-03 Thread Maurice Volaski
2.  With iscsi, you can't reboot with sendtargets enabled, static 
discovery still seems to be the order of the day.

I'm seeing this problem with static discovery: 
http://bugs.opensolaris.org/view_bug.do?bug_id=6775008.

4.  iSCSI still has a 3 minute timeout, during which time your pool 
will hang, no matter how many redundant drives you have available.

This is CR 649, 
http://bugs.opensolaris.org/view_bug.do?bug_id=649, which is 
separate from the boot time timeout, though, and also one that Sun so 
far has been unable to fix!
-- 

Maurice Volaski, [EMAIL PROTECTED]
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-03 Thread Ross Smith
Yeah, thanks Maurice, I just saw that one this afternoon.  I guess you
can't reboot with iscsi full stop... o_0

And I've seen the iscsi bug before (I was just too lazy to look it up,
lol); I've been complaining about that since February.

In fact it's been a bad week for iscsi here, I've managed to crash the
iscsi client twice in the last couple of days too (full kernel dump
crashes), so I'll be filing a bug report on that tomorrow morning when
I get back to the office.

Ross


On Wed, Dec 3, 2008 at 7:39 PM, Maurice Volaski [EMAIL PROTECTED] wrote:
 2.  With iscsi, you can't reboot with sendtargets enabled, static
 discovery still seems to be the order of the day.

 I'm seeing this problem with static discovery:
 http://bugs.opensolaris.org/view_bug.do?bug_id=6775008.

 4.  iSCSI still has a 3 minute timeout, during which time your pool will
 hang, no matter how many redundant drives you have available.

 This is CR 649, http://bugs.opensolaris.org/view_bug.do?bug_id=649,
 which is separate from the boot time timeout, though, and also one that Sun
 so far has been unable to fix!
 --

 Maurice Volaski, [EMAIL PROTECTED]
 Computing Support, Rose F. Kennedy Center
 Albert Einstein College of Medicine of Yeshiva University



Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-03 Thread Miles Nordin
 r == Ross  [EMAIL PROTECTED] writes:

rs I don't think it likes it if the iscsi targets aren't
rs available during boot.

from my cheatsheet:
-----8<-----
ok boot -m milestone=none
[boots.  enter root password for maintenance.]
bash-3.00# /sbin/mount -o remount,rw /   [-- otherwise iscsiadm won't update /etc/iscsi/*]
bash-3.00# /sbin/mount /usr
bash-3.00# /sbin/mount /var
bash-3.00# /sbin/mount /tmp
bash-3.00# iscsiadm remove discovery-address 10.100.100.135
bash-3.00# iscsiadm remove discovery-address 10.100.100.138
bash-3.00# iscsiadm remove discovery-address 10.100.100.138
iscsiadm: unexpected OS error
iscsiadm: Unable to complete operation  [-- good.  it's gone.]
bash-3.00# sync
bash-3.00# lockfs -fa
bash-3.00# reboot
-----8<-----

rs # time zpool status 
[...]
rs real 3m51.774s

so, this hang may happen in fewer situations, but it is not fixed.

 r 6.  After one drive goes offline, during the resilver process,
 r zpool status shows that information is being resilvered on the
 r good drives.  Does anybody know why this happens?  

I don't know why.

I've seen that, too, though.  For me it's always been relatively
short, under 1min.  I wonder if there are three kinds of scrub-like things,
not just two (resilvers and scrubs), and 'zpool status' is
``simplifying'' for us again?

 r 7.  Although ZFS will automatically online a pool when iscsi
 r devices come online, CIFS shares are not automatically
 r remounted.

For me, even plain filesystems are not all remounted.  ZFS tries to
mount them in the wrong order, so it would mount /a/b/c, then try to
mount /a/b and complain ``directory not empty''.  I'm not sure why it
mounts things in the right order at boot/import, but in haphazard
order after one of these auto-onlines.  Then NFS exporting didn't work
either.

To fix, I have to 'zfs umount /a/b/c', but then there is a b/c
directory inside filesystem /a, so I have to 'rmdir /a/b/c' by hand
because the '... set mountpoint' koolaid creates the directories but
doesn't remove them.  Then 'zfs mount -a' and 'zfs share -a'.
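
So the recovery boils down to something like this by hand (same made-up /a/b/c
paths as above):

# zfs umount /a/b/c
# rmdir /a/b/c
# zfs mount -a
# zfs share -a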




Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-02 Thread Ross Smith
Hey folks,

I've just followed up on this, testing iSCSI with a raided pool, and
it still appears to be struggling when a device goes offline.

 I don't see how this could work except for mirrored pools.  Would that
 carry enough market to be worthwhile?
 -- richard


 I have to admit, I've not tested this with a raided pool, but since
 all ZFS commands hung when my iSCSI device went offline, I assumed
 that you would get the same effect of the pool hanging if a raid-z2
 pool is waiting for a response from a device.  Mirrored pools do work
 particularly well with this since it gives you the potential to have
 remote mirrors of your data, but if you had a raid-z2 pool, you still
 wouldn't want that hanging if a single device failed.


 zpool commands hanging is CR6667208, and has been fixed in b100.
 http://bugs.opensolaris.org/view_bug.do?bug_id=6667208

 I will go and test the raid scenario though on a current build, just to be
 sure.


 Please.
 -- richard


I've just created a pool using three snv_103 iSCSI targets, with a
fourth install of snv_103 collating those targets into a raidz pool,
and sharing that out over CIFS.
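
For anyone wanting to reproduce the setup, it was roughly along these lines.  I'm
assuming the old shareiscsi/iscsitgt target here, and the volume size, dataset
names, IQN and addresses below are placeholders rather than the exact ones I used:

On each of the three target boxes:
# zfs create -V 20g tank/iscsivol
# zfs set shareiscsi=on tank/iscsivol

On the fourth box (the initiator / CIFS server):
# iscsiadm add static-config iqn.1986-03.com.sun:02:target0,192.168.0.10:3260
# zpool create iscsipool raidz c2t...d0 c2t...d0 c2t...d0
# zfs create iscsipool/share
# zfs set sharesmb=on iscsipool/share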

To test the server, while transferring files from a Windows
workstation, I powered down one of the three iSCSI targets.  It took a
few minutes to shut down, but once that happened the Windows copy
halted with the error:
The specified network name is no longer available.

At this point, the zfs admin tools still work fine (which is a huge
improvement, well done!), but zpool status still reports that all
three devices are online.

A minute later, I can open the share again, and start another copy.

Thirty seconds after that, zpool status finally reports that the iscsi
device is offline.

So it looks like we have the same problems with that 3 minute delay,
with zpool status reporting wrong information, and the CIFS service
having problems too.

At this point I restarted the iSCSI target, but had problems bringing
it back online.  It appears there's a bug in the initiator, but it's
easily worked around:
http://www.opensolaris.org/jive/thread.jspa?messageID=312981#312981

What was great was that as soon as the iSCSI initiator reconnected,
ZFS started resilvering.

What might not be so great is the fact that all three devices are
showing that they've been resilvered:

# zpool status
  pool: iscsipool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 0h2m with 0 errors on Tue Dec  2 11:04:10 2008
config:

        NAME                                     STATE     READ WRITE CKSUM
        iscsipool                                ONLINE       0     0     0
          raidz1                                 ONLINE       0     0     0
            c2t600144F04933FF6C5056967AC800d0    ONLINE       0     0     0  179K resilvered
            c2t600144F04934FAB35056964D9500d0    ONLINE       5 9.88K     0  311M resilvered
            c2t600144F04934119E50569675FF00d0    ONLINE       0     0     0  179K resilvered

errors: No known data errors

It's proving a little hard to know exactly what's happening when,
since I've only got a few seconds to log times, and there are delays
with each step.  However, I ran another test using robocopy and was
able to observe the behaviour a little more closely:

Test 2:  Using robocopy for the transfer, and iostat plus zpool status
on the server

10:46:30 - iSCSI server shutdown started
10:52:20 - all drives still online according to zpool status
10:53:30 - robocopy error - The specified network name is no longer available
 - zpool status shows all three drives as online
 - zpool iostat appears to have hung, taking much longer than the 30s
specified to return a result
 - robocopy is now retrying the file, but appears to have hung
10:54:30 - robocopy, CIFS and iostat all start working again, pretty
much simultaneously
 - zpool status now shows the drive as offline

I could probably do with using DTrace to get a better look at this,
but I haven't learnt that yet (there's a rough starting point after the
list below).  My guess as to what's happening would be:

- iSCSI target goes offline
- ZFS will not be notified for 3 minutes, but I/O to that device is
essentially hung
- CIFS times out (I suspect this is on the client side with around a
30s timeout, but I can't find the timeout documented anywhere).
- zpool iostat is now waiting, I may be wrong but this doesn't appear
to have benefited from the changes to zpool status
- After 3 minutes, the iSCSI drive goes offline.  The pool carries on
with the remaining two drives, CIFS carries on working, iostat carries
on working.  zpool status however is still out of date.
- zpool status eventually catches up, and reports that the drive has
gone offline.
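
As a first stab at the DTrace idea above, something like the standard io
provider latency script (lifted from the io provider examples, and not yet
verified on this rig) should show the I/O latency distribution per device,
which ought to make the 3 minute hang on the dead target stand out:

# dtrace -n '
io:::start
{
        start[args[0]->b_edev, args[0]->b_blkno] = timestamp;
}
io:::done
/start[args[0]->b_edev, args[0]->b_blkno]/
{
        @time[args[1]->dev_statname] =
            quantize(timestamp - start[args[0]->b_edev, args[0]->b_blkno]);
        start[args[0]->b_edev, args[0]->b_blkno] = 0;
}'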

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-02 Thread Ross
Incidentally, while I've reported this again as an RFE, I still haven't seen a 
CR number for this.  Could somebody from Sun check if it's been filed, please?

thanks,

Ross


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-02 Thread Ross Smith
Hi Richard,

Thanks, I'll give that a try.  I think I just had a kernel dump while
trying to boot this system back up though; I don't think it likes it
if the iscsi targets aren't available during boot.  Again, that rings
a bell, so I'll go see if that's another known bug.

Changing that setting on the fly didn't seem to help; if anything,
things are worse this time around.  I changed the timeout to 15
seconds, but didn't restart any services:

# echo iscsi_rx_max_window/D | mdb -k
iscsi_rx_max_window:
iscsi_rx_max_window:180
# echo iscsi_rx_max_window/W0t15 | mdb -kw
iscsi_rx_max_window:0xb4=   0xf
# echo iscsi_rx_max_window/D | mdb -k
iscsi_rx_max_window:
iscsi_rx_max_window:15
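
If the tunable had helped, I assume making it stick across reboots would just be
a line in /etc/system along these lines (assuming the variable really does live
in the iscsi module; I haven't tested this, so treat it as a guess rather than a
recommendation):

set iscsi:iscsi_rx_max_window = 15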

After making those changes, and repeating the test, offlining an iscsi
volume hung all the commands running on the pool.  I had three ssh
sessions open, running the following:
# zpool iostat -v iscsipool 10 100
# format < /dev/null
# time zpool status

They hung for what felt like a minute or so.
After that, the CIFS copy timed out.

After the CIFS copy timed out, I tried immediately restarting it.  It
took a few more seconds, but restarted no problem.  Within a few
seconds of that restarting, iostat recovered, and format returned its
result too.

Around 30 seconds later, zpool status reported two drives, paused
again, then showed the status of the third:

# time zpool status
  pool: iscsipool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 0h0m with 0 errors on Tue Dec  2 16:39:21 2008
config:

        NAME                                     STATE     READ WRITE CKSUM
        iscsipool                                ONLINE       0     0     0
          raidz1                                 ONLINE       0     0     0
            c2t600144F04933FF6C5056967AC800d0    ONLINE       0     0     0  15K resilvered
            c2t600144F04934FAB35056964D9500d0    ONLINE       0     0     0  15K resilvered
            c2t600144F04934119E50569675FF00d0    ONLINE       0   200     0  24K resilvered

errors: No known data errors

real    3m51.774s
user    0m0.015s
sys     0m0.100s

Repeating that a few seconds later gives:

# time zpool status
  pool: iscsipool
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scrub: resilver completed after 0h0m with 0 errors on Tue Dec  2 16:39:21 2008
config:

        NAME                                     STATE     READ WRITE CKSUM
        iscsipool                                DEGRADED     0     0     0
          raidz1                                 DEGRADED     0     0     0
            c2t600144F04933FF6C5056967AC800d0    ONLINE       0     0     0  15K resilvered
            c2t600144F04934FAB35056964D9500d0    ONLINE       0     0     0  15K resilvered
            c2t600144F04934119E50569675FF00d0    UNAVAIL      3 5.80K     0  cannot open

errors: No known data errors

real    0m0.272s
user    0m0.029s
sys     0m0.169s




On Tue, Dec 2, 2008 at 3:58 PM, Richard Elling [EMAIL PROTECTED] wrote:

..

 iSCSI timeout is set to 180 seconds in the client code.  The only way
 to change is to recompile it, or use mdb.  Since you have this test rig
 setup, and I don't, do you want to experiment with this timeout?
 The variable is actually called iscsi_rx_max_window so if you do
   echo iscsi_rx_max_window/D | mdb -k
 you should see 180
 Change it using something like:
   echo iscsi_rx_max_window/W0t30 | mdb -kw
 to set it to 30 seconds.
 -- richard


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-02 Thread Miles Nordin
 rs == Ross Smith [EMAIL PROTECTED] writes:

rs 4. zpool status still reports out of date information.

I know people are going to skim this message and not hear this.
They'll say ``well of course zpool status says ONLINE while the pool
is hung.  ZFS is patiently waiting.  It doesn't know anything is
broken yet.''  but you are NOT saying it's out of date because it
doesn't say OFFLINE the instant you power down an iSCSI target.
You're saying:

rs - After 3 minutes, the iSCSI drive goes offline.
rs The pool carries on with the remaining two drives, CIFS
rs carries on working, iostat carries on working.  zpool status
rs however is still out of date.

rs - zpool status eventually
rs catches up, and reports that the drive has gone offline.

so, there is a ~30sec window when it's out of date.  When you say
``goes offline'' in the first bullet, you're saying ``ZFS must have
marked it offline internally, because the pool unfroze.''  but you
found that even after it ``goes offline'' 'zpool status' still 
reports it ONLINE.

The question is, what the hell is 'zpool status' reporting?  not the
status, apparently.  It's supposed to be a diagnosis tool.  Why should
you have to second-guess it and infer the position of ZFS's various
internal state machines through careful indirect observation, ``oops,
CIFS just came back,'' or ``oh something must have changed because
zpool iostat isn't hanging any more''?  Why not have a tool that TELLS
you plainly what's going on?  'zpool status' isn't.

Is it trying to oversimplify things, to condescend to the sysadmin or
hide ZFS's rough edges?  Are there more states for devices that are
being compressed down to ONLINE OFFLINE DEGRADED FAULTED?  Is there
some tool in zdb or mdb that is like 'zpool status -simonsez'?  I
already know sometimes it'll report everything as ONLINE but refuse
'zpool offline ... device' with 'no valid replicas', so I think, yes
there are ``secret states'' for devices?  Or is it trying to do too
many things with one output format?

rs 5. When iSCSI targets finally do come back online, ZFS is
rs resilvering all of them (again, this rings a bell, Miles might
rs have reported something similar).

my zpool status is so old it doesn't say ``xxkB resilvered'' so I've
no indication which devices are the source vs. target of the resilver.
What I found was, the auto-resilver isn't sufficient.  If you wait for
it to complete, then 'zpool scrub', you'll get thousands of CKSUM
errors on the dirty device, so the resilver isn't covering all the
dirtiness.  Also ZFS seems to forget about the need to resilver if you
shut down the machine, bring back the missing target, and boot---it
marks everything ONLINE and then resilvers as you hit the dirty data,
counting CKSUM errors.  This has likely been fixed between b71 and
b101.  It's easy to test: (a) shut down one iSCSI target, (b) write to
the pool, (c) bring the iSCSI target back, (d) wait for auto-resilver
to finish, (e) 'zpool scrub', (f) look for CKSUM errors.  I suspect
you're more worried about your own problems though---I'll try to
retest it soon.
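
If anyone beats me to it, the test is nothing fancier than something like this
(the SMF name assumes the old userland iscsitgt target, and the pool and file
names are made up):

on the target:  # svcadm disable -t svc:/system/iscsitgt:default
on the head:    # dd if=/dev/urandom of=/tank/dirtyfile bs=1024k count=1000
on the target:  # svcadm enable svc:/system/iscsitgt:default
on the head:    # zpool status tank        (wait for the resilver to finish)
                # zpool scrub tank
                # zpool status -v tank     (look for non-zero CKSUM counts)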




Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-02 Thread Ross
Hi Miles,

It's probably a bad sign that although that post came through as anonymous in 
my e-mail, I recognised your style before I got halfway through your post :)

I agree, the zpool status being out of date is weird; I'll dig out the bug 
number for that at some point, as I'm sure I've mentioned it before.  It looks 
to me like there are two separate pieces of code that work out the status of 
the pool.  There's the stuff ZFS uses internally to run the pool, and then 
there's a completely separate piece that does the reporting to the end user.

I agree that it could be a case of oversimplifying things.  There's no denying 
that ease of admin is one of ZFS' strengths, but I think the whole zpool status 
thing needs looking at again.  Neither the way the command freezes nor the out 
of date information makes any sense to me.

And yes, I'm aware of the problems you've reported with resilvering.  That's on 
my list of things to test with this.  I've already done a quick test of running 
a scrub after the resilver (which appeared ok at first glance), and tomorrow 
I'll be testing the reboot status too.


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-02 Thread Miles Nordin
 r == Ross  [EMAIL PROTECTED] writes:

 r style before I got half way through your post :) [...status
 r problems...] could be a case of oversimplifying things.

yeah I was a bit inappropriate, but my frustration comes from the
(partly paranoid) imagining of how the idea ``we need to make it
simple'' might have spooled out through a series of design meetings to
a culturally-insidious mind-blowing condescension toward the sysadmin.

``simple'', to me, means that a 'status' tool does not read things off
disks, and does not gather a bunch of scraps to fabricate a pretty
(``simple''?) fantasy-world at invocation which is torn down again
when it exits.  The Linux status tools are pretty-printing wrappers
around 'cat /proc/$THING/status'.  That, is SIMPLE!  And, screaming
monkeys though they often are, the college kids writing Linux are
generally disciplined enough not to grab a bunch of locks and then go
to sleep for minutes when delivering things from /proc.  I love that.
The other, broken, idea of ``simple'' is what I come to Unix to avoid.

And yes, this is a religious argument.  Just because it spans decades
of experience and includes ideas of style doesn't mean it should be
dismissed as hocus-pocus.  And I don't like all these binary config
files either.  Not even Mac OS X is pulling that baloney any more.

 r There's no denying the ease of admin is one of ZFS' strengths,

I deny it!  It is not simple to start up 'format' and 'zpool iostat'
and RoboCopy on another machine because you cannot trust the output of
the status command.  And getting visibility into something by starting
a bunch of commands in different windows and watching when which one
unfreezes is hilarious, not simple.

 r the problems you've reported with resilvering.

I think we were watching this bug:

 http://bugs.opensolaris.org/view_bug.do?bug_id=6675685

so that ought to be fixed in your test system but not in s10u6.  but
it might not be completely fixed yet:

 http://bugs.opensolaris.org/view_bug.do?bug_id=6747698





Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-02 Thread Toby Thain

On 2-Dec-08, at 3:35 PM, Miles Nordin wrote:

 r == Ross  [EMAIL PROTECTED] writes:

  r style before I got half way through your post :) [...status
  r problems...] could be a case of oversimplifying things.
 ...
 And yes, this is a religious argument.  Just because it spans decades
 of experience and includes ideas of style doesn't mean it should be
 dismissed as hocus-pocus.  And I don't like all these binary config
 files either.  Not even Mac OS X is pulling that baloney any more.

OS X never used binary config files; it standardised on XML property  
lists for the new subsystems (plus a lot of good old fashioned UNIX  
config).

Perhaps you are thinking of Mac OS 9 and earlier (resource forks).

--Toby


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-11-28 Thread Richard Elling
Ross Smith wrote:
 On Fri, Nov 28, 2008 at 5:05 AM, Richard Elling [EMAIL PROTECTED] wrote:
   
 Ross wrote:
 
 Well, you're not alone in wanting to use ZFS and iSCSI like that, and in
 fact my change request suggested that this is exactly one of the things that
 could be addressed:

 The idea is really a two stage RFE, since just the first part would have
 benefits.  The key is to improve ZFS availability, without affecting it's
 flexibility, bringing it on par with traditional raid controllers.

 A.  Track response times, allowing for lop sided mirrors, and better
 failure detection.
   
 I've never seen a study which shows, categorically, that disk or network
 failures are preceded by significant latency changes.  How do we get
 better failure detection from such measurements?
 

 Not preceded by as such, but a disk or network failure will certainly
 cause significant latency changes.  If the hardware is down, there's
 going to be a sudden, and very large change in latency.  Sure, FMA
 will catch most cases, but we've already shown that there are some
 cases where it doesn't work too well (and I would argue that's always
 going to be possible when you are relying on so many different types
 of driver).  This is there to ensure that ZFS can handle *all* cases.
   

I think that there is some confusion about FMA. The value of FMA is
diagnosis.  If there was no FMA, then driver timeouts would still exist.
Where FMA is useful is diagnosing the problem such that we know that
the fault is in the SAN and not the RAID array, for example.  From the
device driver level, all sd knows is that an I/O request to a device timed
out.  Similarly, all ZFS could know is what sd tells it.

  Many people have requested this since it would facilitate remote live
 mirrors.

   
 At a minimum, something like VxVM's preferred plex should be reasonably
 easy to implement.

 
 B.  Use response times to timeout devices, dropping them to an interim
 failure mode while waiting for the official result from the driver.  This
 would prevent redundant pools hanging when waiting for a single device.

   
 I don't see how this could work except for mirrored pools.  Would that
 carry enough market to be worthwhile?
 -- richard
 

 I have to admit, I've not tested this with a raided pool, but since
 all ZFS commands hung when my iSCSI device went offline, I assumed
 that you would get the same effect of the pool hanging if a raid-z2
 pool is waiting for a response from a device.  Mirrored pools do work
 particularly well with this since it gives you the potential to have
 remote mirrors of your data, but if you had a raid-z2 pool, you still
 wouldn't want that hanging if a single device failed.
   

zpool commands hanging is CR6667208, and has been fixed in b100.
http://bugs.opensolaris.org/view_bug.do?bug_id=6667208

 I will go and test the raid scenario though on a current build, just to be 
 sure.
   

Please.
 -- richard



Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-11-27 Thread James C. McPherson
On Thu, 27 Nov 2008 04:33:54 -0800 (PST)
Ross [EMAIL PROTECTED] wrote:

 Hmm...  I logged this CR ages ago, but now I've come to find it in
 the bug tracker I can't see it anywhere.
 
 I actually logged three CR's back to back, the first appears to have
 been created ok, but two have just disappeared.  The one I created ok
 is:  http://bugs.opensolaris.org/view_bug.do?bug_id=6766364
 
 There should be two other CR's created within a few minutes of that,
 one for disabling caching on CIFS shares, and one regarding this ZFS
 availability discussion.  Could somebody at Sun let me know what's
 happened to these please.

Hi Ross,
I can't find the ZFS one you mention. The CIFS one is 
http://bugs.opensolaris.org/view_bug.do?bug_id=6766126.
It's been marked as 'incomplete' so you should contact
the R.E. - Alan M. Wright (at sun dot com, etc) to find
out what further info is required.


hth,
James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp   http://www.jmcp.homeunix.com/blog


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-11-27 Thread Ross
Thanks James, I've e-mailed Alan and submitted this one again.


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-11-27 Thread Ross
Hmm...  I logged this CR ages ago, but now I've come to find it in the bug 
tracker I can't see it anywhere.

I actually logged three CR's back to back, the first appears to have been 
created ok, but two have just disappeared.  The one I created ok is:  
http://bugs.opensolaris.org/view_bug.do?bug_id=6766364

There should be two other CR's created within a few minutes of that, one for 
disabling caching on CIFS shares, and one regarding this ZFS availability 
discussion.  Could somebody at Sun let me know what's happened to these please.


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-11-27 Thread Bernard Dugas
Hello,

Thank you for this very interesting thread!

I want to confirm that synchronous distributed storage is a main goal when using 
ZFS!

The target architecture is 1 local drive, and 2 (or more) remote iSCSI targets, 
with ZFS being the iSCSI initiator.

The system is designed/sized so that the local disk can handle all the needed 
performance with a good margin, as can each of the iSCSI targets through 
large-enough Ethernet fibre links.

I need network problems not to slow down reads from the local disk, and writes 
to be stopped only if no remote target is available after a time-out.

I also left a comment on that subject at:
http://blogs.sun.com/roch/entry/using_zfs_as_a_network

To myxiplx: we call a "sleeping failure" a failure of one part that is hidden 
by redundancy but not detected by monitoring.  These are the most dangerous...

Would anybody be interested in supporting an open-source project seed called 
MiSCSI?  The idea is Multicast iSCSI, so that a single write from the initiator 
is propagated over the network to all subscribed targets, with dynamic 
subscription and resilvering delegated to the remote targets.  I would even 
prefer that this behaviour already existed in ZFS :-)

Please send me any comments if you are interested; I may write up a draft RFP...

Best regards!


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-11-27 Thread Ross
Well, you're not alone in wanting to use ZFS and iSCSI like that, and in fact 
my change request suggested that this is exactly one of the things that could 
be addressed:

The idea is really a two-stage RFE, since just the first part would have 
benefits.  The key is to improve ZFS availability, without affecting its 
flexibility, bringing it on par with traditional RAID controllers.

A.  Track response times, allowing for lop-sided mirrors and better failure 
detection.  Many people have requested this since it would facilitate remote 
live mirrors.

B.  Use response times to time out devices, dropping them to an interim failure 
mode while waiting for the official result from the driver.  This would prevent 
redundant pools hanging when waiting for a single device.

Unfortunately if your links tend to drop, you really need both parts.  However, 
if this does get added to ZFS, all you would then need is standard monitoring 
on the ZFS pool.  That would notify you when any device fails and the pool goes 
to a degraded state, making it easy to spot when either the remote mirrors or 
local storage are having problems.  I'd have thought it would make monitoring 
much simpler.
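
By standard monitoring I don't mean anything exotic; even a cron job running
something like

# zpool list -H -o name,health
tank    DEGRADED

and alerting on anything that isn't ONLINE would do the job.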

And if this were possible, I would hope that you could configure iSCSI devices 
to automatically reconnect and resilver too, so the system would be self 
repairing once faults are corrected, but I haven't gone so far as to test that 
yet.


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-11-27 Thread Bernard Dugas
 Well, you're not alone in wanting to use ZFS and
 iSCSI like that, and in fact my change request
 suggested that this is exactly one of the things that
 could be addressed:

Thank you! Yes, this was also to tell you that you are not alone :-)

I agree completely with you on your technical points!


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-11-27 Thread Richard Elling
Ross wrote:
 Well, you're not alone in wanting to use ZFS and iSCSI like that, and in fact 
 my change request suggested that this is exactly one of the things that could 
 be addressed:

 The idea is really a two stage RFE, since just the first part would have 
 benefits.  The key is to improve ZFS availability, without affecting it's 
 flexibility, bringing it on par with traditional raid controllers.

 A.  Track response times, allowing for lop sided mirrors, and better failure 
 detection. 

I've never seen a study which shows, categorically, that disk or network
failures are preceded by significant latency changes.  How do we get
better failure detection from such measurements?

  Many people have requested this since it would facilitate remote live 
 mirrors.
   

At a minimum, something like VxVM's preferred plex should be reasonably
easy to implement.

 B.  Use response times to timeout devices, dropping them to an interim 
 failure mode while waiting for the official result from the driver.  This 
 would prevent redundant pools hanging when waiting for a single device.
   

I don't see how this could work except for mirrored pools.  Would that
carry enough market to be worthwhile?
 -- richard

 Unfortunately if your links tend to drop, you really need both parts.  
 However, if this does get added to ZFS, all you would then need is standard 
 monitoring on the ZFS pool.  That would notify you when any device fails and 
 the pool goes to a degraded state, making it easy to spot when either the 
 remote mirrors or local storage are having problems.  I'd have thought it 
 would make monitoring much simpler.

 And if this were possible, I would hope that you could configure iSCSI 
 devices to automatically reconnect and resilver too, so the system would be 
 self repairing once faults are corrected, but I haven't gone so far as to 
 test that yet.
   



Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-11-27 Thread Ross Smith
On Fri, Nov 28, 2008 at 5:05 AM, Richard Elling [EMAIL PROTECTED] wrote:
 Ross wrote:

 Well, you're not alone in wanting to use ZFS and iSCSI like that, and in
 fact my change request suggested that this is exactly one of the things that
 could be addressed:

 The idea is really a two stage RFE, since just the first part would have
 benefits.  The key is to improve ZFS availability, without affecting it's
 flexibility, bringing it on par with traditional raid controllers.

 A.  Track response times, allowing for lop sided mirrors, and better
 failure detection.

 I've never seen a study which shows, categorically, that disk or network
 failures are preceded by significant latency changes.  How do we get
 better failure detection from such measurements?

Not preceded by as such, but a disk or network failure will certainly
cause significant latency changes.  If the hardware is down, there's
going to be a sudden, and very large change in latency.  Sure, FMA
will catch most cases, but we've already shown that there are some
cases where it doesn't work too well (and I would argue that's always
going to be possible when you are relying on so many different types
of driver).  This is there to ensure that ZFS can handle *all* cases.


  Many people have requested this since it would facilitate remote live
 mirrors.


 At a minimum, something like VxVM's preferred plex should be reasonably
 easy to implement.

 B.  Use response times to timeout devices, dropping them to an interim
 failure mode while waiting for the official result from the driver.  This
 would prevent redundant pools hanging when waiting for a single device.


 I don't see how this could work except for mirrored pools.  Would that
 carry enough market to be worthwhile?
 -- richard

I have to admit, I've not tested this with a raided pool, but since
all ZFS commands hung when my iSCSI device went offline, I assumed
that you would get the same effect of the pool hanging if a raid-z2
pool is waiting for a response from a device.  Mirrored pools do work
particularly well with this since it gives you the potential to have
remote mirrors of your data, but if you had a raid-z2 pool, you still
wouldn't want that hanging if a single device failed.

I will go and test the raid scenario though on a current build, just to be sure.


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-09-06 Thread Ross
Hey folks,

Well, there haven't been any more comments knocking holes in this idea, so I'm 
wondering now if I should log this as an RFE?  

Is this something others would find useful?

Ross


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-09-06 Thread Richard Elling
Ross wrote:
 Hey folks,

 Well, there haven't been any more comments knocking holes in this idea, so 
 I'm wondering now if I should log this as an RFE?  
   

go for it!

 Is this something others would find useful?
   


Yes.  But remember that this has a very limited scope.  Basically
it will apply to mirrors, not raidz.  Some people may find that to
be uninteresting.  Implementing something simple, like a preferred
side, would be an easy first step (a la VxVM's preferred plex).
 -- richard



Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-09-02 Thread Ross Smith

Thinking about it, we could make use of this too.  The ability to add a
remote iSCSI mirror to any pool without sacrificing local performance
could be a huge benefit.


 From: [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]
 CC: [EMAIL PROTECTED]; zfs-discuss@opensolaris.org
 Subject: Re: Availability: ZFS needs to handle disk removal / driver failure 
 better
 Date: Fri, 29 Aug 2008 09:15:41 +1200
 
 Eric Schrock writes:
  
  A better option would be to not use this to perform FMA diagnosis, but
  instead work into the mirror child selection code.  This has already
  been alluded to before, but it would be cool to keep track of latency
  over time, and use this to both a) prefer one drive over another when
  selecting the child and b) proactively timeout/ignore results from one
  child and select the other if it's taking longer than some historical
  standard deviation.  This keeps away from diagnosing drives as faulty,
  but does allow ZFS to make better choices and maintain response times.
  It shouldn't be hard to keep track of the average and/or standard
  deviation and use it for selection; proactively timing out the slow I/Os
  is much trickier. 
  
 This would be a good solution to the remote iSCSI mirror configuration.  
 I've been working though this situation with a client (we have been 
 comparing ZFS with Cleversafe) and we'd love to be able to get the read 
 performance of the local drives from such a pool. 
 
  As others have mentioned, things get more difficult with writes.  If I
  issue a write to both halves of a mirror, should I return when the first
  one completes, or when both complete?  One possibility is to expose this
  as a tunable, but any such best effort RAS is a little dicey because
  you have very little visibility into the state of the pool in this
  scenario - is my data protected? becomes a very difficult question to
  answer. 
  
 One solution (again, to be used with a remote mirror) is the three way 
 mirror.  If two devices are local and one remote, data is safe once the two 
 local writes return.  I guess the issue then changes from is my data safe 
 to how safe is my data.  I would be reluctant to deploy a remote mirror 
 device without local redundancy, so this probably won't be an uncommon 
 setup.  There would have to be an acceptable window of risk when local data 
 isn't replicated. 
 
 Ian

_
Make a mini you and download it into Windows Live Messenger
http://clk.atdmt.com/UKM/go/111354029/direct/01/___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-31 Thread Johan Hartzenberg
On Thu, Aug 28, 2008 at 11:21 PM, Ian Collins [EMAIL PROTECTED] wrote:

 Miles Nordin writes:

  suggested that unlike the SVM feature it should be automatic, because
  by so being it becomes useful as an availability tool rather than just
  performance optimisation.
 
 So on a server with a read workload, how would you know if the remote
 volume
 was working?


Even reads induce writes (last access time, if nothing else).

My question: If a pool becomes non-redundant (eg due to a timeout, hotplug
removal, bad data returned from device, or for whatever reason), do we want
the affected pool/vdev/system to hang?  Generally speaking I would say that
this is what currently happens with other solutions.

Conversely: can the current situation be improved by allowing a device to
be taken out of the pool for writes - e.g. be placed in read-only mode?  I
would assume it is possible to modify the CoW system / the functions which
allocate blocks for writes to ignore certain devices, at least
temporarily.

This would also lay the groundwork for allowing devices to be removed from a
pool - e.g.: step 1: make the device read-only; step 2: touch every allocated
block on that device (causing it to be copied to some other disk); step 3:
remove it from the pool for reads as well, and finally remove it from the
pool permanently.

  _hartz


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-31 Thread Richard Elling
Ross Smith wrote:
 Triple mirroring you say?  That'd be me then :D

 The reason I really want to get ZFS timeouts sorted is that our long 
 term goal is to mirror that over two servers too, giving us a pool 
 mirrored across two servers, each of which is actually a zfs iscsi 
 volume hosted on triply mirrored disks.

 Oh, and we'll have two sets of online off-site backups running 
 raid-z2, plus a set of off-line backups too.

 All in all I'm pretty happy with the integrity of the data, wouldn't 
 want to use anything other than ZFS for that now.  I'd just like to 
 get the availability working a bit better, without having to go back 
 to buying raid controllers.  We have big plans for that too; once we 
 get the iSCSI / iSER timeout issue sorted our long term availability 
 goals are to have the setup I mentioned above hosted out from a pair 
 of clustered Solaris NFS / CIFS servers.

 Failover time on the cluster is currently in the order of 5-10 
 seconds; if I can get the detection of a bad iSCSI link down under 2 
 seconds, we'll essentially have a worst case scenario of under 15 seconds 
 downtime.

I don't think this is possible for a stable system.  2-second failure detection
for IP networks is troublesome for a wide variety of reasons.  Even with
Solaris Clusters, we can show consistent failover times for NFS services on
the order of a minute (2-3 client retry intervals, including backoff).  But
getting to consistent sub-minute failover for a service like NFS might be a
bridge too far, given the current technology and the amount of customization
required to make it work™.

 Downtime that low means it's effectively transparent for our users as 
 all of our applications can cope with that seamlessly, and I'd really 
 love to be able to do that this calendar year.

I think most people (traders are a notable exception) and applications can
deal with larger recovery times, as long as human intervention is not
required.
 -- richard



Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-30 Thread Ross
Wow, some great comments on here now, even a few people agreeing with me which 
is nice :D

I'll happily admit I don't have the in-depth understanding of storage many of 
you guys have, but since the idea doesn't seem pie-in-the-sky crazy, I'm going 
to try to write up all my current thoughts on how this could work after reading 
through all the replies.

1. Track disk response times
- ZFS should track the average response time of each disk.
- This should be used internally for performance tweaking, so faster disks are 
favoured for reads.  This works particularly well for lop-sided mirrors.
- I'd like to see this information (and the number of timeouts) in the output 
of zpool status, so administrators can see if any one device is performing 
badly.
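
Purely as a mock-up of what I mean (this is not existing output, just an 
illustration of the sort of extra columns I'd like to see), zpool status might 
grow something like:

        NAME        STATE     READ WRITE CKSUM   AVG-SVC  TIMEOUTS
        tank        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1t0d0  ONLINE       0     0     0     4.1ms         0
            c2t5d0  ONLINE       0     0     0   180.0ms         3

so a slow or flaky device would stand out at a glance.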

2. New parameters
- ZFS should gain two new parameters.
  - A timeout value for the pool.
  - An option to enable that timeout for writes too (off by default).
- Still to be decided is whether that timeout is set manually, or automatically 
based on the information gathered in 1.
- Do we need a pool timeout (based on the timeout of the slowest device in the 
pool), or will individual device timeouts work better?
- I've decided that having this off by default for writes is probably better 
for ZFS.  It addresses some people's concerns about writing to a degraded pool, 
and puts data integrity ahead of availability, which seems to fit better with 
ZFS' goals.  I'd still like it myself for writes; I can live with a pool 
running degraded for 2 minutes while the problem is diagnosed.
- With that said, could the write timeout default to on when you have a slog 
device?  After all, the data is safely committed to the slog, and should remain 
there until it's written to all devices.  Bob, you seemed the most concerned 
about writes; would that be enough redundancy for you to be happy to have this 
on by default?  If not, I'd still be ok having it off by default; we could 
maybe just include it in the evil tuning guide, suggesting that this could be 
turned on by anybody who has a separate slog device.
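
Just to make the proposal concrete, I'm imagining syntax along these lines 
(purely hypothetical; neither property exists today):

# zpool set devicetimeout=5 tank
# zpool set writetimeout=on tank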

3. How it would work
- If a read times out for any device, ZFS should immediately issue reads to all 
other devices holding that data.  The first response back will be used. 
- Timeouts should be logged so the information can be used by administrators or 
FMA to help diagnose failing drives, but they should not count as a device 
failure on their own.
- Some thought is needed as to how this algorithm works on busy pools.  When 
reads are queuing up, we need to avoid false positives and avoid adding extra 
load on the pool.  Would it be a possibility that instead of checking the 
response time for an individual request, this timeout is used to check if no 
responses at all have been received from a device for that length of time?  
That still sounds reasonable for finding stuck devices, and should still work 
reliably on a busy pool.
- For reads, the pool does not need to go degraded; the device is simply 
flagged as WAITING.
- When enabled for writes, these will be going to all devices, so there are no 
alternate devices to try.  This means any write timeout will be used to put the 
pool into a degraded mode.  This should be considered a temporary state with 
the drive in WAITING status, as while the pool itself is degraded (due to 
missing the writes for that drive), the drive is not yet offline.  At this 
point the system is simply keeping itself running while waiting for a proper 
error response from either the drive or from FMA.  If the drive eventually 
returns the missing response, it can be resilvered with any data it missed.  If 
the drive doesn't return a response, FMA should eventually fault it, and the 
drive can be taken offline and replaced with a hot spare.  At all times the 
administrator can see what it going on using zpool status, with the appropriate 
pool and drive status visible.
- Please bear in mind that although I'm using the word 'degraded' above, this 
is not necessarily the case for dual-parity pools; I just don't know the proper 
term to use for a dual-parity raid set where a single drive has failed.
- If this is just a one off glitch and the device comes back online, the 
resilver shouldn't take long as ZFS just needs to send the data that was missed 
(which will still be stored in the ZIL).
- If many devices timeout at once due to a bad controller, cable pulled, power 
failure, etc, all the affected devices will be flagged as WAITING and if too 
many have gone for the pool to stay operational, ZFS should switch the entire 
pool to the 'wait' state while it waits for FMA, etc to return a proper 
response, after which it should react according to the proper failmode property 
for the pool.

4. Food for thought
- While I like nico's idea for lop sided mirrors, I'm not sure any tweaking is 
needed.  I was thinking about whether these timeouts could improve performance 
for such a mirror, but I think a better option there is simply to use 

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-30 Thread Bob Friesenhahn
On Sat, 30 Aug 2008, Ross wrote:
 while the problem is diagnosed. - With that said, could the write 
 timeout default to on when you have a slog device?  After all, the 
 data is safely committed to the slog, and should remain there until 
 it's written to all devices.  Bob, you seemed the most concerned 
 about writes, would that be enough redundancy for you to be happy to 
 have this on by default?  If not, I'd still be ok having it off by 
 default, we could maybe just include it in the evil tuning guide 
 suggesting that this could be turned on by anybody who has a 
 separate slog device.

It is my impression that the slog device is only used for synchronous 
writes.  Depending on the system, this could be just a small fraction 
of the writes.

In my opinion, ZFS's primary goal is to avoid data loss, or 
consumption of wrong data.  Availability is a lesser goal.

If someone really needs maximum availability then they can go to 
triple mirroring or some other maximally redundant scheme.  ZFS should 
do its best to continue moving forward as long as some level of 
redundancy exists.  There could be an option to allow moving forward 
with no redundancy at all.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/



Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-30 Thread Ross Smith

Triple mirroring you say?  That'd be me then :D

The reason I really want to get ZFS timeouts sorted is that our long term goal 
is to mirror that over two servers too, giving us a pool mirrored across two 
servers, each of which is actually a zfs iscsi volume hosted on triply mirrored 
disks.

Oh, and we'll have two sets of online off-site backups running raid-z2, plus a 
set of off-line backups too.

All in all I'm pretty happy with the integrity of the data, wouldn't want to 
use anything other than ZFS for that now.  I'd just like to get the 
availability working a bit better, without having to go back to buying raid 
controllers.  We have big plans for that too; once we get the iSCSI / iSER 
timeout issue sorted our long term availability goals are to have the setup I 
mentioned above hosted out from a pair of clustered Solaris NFS / CIFS servers.

Failover time on the cluster is currently in the order of 5-10 seconds; if I 
can get the detection of a bad iSCSI link down under 2 seconds, we'll 
essentially have a worst case scenario of under 15 seconds downtime.  Downtime that 
low means it's effectively transparent for our users, as all of our applications 
can cope with that seamlessly, and I'd really love to be able to do that this 
calendar year.

Anyway, getting back on topic, it's a good point about moving forward while 
redundancy exists.  I think the flag for specifying the write behavior should 
have that as the default, with the optional setting being to allow the pool to 
continue accepting writes while the pool is in a non-redundant state.

Ross

 Date: Sat, 30 Aug 2008 10:59:19 -0500
 From: [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]
 CC: zfs-discuss@opensolaris.org
 Subject: Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / 
 driver failure better
 
 On Sat, 30 Aug 2008, Ross wrote:
  while the problem is diagnosed. - With that said, could the write 
  timeout default to on when you have a slog device?  After all, the 
  data is safely committed to the slog, and should remain there until 
  it's written to all devices.  Bob, you seemed the most concerned 
  about writes, would that be enough redundancy for you to be happy to 
  have this on by default?  If not, I'd still be ok having it off by 
  default, we could maybe just include it in the evil tuning guide 
  suggesting that this could be turned on by anybody who has a 
  separate slog device.
 
 It is my impression that the slog device is only used for synchronous 
 writes.  Depending on the system, this could be just a small fraction 
 of the writes.
 
 In my opinion, ZFS's primary goal is to avoid data loss, or 
 consumption of wrong data.  Availability is a lesser goal.
 
 If someone really needs maximum availability then they can go to 
 triple mirroring or some other maximally redundant scheme.  ZFS should 
 to its best to continue moving forward as long as some level of 
 redundancy exists.  There could be an option to allow moving forward 
 with no redundancy at all.
 
 Bob
 ==
 Bob Friesenhahn
 [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-30 Thread Ian Collins
Eric Schrock writes:
 
 A better option would be to not use this to perform FMA diagnosis, but
 instead work into the mirror child selection code.  This has already
 been alluded to before, but it would be cool to keep track of latency
 over time, and use this to both a) prefer one drive over another when
 selecting the child and b) proactively timeout/ignore results from one
 child and select the other if it's taking longer than some historical
 standard deviation.  This keeps away from diagnosing drives as faulty,
 but does allow ZFS to make better choices and maintain response times.
 It shouldn't be hard to keep track of the average and/or standard
 deviation and use it for selection; proactively timing out the slow I/Os
 is much trickier. 
 
This would be a good solution for the remote iSCSI mirror configuration.  
I've been working through this situation with a client (we have been 
comparing ZFS with Cleversafe) and we'd love to be able to get the read 
performance of the local drives from such a pool. 

 As others have mentioned, things get more difficult with writes.  If I
 issue a write to both halves of a mirror, should I return when the first
 one completes, or when both complete?  One possibility is to expose this
 as a tunable, but any such "best effort" RAS is a little dicey because
 you have very little visibility into the state of the pool in this
 scenario - "is my data protected?" becomes a very difficult question to
 answer. 
 
One solution (again, to be used with a remote mirror) is the three way 
mirror.  If two devices are local and one remote, data is safe once the two 
local writes return.  I guess the issue then changes from "is my data safe" 
to "how safe is my data".  I would be reluctant to deploy a remote mirror 
device without local redundancy, so this probably won't be an uncommon 
setup.  There would have to be an acceptable window of risk when local data 
isn't replicated. 

Ian
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-30 Thread Ian Collins
Miles Nordin writes: 

 bf == Bob Friesenhahn [EMAIL PROTECTED] writes:
 
 bf You are saying that I can't split my mirrors between a local
 bf disk in Dallas and a remote disk in New York accessed via
 bf iSCSI? 
 
 nope, you've misread.  I'm saying reads should go to the local disk
 only, and writes should go to both.  See SVM's 'metaparam -r'.  I
 suggested that unlike the SVM feature it should be automatic, because
 by so being it becomes useful as an availability tool rather than just
 performance optimisation. 
 
So on a server with a read workload, how would you know if the remote volume 
was working? 

Ian 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-29 Thread Nicolas Williams
On Thu, Aug 28, 2008 at 11:29:21AM -0500, Bob Friesenhahn wrote:
 Which of these do you prefer?
 
o System waits substantial time for devices to (possibly) recover in
  order to ensure that subsequently written data has the least
  chance of being lost.
 
o System immediately ignores slow devices and switches to
  non-redundant non-fail-safe non-fault-tolerant may-lose-your-data
  mode.  When system is under intense load, it automatically
  switches to the may-lose-your-data mode.

Given how long a resilver might take, waiting some time for a device to
come back makes sense.  Also, if a cable was taken out, or drive tray
powered off, then you'll see lots of drives timing out, and then the
better thing to do is to wait (heuristic: not enough spares to recover).

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-29 Thread Nicolas Williams
On Thu, Aug 28, 2008 at 01:05:54PM -0700, Eric Schrock wrote:
 As others have mentioned, things get more difficult with writes.  If I
 issue a write to both halves of a mirror, should I return when the first
 one completes, or when both complete?  One possibility is to expose this
 as a tunable, but any such "best effort" RAS is a little dicey because
 you have very little visibility into the state of the pool in this
 scenario - "is my data protected?" becomes a very difficult question to
 answer.

Depending on the amount of redundancy left one might want the writes to
continue.  E.g., a 3-way mirror with one vdev timing out or going extra
slow, or Richard's lopsided mirror example.

The value of "best effort" RAS might make a useful property for mirrors
and RAIDZ-2.  If because of some slow vdev you've got less redundancy
for recent writes, but still have enough (for some value of "enough"),
and still have full redundancy for older writes, well, that's not so
bad.

Something like:

% # require successful writes to at least two mirrors and wait no more
% # than 15 seconds for the 3rd.
% zpool create mypool mirror ... mirror ... mirror ...
% zpool set minimum_redundancy=1 mypool
% zpool set vdev_write_wait=15s mypool

and for known-to-be-lopsided mirrors:

% # require successful writes to at least two mirrors and don't wait for
% # the slow vdevs
% zpool create mypool mirror ... mirror ... mirror -slow ...
% zpool set minimum_redundancy=1 mypool
% zpool set vdev_write_wait=0s mypool

?
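
To make the semantics of those two made-up properties concrete, here's a
rough sketch in Python of how a mirror write path could honour them.  The
property names, the per-child write() call, and the thread-pool plumbing are
purely illustrative assumptions -- nothing like this exists in ZFS today:

import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def mirror_write(children, block, minimum_redundancy=1, vdev_write_wait=15.0):
    # Return once minimum_redundancy + 1 copies are on stable storage, then
    # wait at most vdev_write_wait seconds for the stragglers.
    pool = ThreadPoolExecutor(max_workers=len(children))
    futures = {pool.submit(child.write, block): child for child in children}
    pending = set(futures)
    done = set()
    deadline = None
    while pending:
        timeout = None if deadline is None else max(0.0, deadline - time.time())
        finished, pending = wait(pending, timeout=timeout,
                                 return_when=FIRST_COMPLETED)
        done |= finished
        if deadline is None and len(done) >= minimum_redundancy + 1:
            deadline = time.time() + vdev_write_wait   # enough copies are safe
        elif deadline is not None and time.time() >= deadline:
            break                                      # stop waiting for stragglers
    pool.shutdown(wait=False)
    # Children still pending would be handed to FMA rather than block the write.
    return [futures[f] for f in pending]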

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-29 Thread Richard Elling
Nicolas Williams wrote:
 On Thu, Aug 28, 2008 at 11:29:21AM -0500, Bob Friesenhahn wrote:
   
 Which of these do you prefer?

o System waits substantial time for devices to (possibly) recover in
  order to ensure that subsequently written data has the least
  chance of being lost.

o System immediately ignores slow devices and switches to
  non-redundant non-fail-safe non-fault-tolerant may-lose-your-data
  mode.  When system is under intense load, it automatically
  switches to the may-lose-your-data mode.
 

 Given how long a resilver might take, waiting some time for a device to
 come back makes sense.  Also, if a cable was taken out, or drive tray
 powered off, then you'll see lots of drives timing out, and then the
 better thing to do is to wait (heuristic: not enough spares to recover).

   

argv!  I didn't even consider switches.  Ethernet switches often use
spanning-tree algorithms to converge on the topology.  I'm not sure
what SAN switches use.  We have the following problem with highly
available clusters which use switches in the interconnect:
   + Solaris Cluster interconnect timeout defaults to 10 seconds
   + STP can take more than 30 seconds to converge
So, if you use Ethernet switches in the interconnect, you need to
disable STP on the ports used for interconnects or risk unnecessary
cluster reconfigurations.  Normally, this isn't a problem as the people
who tend to build HA clusters also tend to read the docs which point
this out.  Still, a few slip through every few months. As usual, Solaris
Cluster gets blamed, though it really is a systems engineering problem.
Can we expect a similar attention to detail for ZFS implementers?
I'm afraid not :-(. 

I'm not confident we can be successful with sub-minute reconfiguration,
so B_FAILFAST may be the best we can do for the general case.
That isn't so bad; in fact we use failfasts rather extensively for Solaris
Clusters, too.
 -- richard


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-29 Thread Miles Nordin
 es == Eric Schrock [EMAIL PROTECTED] writes:

es The main problem with exposing tunables like this is that they
es have a direct correlation to service actions, and
es mis-diagnosing failures costs everybody (admin, companies,
es Sun, etc) lots of time and money.  Once you expose such a
es tunable, it will be impossible to trust any FMA diagnosis,

Yeah, I tend to agree that the constants shouldn't be tunable, because
I hoped Sun would become a disciplined collection-point for experience
to set the constants, discipline meaning the constants are only
adjusted in response to bad diagnosis, not ``preference,'' and in a
direction that improves diagnosis for everyone, not for ``the site''.

I'm not yet won over to the idea that statistical FMA diagnosis
constants shouldn't exist.  I think drives can't diagnose themselves
for shit, and I think drivers these days are diagnosees, not
diagnosers.  But clearly a confusingly-bad diagnosis is much worse
than diagnosis that's bad in a simple way.

es If I issue a write to both halves of a mirror, should
es I return when the first one completes, or when both complete?

well, if it's not a synchronous write, you return before you've
written either half of the mirror, so it's only an issue for
O_SYNC/ZIL writes, true?

BTW what does ZFS do right now for synchronous writes to mirrors, wait
for all, wait for two, or wait for one?

es any such best effort RAS is a little dicey because you have
es very little visibility into the state of the pool in this
es scenario - is my data protected? becomes a very difficult
es question to answer.

I think it's already difficult.  For example, a pool will say ONLINE
while it's resilvering, won't it?  I might be wrong.  

Take a pool that can only tolerate one failure.  Is the difference
between replacing an ONLINE device (still redundant) and replacing an
OFFLINE device (not redundant until resilvered) captured?  Likewise,
should a pool with a spare in use really be marked DEGRADED both
before the spare resilvers and after?

The answers to the questions aren't important so much as that you have
to think about the answers---what should they be, what are they
now---which means ``is my data protected?'' is already a difficult
question to answer.  

Also there were recently fixed bugs with DTL.  The status of each
device's DTL, even the existence and purpose of the DTL, isn't
well-exposed to the admin, and is relevant to answering the ``is my
data protected?''  question---indirect means of inspecting it like
tracking the status of resilvering seem too wallpapered given that the
bug escaped notice for so long.

I agree with the problem 100% and don't wish to worsen it, just
disagree that it's a new one.

re 3 orders of magnitude range for magnetic disk I/Os, 4 orders
re of magnitude for power managed disks.

For power management I would argue for a fixed timeout.  The time to spin
up doesn't have anything to do with the io/s you got before the disk
spun down.  There's no reason to disguise the constant for which we
secretly wish inside some fancy math for deriving it just because
writing down constants feels bad.

unless you _know_ the disk is spinning up through some in-band means,
and want to compare its spinup time to recorded measurements of past
spinups.


This is a good case for pointing out there are two sets of rules:

 * 'metaparam -r' rules

   + not invoked at all if there's no redundancy.

   + very complicated

 - involve sets of disks, not one disk.  comparison of statistic
   among disks within a vdev (definitely), and comparison of
   individual disks to themselves over time (possibly).

 - complicated output: rules return a set of disks per vdev, not a
   yay-or-nay diagnosis per disk.  And there are two kinds of
   output decision:

   o for n-way mirrors, select anywhere from 1 to n disks.  for
 example, a three-way mirror with two fast local mirrors, one
 slow remote iSCSI mirror, should split reads among the two
 local disks.

 for raidz and raidz2 they can eliminate 0, 1 (or 2) disks
 from the read-us set.  It's possible to issue all the reads
 and take the first sufficient set to return as Anton
 suggested, but I imagine 4-device raidz2 vdevs will be common
 which could some day perform as well as a 2-device mirror.

   o also, decide when to stop waiting on an existing read and
 re-issue it.  so the decision is not only about future reads,
 but has to cancel already-issued reads, possibly replacing
 the B_FAILFAST mechanism so there will be a second
 uncancellable round of reads once the first round exhausts
 all redundancy.

   o that second decision needs to be made thousands of times per
 second without a lot of CPU overhead

   + small consequence if the rules deliver false-positives, just
 reduced performance 

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-29 Thread Miles Nordin
 re == Richard Elling [EMAIL PROTECTED] writes:

re if you use Ethernet switches in the interconnect, you need to
re disable STP on the ports used for interconnects or risk
re unnecessary cluster reconfigurations.

RSTP/802.1w plus setting the ports connected to Solaris as ``edge'' is
good enough, less risky for the WAN, and pretty ubiquitously supported
with non-EOL switches.  The network guys will know this (assuming you
have network guys) and do something like this:

sw: can you disable STP for me?

net: No?

sw: jumping up and down screaming

net: um,...i mean, Why?

sw: []

net: oh, that.  Ok, try it now.

sw: thanks for disabling STP for me.

net: i uh,.. whatever.  No problem!

re Can we expect a similar attention to detail for ZFS
re implementers?  I'm afraid not :-(.

well, you weren't really ``expecting'' it of the sun cluster
implementers.  You just ran into it by surprise in the form of an
Issue.  so, can you expect ZFS implementers to accept that running
ZFS, iSCSI, FC-SW might teach them something about their LAN/SAN they
didn't already know?  So far they seem receptive to arcane advice like
``make this config change in your SAN controller to let it use the
NVRAM cache more aggressively, and stop using EMC PowerPath unless
blah.''  so, Yes?

I think you can also expect them to wait longer than 40 seconds before
declaring a system is frozen and rebooting it, though.

``Let's `patiently wait' forever because we think, based on our
uncertainty, that FSPF might take several hours to converge'' is the
alternative that strikes me as unreasonable.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-29 Thread Richard Elling
Miles Nordin wrote:
 re == Richard Elling [EMAIL PROTECTED] writes:
 

 re if you use Ethernet switches in the interconnect, you need to
 re disable STP on the ports used for interconnects or risk
 re unnecessary cluster reconfigurations.

 RSTP/802.1w plus setting the ports connected to Solaris as ``edge'' is
 good enough, less risky for the WAN, and pretty ubiquitously supported
 with non-EOL switches.  The network guys will know this (assuming you
 have network guys) and do something like this:

 sw: can you disable STP for me?

 net: No?

 sw: jumping up and down screaming

 net: um,...i mean, Why?

 sw: []

 net: oh, that.  Ok, try it now.

 sw: thanks for disabling STP for me.

 net: i uh,.. whatever.  No problem!
   

Precisely, this is not a problem that is usually solved unilaterally.

 re Can we expect a similar attention to detail for ZFS
 re implementers?  I'm afraid not :-(.

 well, you weren't really ``expecting'' it of the sun cluster
 implementers.  You just ran into it by surprise in the form of an
 Issue.  

Rather, cluster implementers tend to RTFM. I know few ZFSers who
have RTFM, and do not expect many to do so... such is life.

 so, can you expect ZFS implementers to accept that running
 ZFS, iSCSI, FC-SW might teach them something about their LAN/SAN they
 didn't already know?  

No, I expect them to see a problem caused by network reconfiguration
and blame ZFS.  Indeed, this is what occasionally happens with Solaris
Cluster -- but only occasionally, and it is solved via RTFM.

 So far they seem receptive to arcane advice like
 ``make this config change in your SAN controller to let it use the
 NVRAM cache more aggressively, and stop using EMC PowerPath unless
 blah.''  so, Yes?
   

I have no idea what you are trying to say here.

 I think you can also expect them to wait longer than 40 seconds before
 declaring a system is frozen and rebooting it, though.
   

Current [s]sd driver timeouts are 60 seconds with 3-5 retries by default.
We've had those timeouts for many, many years now and do provide highly
available services on such systems.  The B_FAILFAST change did improve
the availability of systems and similar tricks have improved service 
availability
for Solaris Clusters.  Refer to Eric's post for more details of this 
minefield.

NB some bugids one should research before filing new bugs here are:
CR 4713686: sd/ssd driver should have an additional target specific timeout
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=4713686
CR 4500536 introduces B_FAILFAST
http://bugs.opensolaris.org/view_bug.do?bug_id=4500536

 ``Let's `patiently wait' forever because we think, based on our
 uncertainty, that FSPF might take several hours to converge'' is the
 alternative that strikes me as unreasonable.
   

AFAICT, nobody is making such a proposal.  Did I miss a post?
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Ross
Since somebody else has just posted about their entire system locking up when 
pulling a drive, I thought I'd raise this for discussion.

I think Ralf made a very good point in the other thread.  ZFS can guarantee 
data integrity; what it can't do is guarantee data availability.  The problem 
is, the way ZFS is marketed, people expect it to be able to do just that.

This turned into a longer thread than expected, so I'll start with what I'm 
asking for, and then attempt to explain my thinking.  I'm essentially asking 
for two features to improve the availability of ZFS pools:

- Isolation of storage drivers so that buggy drivers do not bring down the OS.

- ZFS timeouts to improve pool availability when no timely response is received 
from storage drivers.

And my reasons for asking for these are that there are now many, many posts on 
here about people experiencing either total system lockup or ZFS lockup after 
removing a hot swap drive, and indeed while some of them are using consumer 
hardware, others have reported problems with server grade kit that definitely 
should be able to handle these errors:

Aug 2008:  AMD SB600 - System hang
 - http://www.opensolaris.org/jive/thread.jspa?threadID=70349
Aug 2008:  Supermicro SAT2-MV8 - System hang
 - http://www.opensolaris.org/jive/thread.jspa?messageID=271218
May 2008: Sun hardware - ZFS hang
 - http://opensolaris.org/jive/thread.jspa?messageID=240481
Feb 2008:  iSCSI - ZFS hang
 - http://www.opensolaris.org/jive/thread.jspa?messageID=206985
Oct 2007:  Supermicro SAT2-MV8 - system hang
 - http://www.opensolaris.org/jive/thread.jspa?messageID=166037
Sept 2007:  Fibre channel
 - http://opensolaris.org/jive/thread.jspa?messageID=151719
... etc

Now while the root cause of each of these may be slightly different, I feel it 
would still be good to address this if possible as it's going to affect the 
perception of ZFS as a reliable system.

The common factor in all of these is that either the solaris driver hangs and 
locks the OS, or ZFS hangs and locks the pool.  Most of these are for hardware 
that should handle these failures fine (mine occurred for hardware that 
definitely works fine under windows), so I'm wondering:  Is there anything that 
can be done to prevent either type of lockup in these situations?

Firstly, for the OS, if a storage component (hardware or driver) fails for a 
non-essential part of the system, the entire OS should not hang.  I appreciate 
there isn't a lot you can do if the OS is using the same driver as its 
storage, but certainly in some of the cases above, the OS and the data are 
using different drivers, and I expect more examples of that could be found with 
a bit of work.  Is there any way storage drivers could be isolated such that 
the OS (and hence ZFS) can report a problem with that particular driver without 
hanging the entire system?

Please note:  I know work is being done on FMA to handle all kinds of bugs, I'm 
not talking about that.  It seems to me that FMA involves proper detection and 
reporting of bugs, which involves knowing in advance what the problems are and 
how to report them.  What I'm looking for is something much simpler, something 
that's able to keep the OS running when it encounters unexpected or unhandled 
behaviour from storage drivers or hardware.

It seems to me that one of the benefits of ZFS is working against it here.  
It's such a flexible system it's being used for many, many types of devices, 
and that means there are a whole host of drivers being used, and a lot of scope 
for bugs in those drivers.  I know that ultimately any driver issues will need 
to be sorted individually, but what I'm wondering is whether there's any 
possibility of putting some error checking code at a layer above the drivers in 
such a way it's able to trap major problems without hanging the OS?  ie: update 
ZFS/Solaris so they can handle storage layer bugs gracefully without downing 
the entire system.

My second suggestion is to ask if ZFS can be made to handle unexpected events 
more gracefully.  In the past I've suggested that ZFS have a separate timeout 
so that a redundant pool can continue working even if one device is not 
responding, and I really think that would be worthwhile.  My idea is to have a 
WAITING status flag for drives, so that if one isn't responding quickly, ZFS 
can flag it as WAITING, and attempt to read or write the same data from 
elsewhere in the pool.  That would work alongside the existing failure modes, 
and would allow ZFS to handle hung drivers much more smoothly, preventing 
redundant pools hanging when a single drive fails.
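
To make the WAITING idea a little more concrete, here's a rough sketch of the
bookkeeping I have in mind (Python, purely illustrative -- the two second
figure, the state names and the poll hook are just assumptions, not anything
that exists in ZFS):

import time

ONLINE, WAITING = "ONLINE", "WAITING"

class VdevWatch:
    def __init__(self, wait_timeout=2.0):
        self.wait_timeout = wait_timeout    # admin-chosen, per pool
        self.state = ONLINE
        self.oldest_outstanding = None      # start time of the oldest pending I/O

    def io_issued(self, now=None):
        if self.oldest_outstanding is None:
            self.oldest_outstanding = now if now is not None else time.time()

    def io_completed(self):
        self.oldest_outstanding = None
        if self.state == WAITING:           # the device answered again
            self.state = ONLINE

    def poll(self, now=None):
        # Called periodically; a silent device gets flagged WAITING so the
        # pool can satisfy reads and writes from the remaining redundancy.
        # The existing failure modes (FMA faulting the device) still apply later.
        now = now if now is not None else time.time()
        if (self.state == ONLINE and self.oldest_outstanding is not None
                and now - self.oldest_outstanding > self.wait_timeout):
            self.state = WAITING
        return self.state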

The ZFS update I feel is particularly appropriate.  ZFS already uses 
checksumming since it doesn't trust drivers or hardware to always return the 
correct data.  But ZFS then trusts those same drivers and hardware absolutely 
when it comes to the availability of the pool.

I believe ZFS should apply the same tough standards to pool availability as it 
does to 

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Bob Friesenhahn
On Thu, 28 Aug 2008, Ross wrote:

 I believe ZFS should apply the same tough standards to pool 
 availability as it does to data integrity.  A bad checksum makes ZFS 
 read the data from elsewhere, why shouldn't a timeout do the same 
 thing?

A problem is that for some devices, a five minute timeout is ok.  For 
others, there must be a problem if the device does not respond in a 
second or two.

If the system or device is simply overwhelmed with work, then you would 
not want the system to go haywire and make the problems much worse.

Which of these do you prefer?

   o System waits substantial time for devices to (possibly) recover in
 order to ensure that subsequently written data has the least
 chance of being lost.

   o System immediately ignores slow devices and switches to
 non-redundant non-fail-safe non-fault-tolerant may-lose-your-data
 mode.  When system is under intense load, it automatically
 switches to the may-lose-your-data mode.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Eric Schrock
Ross, thanks for the feedback.  A couple points here -

A lot of work went into improving the error handling around build 77 of
Nevada.  There are still problems today, but a number of the
complaints we've seen are on s10 software or older Nevada builds that
didn't have these fixes.  Anything from the pre-2008 (or pre-s10u5)
timeframe should be taken with a grain of salt.

There is a fix in the immediate future to prevent I/O timeouts from
hanging other parts of the system - namely administrative commands and
other pool activity.  So I/O to that particular pool will hang, but
you'll still be able to run your favorite ZFS commands, and it won't
impact the ability of other pools to run.

We have some good ideas on how to improve the retry logic.  There is a
flag in Solaris, B_FAILFAST, that tells the drive to not try too hard
getting the data.  However, it can return failure when trying harder
would produce the correct results.  Currently, we try the first I/O with
B_FAILFAST, and if that fails, immediately retry without the flag.  The
idea is to elevate the retry logic to a higher level, so when a read
from a side of a mirror fails with B_FAILFAST, instead of immediately
retrying the same device without the failfast flag, we push the error
higher up the stack, and issue another B_FAILFAST I/O to the other half
of the mirror.  Only if both fail with failfast do we try a more
thorough request (though with ditto blocks we may try another vdev
altogether).  This should improve I/O error latency for a subset of
failure scenarios, and biasing reads away from degraded (but not faulty)
devices should also improve response time.  The tricky part is
incorporating this into the FMA diagnosis engine, as devices may fail
B_FAILFAST requests for a variety of non-fatal reasons.
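
As a very rough sketch of that retry ordering (illustration only, using an
assumed per-child read primitive rather than the real ZFS I/O pipeline):

class ReadError(Exception):
    pass

def mirror_read(children, failfast_read, thorough_read):
    # Round 1: fast-fail attempts, trying every half of the mirror before
    # anybody retries "the hard way".
    for child in children:
        try:
            return failfast_read(child)     # B_FAILFAST-style attempt
        except ReadError:
            continue                        # push the error up, try the next child
    # Round 2: only when every fast attempt has failed, fall back to the
    # slow, thorough request (ditto blocks could widen the search further).
    for child in children:
        try:
            return thorough_read(child)
        except ReadError:
            continue
    raise ReadError("all mirror children failed")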

Finally, imposing additional timeouts in ZFS is a bad idea.  ZFS is
designed to be a generic storage consumer.  It can be layered on top of
directly attached disks, SSDs, SAN devices, iSCSI targets, files, and
basically anything else.  As such, it doesn't have the necessary context
to know what constitutes a reasonable timeout.  This is explicitly
delegated to the underlying storage subsystem.  If a storage subsystem
is timing out for excessive periods of time when B_FAILFAST is set, then
that's a bug in the storage subsystem, and working around it in ZFS with
yet another set of tunables is not practical.  It will be interesting to
see if this is an issue after the retry logic is modified as described
above.

Hope that helps,

- Eric

On Thu, Aug 28, 2008 at 01:08:26AM -0700, Ross wrote:
 Since somebody else has just posted about their entire system locking up when 
 pulling a drive, I thought I'd raise this for discussion.
 
 I think Ralf made a very good point in the other thread.  ZFS can guarantee 
 data integrity; what it can't do is guarantee data availability.  The problem 
 is, the way ZFS is marketed, people expect it to be able to do just that.
 
 This turned into a longer thread than expected, so I'll start with what I'm 
 asking for, and then attempt to explain my thinking.  I'm essentially asking 
 for two features to improve the availability of ZFS pools:
 
 - Isolation of storage drivers so that buggy drivers do not bring down the OS.
 
 - ZFS timeouts to improve pool availability when no timely response is 
 received from storage drivers.
 
 And my reasons for asking for these are that there are now many, many posts on 
 here about people experiencing either total system lockup or ZFS lockup after 
 removing a hot swap drive, and indeed while some of them are using consumer 
 hardware, others have reported problems with server grade kit that definitely 
 should be able to handle these errors:
 
 Aug 2008:  AMD SB600 - System hang
  - http://www.opensolaris.org/jive/thread.jspa?threadID=70349
 Aug 2008:  Supermicro SAT2-MV8 - System hang
  - http://www.opensolaris.org/jive/thread.jspa?messageID=271218
 May 2008: Sun hardware - ZFS hang
  - http://opensolaris.org/jive/thread.jspa?messageID=240481
 Feb 2008:  iSCSI - ZFS hang
  - http://www.opensolaris.org/jive/thread.jspa?messageID=206985
 Oct 2007:  Supermicro SAT2-MV8 - system hang
  - http://www.opensolaris.org/jive/thread.jspa?messageID=166037
 Sept 2007:  Fibre channel
  - http://opensolaris.org/jive/thread.jspa?messageID=151719
 ... etc
 
 Now while the root cause of each of these may be slightly different, I feel 
 it would still be good to address this if possible as it's going to affect 
 the perception of ZFS as a reliable system.
 
 The common factor in all of these is that either the solaris driver hangs and 
 locks the OS, or ZFS hangs and locks the pool.  Most of these are for 
 hardware that should handle these failures fine (mine occurred for hardware 
 that definitely works fine under windows), so I'm wondering:  Is there 
 anything that can be done to prevent either type of lockup in these 
 situations?
 
 Firstly, for the OS, if a storage component (hardware or driver) 

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Miles Nordin
 es == Eric Schrock [EMAIL PROTECTED] writes:

es Finally, imposing additional timeouts in ZFS is a bad idea.
es [...] As such, it doesn't have the necessary context to know
es what constitutes a reasonable timeout.

you're right in terms of fixed timeouts, but there's no reason it
can't compare the performance of redundant data sources, and if one
vdev performs an order of magnitude slower than another set of vdevs
with sufficient redundancy, stop issuing reads except scrubs/healing
to the underperformer (issue writes only), and pass an event to FMA.

ZFS can also compare the performance of a drive to itself over time,
and if the performance suddenly decreases, do the same.

The former case eliminates the need for the mirror policies in SVM,
which Ian requested a few hours ago for the situation that half the
mirror is a slow iSCSI target for geographic redundancy and half is
faster/local.  Some care would have to be taken for targets shared by
ZFS and some other initiator, but I'm not sure the care would really
be that difficult to take, or that the oscillations induced by failing
to take it would really be particularly harmful compared to
unsupervised contention for a device.

The latter quickly notices drives that have been pulled, or, for
Richard's ``overwhelmingly dominant'' case, drives which are
stalled for 30 seconds pending their report of an unrecovered read.

Developing meaningful performance statistics for drives and a tool for
displaying them would be useful for itself, not just for stopping
freezes and preventing a failing drive from degrading performance a
thousandfold.

Issuing reads to redundant devices is cheap compared to freezing.  The
policy with which it's done is highly tunable and should be fun to
tune and watch, and the consequence if the policy makes the wrong
choice isn't incredibly dire.


This B_FAILFAST architecture captures the situation really poorly.

First, it's not implementable in any serious way with near-line
drives, or really with any drives with which you're not intimately
familiar and in control of firmware/release-engineering, and perhaps
not with any drives period.  I suspect in practice it's more a
controller-level feature, about whether or not you'd like to distrust
the device's error report and start resetting busses and channels and
mucking everything up trying to recover from some kind of
``weirdness''.  It's not an answer to the known problem of drives
stalling for 30 seconds when they start to fail.

First and a half, when it's not implemented, the system degrades to
doubling your timeout pointlessly.  A driver-level block cache of
UNC's would probably have more value toward this
speed/read-aggressiveness tradeoff than the whole B_FAILFAST
architecture---just cache known unrecoverable read sectors, and refuse
to issue further I/O for them until a timeout of 3 - 10 minutes
passes.  I bet this would speed up most failures tremendously, and
without burdening upper layers with retry logic.
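
A toy model of such a UNC cache might look like the following; the hold time
and the interfaces are assumptions for the sake of illustration, not an
existing driver feature:

import time

class UncCache:
    # Remember sectors that returned unrecoverable-read errors and fail
    # repeat requests for them immediately until the entry expires.
    def __init__(self, hold_seconds=300):   # somewhere in the 3-10 minute range
        self.hold_seconds = hold_seconds
        self.bad = {}                       # lba -> time the error was seen

    def note_unrecoverable(self, lba, now=None):
        self.bad[lba] = now if now is not None else time.time()

    def should_short_circuit(self, lba, now=None):
        now = now if now is not None else time.time()
        seen = self.bad.get(lba)
        if seen is None:
            return False
        if now - seen > self.hold_seconds:
            del self.bad[lba]               # aged out; let a real retry through
            return False
        return True                         # fail fast, don't touch the drive again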

Second, B_FAILFAST entertains the fantasy that I/O's are independent,
while what happens in practice is that the drive hits a UNC on one
I/O, and won't entertain any further I/O's no matter what flags the
request has on it or how many times you ``reset'' things.


Maybe you could try to rescue B_FAILFAST by putting clever statistics
into the driver to compare the drive's performance to recent past as I
suggested ZFS do, and admit no B_FAILFAST requests to queues of drives
that have suddenly slowed down, just fail them immediately without
even trying.  I submit this queueing and statistic collection is
actually _better_ managed by ZFS than the driver because ZFS can
compare a whole floating-point statistic across a whole vdev, while
even a driver which is fancier than we ever dreamed, is still playing
poker with only 1 bit of input ``I'll call,'' or ``I'll fold.''  ZFS
can see all the cards and get better results while being stupider and
requiring less clever poker-guessing than would be required by a
hypothetical driver B_FAILFAST implementation that actually worked.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Eric Schrock
On Thu, Aug 28, 2008 at 02:17:08PM -0400, Miles Nordin wrote:
 
 you're right in terms of fixed timeouts, but there's no reason it
 can't compare the performance of redundant data sources, and if one
 vdev performs an order of magnitude slower than another set of vdevs
 with sufficient redundancy, stop issuing reads except scrubs/healing
 to the underperformer (issue writes only), and pass an event to FMA.

Yep, latency would be a useful metric to add to mirroring choices.
The current logic is rather naive (round-robin) and could easily be
enhanced.

Making diagnoses based on this is much trickier, particularly at the ZFS
level.  A better option would be to leverage the SCSI FMA work going on
to do a more intimate diagnosis at the scsa level.

Also, the problem you are trying to solve - timing out the first I/O to
take a long time - is not captured well by the type of hysteresis you
would need to perform in order to do this diagnosis.  It certainly can
be done, but is much better suited to diagnosing a failing drive over
time, not aborting a transaction in response to immediate failure.

 This B_FAILFAST architecture captures the situation really poorly.

I don't think you understand how this works.  Imagine two I/Os, just
with different sd timeouts and retry logic - that's B_FAILFAST.  It's
quite simple, and independent of any hardware implementation.

- Eric

--
Eric Schrock, Fishworkshttp://blogs.sun.com/eschrock
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Bob Friesenhahn
On Thu, 28 Aug 2008, Miles Nordin wrote:

 you're right in terms of fixed timeouts, but there's no reason it
 can't compare the performance of redundant data sources, and if one
 vdev performs an order of magnitude slower than another set of vdevs
 with sufficient redundancy, stop issuing reads except scrubs/healing
 to the underperformer (issue writes only), and pass an event to FMA.

You are saying that I can't split my mirrors between a local disk in 
Dallas and a remote disk in New York accessed via iSCSI?  Why don't 
you want me to be able to do that?

ZFS already backs off from writing to slow vdevs.

 ZFS can also compare the performance of a drive to itself over time,
 and if the performance suddenly decreases, do the same.

While this may be useful for reads, I would hate to disable redundancy 
just because a device is currently slow.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Ross Smith

Hi guys,

Bob, my thought was to have this timeout as something that can be optionally 
set by the administrator on a per pool basis.  I'll admit I was mainly thinking 
about reads and hadn't considered the write scenario, but even having thought 
about that it's still a feature I'd like.  After all, this would be a timeout 
set by the administrator based on the longest delay they can afford for that 
storage pool.

Personally, if a SATA disk wasn't responding to any requests after 2 seconds I 
really don't care if an error has been detected, as far as I'm concerned that 
disk is faulty.  I'd be quite happy for the array to drop to a degraded mode 
based on that and for writes to carry on with the rest of the array.

Eric, thanks for the extra details, they're very much appreciated.  It's good 
to hear you're working on this, and I love the idea of doing a B_FAILFAST read 
on both halves of the mirror.

I do have a question though.  From what you're saying, the response time can't 
be consistent across all hardware, so you're once again at the mercy of the 
storage drivers.  Do you know how long B_FAILFAST takes to return a 
response on iSCSI?  If that's over 1-2 seconds I would still consider that too 
slow, I'm afraid.

I understand that Sun in general don't want to add fault management to ZFS, but 
I don't see how this particular timeout does anything other than help ZFS when 
it's dealing with such a diverse range of media.  I agree that ZFS can't know 
itself what should be a valid timeout, but that's exactly why this needs to be 
an optional administrator set parameter.  The administrator of a storage array 
who wants to set this certainly knows what a valid timeout is for them, and 
these timeouts are likely to be several orders of magnitude larger than the 
standard response times.  I would configure very different values for my SATA 
drives as for my iSCSI connections, but in each case I would be happier knowing 
that ZFS has more of a chance of catching bad drivers or unexpected scenarios.

I very much doubt hardware raid controllers would wait 3 minutes for a drive to 
return a response, they will have their own internal timeouts to know when a 
drive has failed, and while ZFS is dealing with very different hardware I can't 
help but feel it should have that same approach to management of its drives.

However, that said, I'll be more than willing to test the new
B_FAILFAST logic on iSCSI once it's released.  Just let me know when
it's out.


Ross





 Date: Thu, 28 Aug 2008 11:29:21 -0500
 From: [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]
 CC: zfs-discuss@opensolaris.org
 Subject: Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / 
 driver failure better
 
 On Thu, 28 Aug 2008, Ross wrote:
 
  I believe ZFS should apply the same tough standards to pool 
  availability as it does to data integrity.  A bad checksum makes ZFS 
  read the data from elsewhere, why shouldn't a timeout do the same 
  thing?
 
 A problem is that for some devices, a five minute timeout is ok.  For 
 others, there must be a problem if the device does not respond in a 
 second or two.
 
 If the system or device is simply overwhelmed with work, then you would 
 not want the system to go haywire and make the problems much worse.
 
 Which of these do you prefer?
 
o System waits substantial time for devices to (possibly) recover in
  order to ensure that subsequently written data has the least
  chance of being lost.
 
o System immediately ignores slow devices and switches to
  non-redundant non-fail-safe non-fault-tolerant may-lose-your-data
  mode.  When system is under intense load, it automatically
  switches to the may-lose-your-data mode.
 
 Bob
 ==
 Bob Friesenhahn
 [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Miles Nordin
 es == Eric Schrock [EMAIL PROTECTED] writes:

es I don't think you understand how this works.  Imagine two
es I/Os, just with different sd timeouts and retry logic - that's
es B_FAILFAST.  It's quite simple, and independent of any
es hardware implementation.

AIUI the main timeout to which we should be subject, at least for
nearline drives, is about 30 seconds long and is decided by the
drive's firmware, not the driver, and can't be negotiated in any way
that's independent of the hardware implementation, although sometimes
there are dependent ways to negotiate it.  The driver could also
decide through ``retry logic'' to time out the command sooner, before
the drive completes it, but this won't do much good because the drive
won't accept a second command until ITS timeout expires.

which leads to the second problem, that we're talking about timeouts
for individual I/O's, not marking whole devices.  A ``fast'' timeout
of even 1 second could cause a 100- or 1000-fold decrease in
performance, which could end up being equivalent to a freeze depending
on the type of load on the filesystem.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Miles Nordin
 bf == Bob Friesenhahn [EMAIL PROTECTED] writes:

bf If the system or device is simply overwelmed with work, then
bf you would not want the system to go haywire and make the
bf problems much worse.

None of the decisions I described its making based on performance
statistics are ``haywire''---I said it should funnel reads to the
faster side of the mirror, and do this really quickly and
unconservatively.  What's your issue with that?

bf You are saying that I can't split my mirrors between a local
bf disk in Dallas and a remote disk in New York accessed via
bf iSCSI?

nope, you've misread.  I'm saying reads should go to the local disk
only, and writes should go to both.  See SVM's 'metaparam -r'.  I
suggested that unlike the SVM feature it should be automatic, because
by so being it becomes useful as an availability tool rather than just
performance optimisation.

The performance-statistic logic should influence read scheduling
immediately, and generate events which are fed to FMA, then FMA can
mark devices faulty.  There's no need for both to make the same
decision at the same time.  If the events aren't useful for diagnosis,
ZFS could not bother generating them, or fmd could ignore them in its
diagnosis.  I suspect they *would* be useful, though.

I'm imagining the read rescheduling would happen very quickly, quicker
than one would want a round-trip from FMA, in much less than a second.
That's why it would have to compare devices to others in the same
vdev, and to themselves over time, rather than use fixed timeouts or
punt to haphazard driver and firmware logic.

bf o System waits substantial time for devices to (possibly)
bf recover in order to ensure that subsequently written data has
bf the least chance of being lost.

There's no need for the filesystem to *wait* for data to be written,
unless you are calling fsync.  and maybe not even then if there's a
slog.

I said clearly that you read only one half of the mirror, but write to
both.  But you're right that the trick probably won't work
perfectly---eventually dead devices need to be faulted.  The idea is
that normal write caching will buy you orders of magnitude longer time
in which to make a better decision before anyone notices.

Experience here is that ``waits substantial time'' usually means
``freezes for hours and gets rebooted''.  There's no need to be
abstract: we know what happens when a drive starts taking 1000x -
2000x longer than usual to respond to commands, and we know that this
is THE common online failure mode for drives.  That's what started the
thread.  so, think about this: hanging for an hour trying to write to
a broken device may block other writes to devices which are still
working, until the patiently-waiting data is eventually lost in the
reboot.

bf o System immediately ignores slow devices and switches to
bf non-redundant non-fail-safe non-fault-tolerant
bf may-lose-your-data mode.  When system is under intense load,
bf it automatically switches to the may-lose-your-data mode.

nobody's proposing a system which silently rocks back and forth
between faulted and online.  That's not what we have now, and no such
system would naturally arise.  If FMA marked a drive faulty based on
performance statistics, that drive would get retired permanently and
hot-spare-replaced.  Obviously false positives are bad, just as
obviously as freezes/reboots are bad.

It's not my idea to use FMA in this way.  This is how FMA was pitched,
and the excuse for leaving good exception handling out of ZFS for two
years.  so, where's the beef?


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Eric Schrock
On Thu, Aug 28, 2008 at 08:34:24PM +0100, Ross Smith wrote:
 
 Personally, if a SATA disk wasn't responding to any requests after 2
 seconds I really don't care if an error has been detected, as far as
 I'm concerned that disk is faulty.

Unless you have power management enabled, or there's a bad region of the
disk, or the bus was reset, or...

 I do have a question though.  From what you're saying, the response
 time can't be consistent across all hardware, so you're once again at
 the mercy of the storage drivers.  Do you know how long does
 B_FAILFAST takes to return a response on iSCSI?  If that's over 1-2
 seconds I would still consider that too slow I'm afraid.

Its main function is how it deals with retryable errors.  If the drive
responds with a retryable error, or any error at all, it won't attempt
to retry again.  If you have a device that is taking arbitrarily long to
respond to successful commands (or to notice that a command won't
succeed), it won't help you.

 I understand that Sun in general don't want to add fault management to
 ZFS, but I don't see how this particular timeout does anything other
 than help ZFS when it's dealing with such a diverse range of media.  I
 agree that ZFS can't know itself what should be a valid timeout, but
 that's exactly why this needs to be an optional administrator set
 parameter.  The administrator of a storage array who wants to set this
 certainly knows what a valid timeout is for them, and these timeouts
 are likely to be several orders of magnitude larger than the standard
 response times.  I would configure very different values for my SATA
 drives as for my iSCSI connections, but in each case I would be
 happier knowing that ZFS has more of a chance of catching bad drivers
 or unexpected scenarios.

The main problem with exposing tunables like this is that they have a
direct correlation to service actions, and mis-diagnosing failures costs
everybody (admin, companies, Sun, etc) lots of time and money.  Once you
expose such a tunable, it will be impossible to trust any FMA diagnosis,
because you won't be able to know whether it was a mistaken tunable.

A better option would be to not use this to perform FMA diagnosis, but
instead work into the mirror child selection code.  This has already
been alluded to before, but it would be cool to keep track of latency
over time, and use this to both a) prefer one drive over another when
selecting the child and b) proactively timeout/ignore results from one
child and select the other if it's taking longer than some historical
standard deviation.  This keeps away from diagnosing drives as faulty,
but does allow ZFS to make better choices and maintain response times.
It shouldn't be hard to keep track of the average and/or standard
deviation and use it for selection; proactively timing out the slow I/Os
is much trickier.
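
A back-of-the-envelope sketch of that bookkeeping (illustration only, not the
actual mirror child selection code) could look something like:

class ChildLatency:
    # Welford's running mean/variance of observed read latency for one child.
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def record(self, latency):
        self.n += 1
        delta = latency - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (latency - self.mean)

    def stddev(self):
        return (self.m2 / self.n) ** 0.5 if self.n > 1 else 0.0

def pick_child(stats):
    # stats: dict of child -> ChildLatency; prefer the lowest observed mean.
    return min(stats, key=lambda c: stats[c].mean)

def looks_overdue(stat, elapsed, k=4.0):
    # Crude "slower than its own history" test for the proactive-timeout case.
    return stat.n > 10 and elapsed > stat.mean + k * stat.stddev()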

As others have mentioned, things get more difficult with writes.  If I
issue a write to both halves of a mirror, should I return when the first
one completes, or when both complete?  One possibility is to expose this
as a tunable, but any such "best effort" RAS is a little dicey because
you have very little visibility into the state of the pool in this
scenario - "is my data protected?" becomes a very difficult question to
answer.

- Eric

--
Eric Schrock, Fishworkshttp://blogs.sun.com/eschrock
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Bob Friesenhahn
On Thu, 28 Aug 2008, Miles Nordin wrote:

 None of the decisions I described its making based on performance
 statistics are ``haywire''---I said it should funnel reads to the
 faster side of the mirror, and do this really quickly and
 unconservatively.  What's your issue with that?

From what I understand, this is partially happening now based on 
average service time.  If I/O is backed up for a device, then the 
other device is preferred.  However it is good to keep in mind that if 
data is never read, then it is never validated and corrected.  It is 
good for ZFS to read data sometimes.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Bill Sommerfeld
On Thu, 2008-08-28 at 13:05 -0700, Eric Schrock wrote:
 A better option would be to not use this to perform FMA diagnosis, but
 instead work into the mirror child selection code.  This has already
 been alluded to before, but it would be cool to keep track of latency
 over time, and use this to both a) prefer one drive over another when
 selecting the child and b) proactively timeout/ignore results from one
 child and select the other if it's taking longer than some historical
 standard deviation.  This keeps away from diagnosing drives as faulty,
 but does allow ZFS to make better choices and maintain response times.
 It shouldn't be hard to keep track of the average and/or standard
 deviation and use it for selection; proactively timing out the slow I/Os
 is much trickier.

tcp has to solve essentially the same problem: decide when a response is
overdue based only on the timing of recent successful exchanges in a
context where it's difficult to make assumptions about reasonable
expected behavior of the underlying network.

it tracks both the smoothed round trip time and the variance, and
declares a response overdue after (SRTT + K * variance).

I think you'd probably do well to start with something similar to what's
described in http://www.ietf.org/rfc/rfc2988.txt and then tweak based on
experience.
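
For illustration, a fairly direct transcription of that estimator into
Python, using the RFC's constants (alpha = 1/8, beta = 1/4, K = 4); how the
samples get fed from the I/O pipeline, and the floor value, are assumptions:

class OverdueEstimator:
    # RFC 2988-style estimator: feed it per-I/O service times, then ask when
    # an outstanding I/O should be considered overdue.
    def __init__(self, k=4.0, floor=0.2):   # the RFC floors its RTO at 1 second
        self.srtt = None
        self.rttvar = None
        self.k = k
        self.floor = floor

    def record(self, rtt):
        if self.srtt is None:               # first sample
            self.srtt, self.rttvar = rtt, rtt / 2.0
        else:
            self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - rtt)
            self.srtt = 0.875 * self.srtt + 0.125 * rtt

    def overdue_after(self):
        if self.srtt is None:
            return None                     # no history yet; fall back to driver timeouts
        return max(self.floor, self.srtt + self.k * self.rttvar)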

- Bill





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss