Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-28 Thread Miles Nordin
 es == Eric Schrock [EMAIL PROTECTED] writes:

es Are you running your experiments on build 101 or later?

no.

Aside from that quick one for copies=2, I'm pretty bad about running
well-designed experiments, and I do have old builds.  I need to buy
more hardware.

It's hard to know how to get the most stable system.  I bet it'll be a
year before this b101 stuff makes it into stable Solaris, yet the
bleeding-edge improvements are all stability-related, so for
mostly-ZFS jobs maybe it's better to run SXCE than sol10 in
production.  I suppose I should be happy about that since it means
more people will have some source. :)

es P.S. I'm also not sure that B_FAILFAST behaves in the way you
es think it does.  My reading of sd.c seems to imply that much of
es what you suggest is actually how it currently behaves,

Yeah, I got a private email referring me to the spec for
PSARC/2002/126 which already included both pieces I hoped for
(killing queued CDB's, and statefully tracking each device as
failed/good), so I take back what I said about B_FAILFAST being
useless---it should be able to help the ZFS availability problems
we've seen.  

The PSARC case says B_FAILFAST is implemented in the ``disk driver'', which
AIUI is above the controller, just as I hoped, but there is more than
one ``disk driver'', so the B_FAILFAST stuff is not factored out into one
spot the way a vdev-level system would be, but rather punted downwards
and pasted piecemeal into sd, ssd, dad, and so on, so whatever experience
you get with it isn't necessarily portable to disks with a different
kind of attachment.
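
For anyone who hasn't dug into it: B_FAILFAST is just a flag on the
buf(9S) structure that a block-device consumer sets before handing the
buf to the driver's strategy routine.  Roughly, a caller does something
like the following (a rough, illustrative sketch only, with error
handling elided; the real ZFS plumbing in vdev_disk.c is more involved):

#include <sys/types.h>
#include <sys/buf.h>
#include <sys/kmem.h>
#include <sys/ddi.h>
#include <sys/sunddi.h>

/*
 * Sketch: issue one read with B_FAILFAST so the disk driver (sd, ssd,
 * ...) gives up early instead of running its full retry state machine.
 */
static int
failfast_read(dev_t dev, daddr_t blkno, caddr_t addr, size_t len)
{
        buf_t *bp = getrbuf(KM_SLEEP);          /* allocate a raw buf */
        int err;

        bp->b_flags = B_READ | B_FAILFAST;      /* ask for fast failure */
        bp->b_edev = dev;
        bp->b_lblkno = blkno;
        bp->b_un.b_addr = addr;
        bp->b_bcount = len;

        (void) bdev_strategy(bp);               /* hand off to the disk driver */
        err = biowait(bp);                      /* B_FAILFAST only trims the
                                                   driver's retries; it does not
                                                   guarantee a quick return */
        freerbuf(bp);
        return (err);
}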

I still think the vdev-layer logic could make better decisions by
using more than the 1 bit of information per device, but maybe 1-bit
B_FAILFAST is enough to make me accept the shortfall as an
arguable-feature rather than a unanimous-bug.  Also, if it can fix my
(1) and (2) with FMA, then maybe the gap between B_FAILFAST and real
NetApp-like drive diagnosis can be closed partly in userspace, the way
developers seem to want.

The problems this doesn't cover are write-related:

 * what should we do about implicit and explicit fsync()s where all
   the data is already on stable storage, but not with full
   redundancy---one device won't finish writing?

   I think there should not be transparent recovery from this, though
   maybe others disagree.  But pool-level failmode doesn't settle the
   issue:

   (a) _when_ will you take the failure action (if failmode != wait)?
   The property says *what* to do, not *when* to do it.

   (b) There isn't any vdev-level failure, only device-level, so it's
   not appropriate to consult the failmode property in the first
   place---the situation is different.  The question is, do we
   keep trying, or do we transition the device to FAULTED and the
   vdev to DEGRADED so that fsync()'s can proceed without that
   device and hotspare resilver kicks in?

   (c) Inside the time interval between when the device starts writing
   slowly and when you take the (b) action, how well can you
   isolate the failure?  For example, can you ensure that
   read-only access remains instantaneous, even though atime
   updates involve writing, even though these 5-second txg flushes
   are blocked, and even though the admin might (gasp!) type
   'zpool status'---or even a label-writing command like 'zpool
   attach'?  Or will one of those three things cause a pool-wide or
   ZFS-wide hang that blocks read access which could theoretically
   still work?

 * commands like zpool attach, detach, replace, offline, export 

(a) should not be uninterruptibly hangable.

(b) Problems in one pool should not spill over into another.

(c) And finally they should be forcible even when they can't write
everything they'd like to, so that rebooting isn't a necessary
move in certain kinds of failure-recovery pool gymnastics.

I expect there's some quiet work on this in b101 also---at least
someone said 'zpool status' isn't supposed to hang anymore?  So I'll
have to try it out, but B_FAILFAST isn't enough to settle the whole
issue, even modulo the marginal performance improvement that more
ambitiously wacky schemes might promise us.




Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-26 Thread Miles Nordin
 rs == Ross Smith [EMAIL PROTECTED] writes:
 nw == Nicolas Williams [EMAIL PROTECTED] writes:

rs I disagree Bob, I think this is a very different function to
rs that which FMA provides.

I see two problems.

 (1) FMA doesn't seem to work very well, and was used as an excuse to
 keep proper exception handling out of ZFS for a couple of years, so
 I'm sort of... skeptical whenever it's brought up as a panacea.

 (2) The FMA model of collecting telemetry, taking it into
 user-space, chin-strokingly contemplating it for a while, then
 decreeing a diagnosis, is actually a rather limited one.  I can
 think of two kinds of limit:

 (a) you're diagnosing the pool FMA is running on.  FMA is on the
 root pool, but the root pool won't unfreeze until FMA
 diagnoses it.

 In practice it's much worse, because problems in one pool's
 devices can freeze all of ZFS, even other pools.  Or if the
 system is NFS-rooted and also exporting ZFS filesystems over
 NFS, maybe all of NFS freezes?  Problems like that knock
 out FMA.  Diagnosis in the kernel is harder to knock out.

 (b) calls are sleeping uninterruptibly in the path that returns
 events to FMA.  ``Call down into the controller driver, wait
 for return success or failure, then count the event and
 call back to FMA as appropriate.  If something's borked, FMA
 will eventually return a diagnosis.''  This plan is useless if
 the controller just freezes.  FMA never sees anything.  You
 are analyzing faults, yes, but you can only do it with
 hindsight.  When do you do the FMA callback?  To implement
 this timeout, you'd have to do a callback before and after
 each I/O, which is obviously too expensive.

 Likewise, when FMA returns the diagnosis, are you prepared to
 act on it?  Or are you busy right now, and you're going to
 act on it just as soon as that controller returns success or
 failure?

 You can't abstract the notion of time out of your diagnosis.
 Trying to compartmentalize it interferes with working it into
 low-level event loops in a way that's sometimes needed.

It's not a matter of where things taxonomically belong, where it feels
clean to put some functionality in your compartmentalized layered
tower.  Certain things just aren't achievable from certain places.

nw If we're talking isolated, or even clumped-but-relatively-few
nw bad sectors, then having a short timeout for writes and
nw remapping should be possible 

I'm not sure I understand the state machine for the remapping plan
but...I think your idea is, try to write to some spot on the disk.  If
it takes too long, cancel the write, and try writing somewhere else
instead.  Then do bad-block-remapping: fix up all the pointers for the
new location, mark the spot that took too long as poisonous, all that.

I don't think it'll work.  First, you can't cancel the write.  Once
you dispatch a write that hangs, you've locked up, at a minimum, the
drive trying to write.  You don't get the option of remapping and
writing elsewhere, because the drive's stopped listening to you.
Likely, you've also locked up the bus (if the drive's on PATA or
SCSI), or maybe the whole controller.  (This is IMHO the best reason
for laying out a RAID to survive a controller failure---interaction
with a bad drive could freeze a whole controller.)

Even if you could cancel the write, when do you cancel it?  If you can
learn your drive and controller so well you convince them to ignore
you for 10 seconds instead of two minutes when they hit a block they
can't write, you've got approximately the same problem, because you
don't know where the poison sectors are.  You'll probably hit another
one.  Even a ten-second write means the drive's performance is shot by
almost three orders of magnitude---it's not workable.

Finally, this approach interferes with diagnosis.  The drives have
their own retry state machine.  If you start muddling all this ad-hoc
stuff on top of it you can't tell the difference between drive
failures, cabling problems, controller failures.  You end up with
normal thermal recalibration events being treated as some kind of
``spurious late read'' and inventing all these strange unexplained
failure terms which make it impossible to write a paper like the
Netapp or Google papers on UNC's we used to cite in here all the time,
because your failure statistics no longer correspond to a single layer
of the storage stack and can't be compared to others' statistics.
Also, remember that we suspect and wish to tolerate drives that
operate many standard deviations outside their specification, even
when they're not broken or suspect or about to break.  There are two
reasons.  First, we think they might do it.  Second, otherwise you
can't collect performance statistics you can compare with others'.

That's why 

Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-26 Thread Bob Friesenhahn
On Wed, 26 Nov 2008, Miles Nordin wrote:

 (2) The FMA model of collecting telemmetry, taking it into
 user-space, chin-strokingly contemplating it for a while, then
 decreeing a diagnosis, is actually a rather limited one.  I can
 think of two kinds of limit:

 (a) you're diagnosing the pool FMA is running on.  FMA is on the
 root pool, but the root pool won't unfreeze until FMA
 diagnoses it.

I did not have time to read most of your lengthy thesis but I agree 
that FMA is useless if the motherboard catches fire.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/



Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-26 Thread Eric Schrock
On Wed, Nov 26, 2008 at 07:02:11PM -0500, Miles Nordin wrote:
  (2) The FMA model of collecting telemmetry, taking it into
  user-space, chin-strokingly contemplating it for a while, then
  decreeing a diagnosis, is actually a rather limited one.  I can
  think of two kinds of limit:

As mentioned previously, this is not an accurate description of what's
going on.  FMA allows diagnosis to happen at the detector when the
telemetry is conclusive and cross-domain or predictive analysis isn't
required.  This is exactly what ZFS does on recent nevada builds.  If a
drive is pathologically broken (i.e. a reopen fails, or reads and writes
to the label fail), it will *immediately* fail the drive and not wait
for any further diagnosis from FMA.

For drives that randomly fail I/Os or take a long time, but otherwise
respond to basic requests, ZFS is often in no better position to perform
a diagnosis in the kernel.  And as of build 101, ZFS behaves much better
in these circumstances by not aggressively retrying commands before
exhausting all other options.

Are you running your experiments on build 101 or later?  And what
experiments are you running?  Drawing conclusions from previous
experience or reports is basically pointless given the amount of change
that has occurred recently (Jeff's putback wasn't nicknamed SPA 3.0
for nothing).  While there are no doubt more rough edges, we have
incorporated much of the previous feedback into new behavior that should
provide a much improved experience.

- Eric

P.S. I'm also not sure that B_FAILFAST behaves in the way you think it
 does.  My reading of sd.c seems to imply that much of what you
 suggest is actually how it currently behaves, but you should
 probably bring up the issue on storage-discuss where you will find
 more experts in this area.

--
Eric Schrock, Fishworks            http://blogs.sun.com/eschrock


Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-25 Thread Jeff Bonwick
I think we (the ZFS team) all generally agree with you.  The current
nevada code is much better at handling device failures than it was
just a few months ago.  And there are additional changes that were
made for the FishWorks (a.k.a. Amber Road, a.k.a. Sun Storage 7000)
product line that will make things even better once the FishWorks team
has a chance to catch its breath and integrate those changes into nevada.
And then we've got further improvements in the pipeline.

The reason this is all so much harder than it sounds is that we're
trying to provide increasingly optimal behavior given a collection of
devices whose failure modes are largely ill-defined.  (Is the disk
dead or just slow?  Gone or just temporarily disconnected?  Does this
burst of bad sectors indicate catastrophic failure, or just localized
media errors?)  The disks' SMART data is notoriously unreliable, BTW.
So there's a lot of work underway to model the physical topology of
the hardware, gather telemetry from the devices, the enclosures,
the environmental sensors etc, so that we can generate an accurate
FMA fault diagnosis and then tell ZFS to take appropriate action.

We have some of this today; it's just a lot of work to complete it.

Oh, and regarding the original post -- as several readers correctly
surmised, we weren't faking anything, we just didn't want to wait
for all the device timeouts.  Because the disks were on USB, which
is a hotplug-capable bus, unplugging the dead disk generated an
interrupt that bypassed the timeout.  We could have waited it out,
but 60 seconds is an eternity on stage.

Jeff

On Mon, Nov 24, 2008 at 10:45:18PM -0800, Ross wrote:
 But that's exactly the problem Richard:  AFAIK.
 
 Can you state that absolutely, categorically, there is no failure mode out 
 there (caused by hardware faults, or bad drivers) that will lock a drive up 
 for hours?  You can't, obviously, which is why we keep saying that ZFS should 
 have this kind of timeout feature.
 
 For once I agree with Miles, I think he's written a really good writeup of 
 the problem here.  My simple view on it would be this:
 
 Drives are only aware of themselves as an individual entity.  Their job is to 
 save and restore data to themselves, and drivers are written to minimise any 
 chance of data loss.  So when a drive starts to fail, it makes complete sense 
 for the driver and hardware to be very, very thorough about trying to read or 
 write that data, and to only fail as a last resort.
 
 I'm not at all surprised that drives take 30 seconds to timeout, nor that 
 they could slow a pool for hours.  That's their job.  They know nothing else 
 about the storage, they just have to do their level best to do as they're 
 told, and will only fail if they absolutely can't store the data.
 
 The raid controller on the other hand (Netapp / ZFS, etc) knows all about the 
 pool.  It knows if you have half a dozen good drives online, it knows if 
 there are hot spares available, and it *should* also know how quickly the 
 drives under its care usually respond to requests.
 
 ZFS is perfectly placed to spot when a drive is starting to fail, and to take 
 the appropriate action to safeguard your data.  It has far more information 
 available than a single drive ever will, and should be designed accordingly.
 
 Expecting the firmware and drivers of individual drives to control the 
 failure modes of your redundant pool is just crazy imo.  You're throwing away 
 some of the biggest benefits of using multiple drives in the first place.


Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-25 Thread Ross Smith
Hey Jeff,

Good to hear there's work going on to address this.

What did you guys think to my idea of ZFS supporting a "waiting for a
response" status for disks as an interim solution that allows the pool
to continue operation while it's waiting for FMA or the driver to
fault the drive?

I do appreciate that it's hard to come up with a definitive "it's dead,
Jim" answer, and I agree that long term the FMA approach will pay
dividends.  But I still feel this is a good short term solution, and
one that would also complement your long term plans.

My justification for this is that it seems to me that you can split
disk behavior into two states:
- returns data ok
- doesn't return data ok

And for the state where it's not returning data, you can again split
that in two:
- returns wrong data
- doesn't return data

The first of these is already covered by ZFS with its checksums (with
FMA doing the extra work to fault drives), so it's just the second
that needs immediate attention, and for the life of me I can't think
of any situation that a simple timeout wouldn't catch.

Personally I'd love to see two parameters, allowing this behavior to
be turned on if desired, and allowing timeouts to be configured:

zfs-auto-device-timeout
zfs-auto-device-timeout-fail-delay

The first sets whether to use this feature, and configures the maximum
time ZFS will wait for a response from a device before putting it in a
waiting status.  The second would be optional and is the maximum
time ZFS will wait before faulting a device (at which point it's
replaced by a hot spare).
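
To make that concrete, the check I'm imagining looks something like the
following rough pseudo-C.  Every name in it is made up; neither property
nor any of these structures exists in ZFS today:

/*
 * Hypothetical sketch of the proposed behaviour.  Times are in
 * nanoseconds; a fail-delay of 0 means "never auto-fault".
 */
typedef enum { VDEV_OK, VDEV_WAITING, VDEV_FAULTED } vdev_health_t;

typedef struct vdev_watch {
        long long       vw_oldest_io;   /* start time of oldest pending I/O */
        long long       vw_timeout;     /* zfs-auto-device-timeout */
        long long       vw_fail_delay;  /* zfs-auto-device-timeout-fail-delay */
        vdev_health_t   vw_state;
} vdev_watch_t;

/* Called periodically while the device has I/O outstanding. */
static void
vdev_watch_tick(vdev_watch_t *vw, long long now)
{
        long long waited = now - vw->vw_oldest_io;

        if (vw->vw_state == VDEV_OK && waited > vw->vw_timeout) {
                /* stop routing new I/O here; prefer other copies */
                vw->vw_state = VDEV_WAITING;
        }
        if (vw->vw_state == VDEV_WAITING && vw->vw_fail_delay != 0 &&
            waited > vw->vw_fail_delay) {
                /* give up and let the hot-spare machinery take over */
                vw->vw_state = VDEV_FAULTED;
        }
}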

The reason I think this will work well with the FMA work is that you
can implement this now and have a real improvement in ZFS
availability.  Then, as the other work starts bringing better modeling
for drive timeouts, the parameters can be either removed, or set
automatically by ZFS.

Long term I guess there's also the potential to remove the second
setting if you felt FMA etc ever got reliable enough, but personally I
would always want to have the final fail delay set.  I'd maybe set it
to a long value such as 1-2 minutes to give FMA, etc a fair chance to
find the fault.  But I'd be much happier knowing that the system will
*always* be able to replace a faulty device within a minute or two, no
matter what the FMA system finds.

The key thing is that you're not faulting devices early, so FMA is
still vital.  The idea is purely to let ZFS keep the pool active by
removing the need for the entire pool to wait on the FMA diagnosis.

As I said before, the driver and firmware are only aware of a single
disk, and I would imagine that FMA also has the same limitation - it's
only going to be looking at a single item and trying to determine
whether it's faulty or not.  Because of that, FMA is going to be
designed to be very careful to avoid false positives, and will likely
take its time to reach an answer in some situations.

ZFS however has the benefit of knowing more about the pool, and in the
vast majority of situations, it should be possible for ZFS to read or
write from other devices while it's waiting for an 'official' result
from any one faulty component.

Ross


On Tue, Nov 25, 2008 at 8:37 AM, Jeff Bonwick [EMAIL PROTECTED] wrote:
 I think we (the ZFS team) all generally agree with you.  The current
 nevada code is much better at handling device failures than it was
 just a few months ago.  And there are additional changes that were
 made for the FishWorks (a.k.a. Amber Road, a.k.a. Sun Storage 7000)
 product line that will make things even better once the FishWorks team
 has a chance to catch its breath and integrate those changes into nevada.
 And then we've got further improvements in the pipeline.

 The reason this is all so much harder than it sounds is that we're
 trying to provide increasingly optimal behavior given a collection of
 devices whose failure modes are largely ill-defined.  (Is the disk
 dead or just slow?  Gone or just temporarily disconnected?  Does this
 burst of bad sectors indicate catastrophic failure, or just localized
 media errors?)  The disks' SMART data is notoriously unreliable, BTW.
 So there's a lot of work underway to model the physical topology of
 the hardware, gather telemetry from the devices, the enclosures,
 the environmental sensors etc, so that we can generate an accurate
 FMA fault diagnosis and then tell ZFS to take appropriate action.

 We have some of this today; it's just a lot of work to complete it.

 Oh, and regarding the original post -- as several readers correctly
 surmised, we weren't faking anything, we just didn't want to wait
 for all the device timeouts.  Because the disks were on USB, which
 is a hotplug-capable bus, unplugging the dead disk generated an
 interrupt that bypassed the timeout.  We could have waited it out,
 but 60 seconds is an eternity on stage.

 Jeff

 On Mon, Nov 24, 2008 at 10:45:18PM -0800, Ross wrote:
 But that's exactly the problem Richard:  AFAIK.

 Can you state that absolutely, 

Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-25 Thread Ross Smith
PS.  I think this also gives you a chance at making the whole problem
much simpler.  Instead of the hard question of "is this faulty?",
you're just trying to say "is it working right now?".

In fact, I'm now wondering if the "waiting for a response" flag
wouldn't be better as "possibly faulty".  That way you could use it
with checksum errors too, possibly with settings as simple as errors
per minute or error percentage.  As with the timeouts, you could
have it off by default (or provide sensible defaults), and let
administrators tweak it for their particular needs.

Imagine a pool with the following settings:
- zfs-auto-device-timeout = 5s
- zfs-auto-device-checksum-fail-limit-epm = 20
- zfs-auto-device-checksum-fail-limit-percent = 10
- zfs-auto-device-fail-delay = 120s

That would allow the pool to flag a device as possibly faulty
regardless of the type of fault, and take immediate proactive action
to safeguard data (generally long before the device is actually
faulted).

A device triggering any of these flags would be enough for ZFS to
start reading from (or writing to) other devices first, and should you
get multiple failures, or problems on a non redundant pool, you always
just revert back to ZFS' current behaviour.
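
Again, purely to illustrate, the trigger could be as dumb as this (all
of the names and limits here are invented; nothing like them exists
today):

/*
 * Hypothetical: mark a device "possibly faulty" when its recent
 * checksum-error rate trips either invented limit.
 */
typedef struct cksum_stats {
        unsigned long   cs_errors_last_min;     /* checksum errors, last 60s */
        unsigned long   cs_ios_last_min;        /* completed I/Os, last 60s */
} cksum_stats_t;

static int
device_possibly_faulty(const cksum_stats_t *cs,
    unsigned long limit_epm, unsigned long limit_pct)
{
        if (cs->cs_errors_last_min >= limit_epm)
                return (1);                     /* errors-per-minute limit */
        if (cs->cs_ios_last_min != 0 &&
            cs->cs_errors_last_min * 100 >= limit_pct * cs->cs_ios_last_min)
                return (1);                     /* error-percentage limit */
        return (0);
}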

Ross





On Tue, Nov 25, 2008 at 8:37 AM, Jeff Bonwick [EMAIL PROTECTED] wrote:
 I think we (the ZFS team) all generally agree with you.  The current
 nevada code is much better at handling device failures than it was
 just a few months ago.  And there are additional changes that were
 made for the FishWorks (a.k.a. Amber Road, a.k.a. Sun Storage 7000)
 product line that will make things even better once the FishWorks team
 has a chance to catch its breath and integrate those changes into nevada.
 And then we've got further improvements in the pipeline.

 The reason this is all so much harder than it sounds is that we're
 trying to provide increasingly optimal behavior given a collection of
 devices whose failure modes are largely ill-defined.  (Is the disk
 dead or just slow?  Gone or just temporarily disconnected?  Does this
 burst of bad sectors indicate catastrophic failure, or just localized
 media errors?)  The disks' SMART data is notoriously unreliable, BTW.
 So there's a lot of work underway to model the physical topology of
 the hardware, gather telemetry from the devices, the enclosures,
 the environmental sensors etc, so that we can generate an accurate
 FMA fault diagnosis and then tell ZFS to take appropriate action.

 We have some of this today; it's just a lot of work to complete it.

 Oh, and regarding the original post -- as several readers correctly
 surmised, we weren't faking anything, we just didn't want to wait
 for all the device timeouts.  Because the disks were on USB, which
 is a hotplug-capable bus, unplugging the dead disk generated an
 interrupt that bypassed the timeout.  We could have waited it out,
 but 60 seconds is an eternity on stage.

 Jeff

 On Mon, Nov 24, 2008 at 10:45:18PM -0800, Ross wrote:
 But that's exactly the problem Richard:  AFAIK.

 Can you state that absolutely, categorically, there is no failure mode out 
 there (caused by hardware faults, or bad drivers) that won't lock a drive up 
 for hours?  You can't, obviously, which is why we keep saying that ZFS 
 should have this kind of timeout feature.

 For once I agree with Miles, I think he's written a really good writeup of 
 the problem here.  My simple view on it would be this:

 Drives are only aware of themselves as an individual entity.  Their job is 
 to save  restore data to themselves, and drivers are written to minimise 
 any chance of data loss.  So when a drive starts to fail, it makes complete 
 sense for the driver and hardware to be very, very thorough about trying to 
 read or write that data, and to only fail as a last resort.

 I'm not at all surprised that drives take 30 seconds to timeout, nor that 
 they could slow a pool for hours.  That's their job.  They know nothing else 
 about the storage, they just have to do their level best to do as they're 
 told, and will only fail if they absolutely can't store the data.

 The raid controller on the other hand (Netapp / ZFS, etc) knows all about 
 the pool.  It knows if you have half a dozen good drives online, it knows if 
 there are hot spares available, and it *should* also know how quickly the 
 drives under its care usually respond to requests.

 ZFS is perfectly placed to spot when a drive is starting to fail, and to 
 take the appropriate action to safeguard your data.  It has far more 
 information available than a single drive ever will, and should be designed 
 accordingly.

 Expecting the firmware and drivers of individual drives to control the 
 failure modes of your redundant pool is just crazy imo.  You're throwing 
 away some of the biggest benefits of using multiple drives in the first 
 place.

Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-25 Thread Ross Smith
No, I count that as "doesn't return data ok", but my post wasn't very
clear at all on that.

Even for a write, the disk will return something to indicate that the
action has completed, so that can also be covered by just those two
scenarios, and right now ZFS can lock the whole pool up if it's
waiting for that response.

My idea is simply to allow the pool to continue operation while
waiting for the drive to fault, even if that's a faulty write.  It
just means that the rest of the operations (reads and writes) can keep
working for the minute (or three) it takes for FMA and the rest of the
chain to flag a device as faulty.

For write operations, the data can be safely committed to the rest of
the pool, with just the outstanding writes for the drive left waiting.
 Then as soon as the device is faulted, the hot spare can kick in, and
the outstanding writes quickly written to the spare.

For single parity, or non redundant volumes there's some benefit in
this.  For dual parity pools there's a massive benefit as your pool
stays available, and your data is still well protected.

Ross



On Tue, Nov 25, 2008 at 10:44 AM,  [EMAIL PROTECTED] wrote:


My justification for this is that it seems to me that you can split
disk behavior into two states:
- returns data ok
- doesn't return data ok


 I think you're missing "won't write".

 There's clearly a difference between "get data from a different copy",
 which you can fix by retrying the read against a different part of the
 redundant data, and writing data: the data which can't be written must
 be kept until the drive is faulted.


 Casper




Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-25 Thread Casper . Dik


My idea is simply to allow the pool to continue operation while
waiting for the drive to fault, even if that's a faulty write.  It
just means that the rest of the operations (reads and writes) can keep
working for the minute (or three) it takes for FMA and the rest of the
chain to flag a device as faulty.

Except when you're writing a lot; 3 minutes can cause a 20GB backlog
for a single disk.

Casper



Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-25 Thread Casper . Dik


My justification for this is that it seems to me that you can split
disk behavior into two states:
- returns data ok
- doesn't return data ok


I think you're missing "won't write".

There's clearly a difference between "get data from a different copy",
which you can fix by retrying the read against a different part of the
redundant data, and writing data: the data which can't be written must be kept
until the drive is faulted.


Casper



Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-25 Thread Ross Smith
Hmm, true.  The idea doesn't work so well if you have a lot of writes,
so there needs to be some thought as to how you handle that.

Just thinking aloud, could the missing writes be written to the log
file on the rest of the pool?  Or temporarily stored somewhere else in
the pool?  Would it be an option to allow up to a certain amount of
writes to be cached in this way while waiting for FMA, and only
suspend writes once that cache is full?

With a large SSD slog device would it be possible to just stream all
writes to the log?  As a further enhancement, might it be possible to
commit writes to the working drives, and just leave the writes for the
bad drive(s) in the slog (potentially saving a lot of space)?

For pools without log devices, I suspect that you would probably need
the administrator to specify the behavior, as I can see several options
depending on the raid level and that pool's priorities for data
availability / integrity:

Drive fault write cache settings:
default - pool waits for device, no writes occur until device or spare
comes online
slog - writes are cached to slog device until full, then pool reverts
to default behavior (could this be the default with slog devices
present?)
pool - writes are cached to the pool itself, up to a set maximum, and
are written to the device or spare as soon as possible.  This assumes
a single parity pool with the other devices available.  If the upper
limit is reached, or another device goes faulty, the pool reverts to
default behaviour.

Storing directly to the rest of the pool would probably want to be off
by default on single parity pools, but I would imagine that it could
be on by default on dual parity pools.
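
Something like this is what I have in mind for the policy itself.  It's
a purely hypothetical sketch; no such setting exists and all the names
are invented:

/*
 * Decide what to do with a write whose target device has stopped
 * responding, according to the (made-up) drive-fault write cache
 * setting described above.
 */
typedef enum { WCACHE_DEFAULT, WCACHE_SLOG, WCACHE_POOL } wcache_policy_t;

typedef struct pending_write_cache {
        unsigned long   pwc_used;       /* bytes currently parked */
        unsigned long   pwc_limit;      /* administrator-set maximum */
        int             pwc_has_slog;   /* pool has a separate log device */
} pending_write_cache_t;

/* Returns 1 if the write can be parked, 0 if the pool must wait. */
static int
park_write(pending_write_cache_t *c, wcache_policy_t policy, unsigned long len)
{
        switch (policy) {
        case WCACHE_SLOG:
                if (!c->pwc_has_slog)
                        return (0);
                /* FALLTHROUGH -- the slog acts as a bounded cache too */
        case WCACHE_POOL:
                if (c->pwc_used + len > c->pwc_limit)
                        return (0);     /* cache full: revert to waiting */
                c->pwc_used += len;
                return (1);
        case WCACHE_DEFAULT:
        default:
                return (0);             /* current behaviour: wait for device */
        }
}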

Would that be enough to allow writes to continue in most circumstances
while the pool waits for FMA?

Ross



On Tue, Nov 25, 2008 at 10:55 AM,  [EMAIL PROTECTED] wrote:


My idea is simply to allow the pool to continue operation while
waiting for the drive to fault, even if that's a faulty write.  It
just means that the rest of the operations (reads and writes) can keep
working for the minute (or three) it takes for FMA and the rest of the
chain to flag a device as faulty.

 Except when you're writing a lot; 3 minutes can cause a 20GB backlog
 for a single disk.

 Casper




Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-25 Thread Toby Thain

On 25-Nov-08, at 5:10 AM, Ross Smith wrote:

 Hey Jeff,

 Good to hear there's work going on to address this.

 What did you guys think to my idea of ZFS supporting a waiting for a
 response status for disks as an interim solution that allows the pool
 to continue operation while it's waiting for FMA or the driver to
 fault the drive?
 ...

 The first of these is already covered by ZFS with its checksums (with
 FMA doing the extra work to fault drives), so it's just the second
 that needs immediate attention, and for the life of me I can't think
 of any situation that a simple timeout wouldn't catch.

 Personally I'd love to see two parameters, allowing this behavior to
 be turned on if desired, and allowing timeouts to be configured:

 zfs-auto-device-timeout
 zfs-auto-device-timeout-fail-delay

 The first sets whether to use this feature, and configures the maximum
 time ZFS will wait for a response from a device before putting it in a
 waiting status.


The shortcomings of timeouts have been discussed on this list before.  
How do you tell the difference between a drive that is dead and a  
path that is just highly loaded?

I seem to recall the argument strongly made in the past that making  
decisions based on a timeout alone can provoke various undesirable  
cascade effects.

   The second would be optional and is the maximum
 time ZFS will wait before faulting a device (at which point it's
 replaced by a hot spare).

 The reason I think this will work well with the FMA work is that you
 can implement this now and have a real improvement in ZFS
 availability.  Then, as the other work starts bringing better modeling
 for drive timeouts, the parameters can be either removed, or set
 automatically by ZFS.
 ... it should be possible for ZFS to read or
 write from other devices while it's waiting for an 'official' result
 from any one faulty component.

Sounds good - devil, meet details, etc.

--Toby


 Ross


 On Tue, Nov 25, 2008 at 8:37 AM, Jeff Bonwick  
 [EMAIL PROTECTED] wrote:
 I think we (the ZFS team) all generally agree with you. ...

 The reason this is all so much harder than it sounds is that we're
 trying to provide increasingly optimal behavior given a collection of
 devices whose failure modes are largely ill-defined.  (Is the disk
 dead or just slow?  Gone or just temporarily disconnected? ...

 Jeff

 On Mon, Nov 24, 2008 at 10:45:18PM -0800, Ross wrote:
 But that's exactly the problem Richard:  AFAIK.

 Can you state that absolutely, categorically, there is no failure  
 mode out there (caused by hardware faults, or bad drivers) that  
 won't lock a drive up for hours?  You can't, obviously, which is  
 why we keep saying that ZFS should have this kind of timeout  
 feature.
 ...


Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-25 Thread Ross Smith
 The shortcomings of timeouts have been discussed on this list before. How do
 you tell the difference between a drive that is dead and a path that is just
 highly loaded?

A path that is dead is either returning bad data, or isn't returning
anything.  A highly loaded path is by definition reading and writing
lots of data.  I think you're assuming that these are file level
timeouts, when this would actually need to be much lower level.


 Sounds good - devil, meet details, etc.

Yup, I imagine there are going to be a few details to iron out, many
of which will need looking at by somebody a lot more technical than
myself.

Despite that I still think this is a discussion worth having.  So far
I don't think I've seen any situation where this would make things
worse than they are now, and I can think of plenty of cases where it
would be a huge improvement.

Of course, it also probably means a huge amount of work to implement.
I'm just hoping that it's not prohibitively difficult, and that the
ZFS team see the benefits as being worth it.


Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-25 Thread Scara Maccai
 Oh, and regarding the original post -- as several
 readers correctly
 surmised, we weren't faking anything, we just didn't
 want to wait
 for all the device timeouts.  Because the disks were
 on USB, which
 is a hotplug-capable bus, unplugging the dead disk
 generated an
 interrupt that bypassed the timeout.  We could have
 waited it out,
 but 60 seconds is an eternity on stage.

I'm sorry, I didn't mean to sound offensive. Anyway I think that people should 
know that their drives can hang the system for minutes, despite ZFS. I mean: 
there are a lot of writings about how ZFS is great for recovery in case a drive 
fails, but there's nothing regarding this problem. I know now it's not ZFS's 
fault; but I wonder how many people set up their drives with ZFS assuming that 
as soon as something goes bad, ZFS will fix it. 
Is there any way to test these cases other than smashing the drive with a 
hammer? Having a failover policy where the failover can't be tested sounds 
scary...


Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-25 Thread Moore, Joe
Ross Smith wrote:
 My justification for this is that it seems to me that you can split
 disk behavior into two states:
 - returns data ok
 - doesn't return data ok
 
 And for the state where it's not returning data, you can again split
 that in two:
 - returns wrong data
 - doesn't return data

The state in discussion in this thread is "the I/O requested by ZFS hasn't 
finished after 60, 120, 180, 3600, etc. seconds".

The pool is waiting (for device timeouts) to distinguish between the first two 
states.

More accurate state descriptions are:
- The I/O has returned data
- The I/O hasn't yet returned data and the user (admin) is justifiably 
impatient.

For the first state, the data is either correct (verified by the ZFS checksums, 
or ESUCCESS on write) or incorrect and retried.

 
 The first of these is already covered by ZFS with its checksums (with
 FMA doing the extra work to fault drives), so it's just the second
 that needs immediate attention, and for the life of me I can't think
 of any situation that a simple timeout wouldn't catch.
 
 Personally I'd love to see two parameters, allowing this behavior to
 be turned on if desired, and allowing timeouts to be configured:
 
 zfs-auto-device-timeout
 zfs-auto-device-timeout-fail-delay

I'd prefer these be set at the (default) pool level:
zpool-device-timeout
zpool-device-timeout-fail-delay

with specific per-VDEV overrides possible:
vdev-device-timeout and vdev-device-fail-delay

This would allow but not require slower VDEVs to be tuned specifically for that 
case without hindering the default pool behavior on the local fast disks.  
Specifically, consider where I'm using mirrored VDEVs with one half over iSCSI, 
and want the iSCSI retry logic to still apply.  Writes that failed 
while the iSCSI link is down would have to be resilvered, but at least reads 
would switch to the local devices faster.

Set them to the default magic 0 value to have the system use the current 
behavior of relying on the device drivers to report failures.
Set them to a number (in ms, probably) and the pool would consider an I/O that 
takes longer than that as "returns invalid data".

When the FMA work discussed below arrives, these could be augmented by the 
pool's best heuristic guess as to what the proper timeouts should be, which 
could be saved in (kstat?) vdev-device-autotimeout.

If you set the timeout to the magic -1 value, the pool would use 
vdev-device-autotimeout.

All that would be required is for the I/O that caused the disk to take a long 
time to be given a deadline (now + (vdev-device-timeout ?: 
(zpool-device-timeout?: forever)))* and consider the I/O complete with whatever 
data has returned after that deadline: if that's a bunch of 0's in a read, 
which would have a bad checksum; or a partially-completed write that would have 
to be committed somewhere else.
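
In code, that deadline selection amounts to something like the sketch
below (the property names are the made-up ones from earlier in this
message; times in milliseconds):

#include <limits.h>

#define TIMEOUT_DEFAULT         0       /* magic 0: rely on the device driver */
#define TIMEOUT_AUTO            (-1)    /* magic -1: use vdev-device-autotimeout */
#define TIMEOUT_FOREVER         LLONG_MAX

static long long
effective_deadline(long long now, long long vdev_timeout,
    long long zpool_timeout, long long autotimeout)
{
        long long t;

        /* per-VDEV override first, then the pool-wide default */
        t = (vdev_timeout != TIMEOUT_DEFAULT) ? vdev_timeout : zpool_timeout;
        if (t == TIMEOUT_DEFAULT)
                return (TIMEOUT_FOREVER);       /* today's behaviour */
        if (t == TIMEOUT_AUTO)
                t = autotimeout;                /* FMA-derived best guess */
        return (now + t);
}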

Unfortunately, I'm not enough of a programmer to implement this.

--Joe
* with the -1 magic, it would be a little more complicated calculation.


Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-25 Thread Bob Friesenhahn
On Tue, 25 Nov 2008, Ross Smith wrote:

 Good to hear there's work going on to address this.

 What did you guys think to my idea of ZFS supporting a waiting for a
 response status for disks as an interim solution that allows the pool
 to continue operation while it's waiting for FMA or the driver to
 fault the drive?

A stable and sane system never comes with two brains.  It is wrong 
to put this sort of logic into ZFS when ZFS is already depending on 
FMA to make the decisions and Solaris already has an infrastructure to 
handle faults.  The more appropriate solution is that this feature 
should be in FMA.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/



Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-25 Thread Richard Elling
Scara Maccai wrote:
 Oh, and regarding the original post -- as several
 readers correctly
 surmised, we weren't faking anything, we just didn't
 want to wait
 for all the device timeouts.  Because the disks were
 on USB, which
 is a hotplug-capable bus, unplugging the dead disk
 generated an
 interrupt that bypassed the timeout.  We could have
 waited it out,
 but 60 seconds is an eternity on stage.
 

 I'm sorry, I didn't mean to sound offensive. Anyway I think that people 
 should know that their drives can stuck the system for minutes, despite 
 ZFS. I mean: there are a lot of writings about how ZFS is great for recovery 
 in case a drive fails, but there's nothing regarding this problem. I know now 
 it's not ZFS fault; but I wonder how many people set up their drives with ZFS 
 assuming that as soon as something goes bad, ZFS will fix it. 
 Is there any way to test these cases other than smashing the drive with a 
 hammer? Having a failover policy where the failover can't be tested sounds 
 scary...
   

It is with this idea in mind that I wrote part of Chapter 1 of the book
Designing Enterprise Solutions with Sun Cluster 3.0.  For convenience,
I also published chapter 1 as a Sun BluePrint Online article.
http://www.sun.com/blueprints/1101/clstrcomplex.pdf
False positives are very expensive in highly available systems, so we
really do want to avoid them.

One thing that we can do, and I've already (again[1]) started down the path
to document, is to show where and how the various (common) timeouts
are in the system.  Once you know how sd, cmdk, dbus, and friends work
you can make better decisions on where to look when the behaviour is not
as you expect.  But this is a very tedious path because there are so many
different failure modes and real-world devices can react ambiguously
when they fail.

[1] we developed a method to benchmark cluster dependability. The
description of the benchmark was published in several papers, but is
now available in the new IEEE book on Dependability Benchmarking.
This is really the first book of its kind and the first steps toward making
dependability benchmarks more mainstream. Anyway, the work done
for that effort included methods to improve failure detection and handling,
so we have a detailed understanding of those things for SPARC, in lab
form.  Expanding that work to cover the random-device-bought-at-Frys
will be a substantial undertaking.  Co-conspirators welcome.
 -- richard



Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-25 Thread Nicolas Williams
On Tue, Nov 25, 2008 at 11:55:17AM +0100, [EMAIL PROTECTED] wrote:
 My idea is simply to allow the pool to continue operation while
 waiting for the drive to fault, even if that's a faulty write.  It
 just means that the rest of the operations (reads and writes) can keep
 working for the minute (or three) it takes for FMA and the rest of the
 chain to flag a device as faulty.
 
 Except when you're writing a lot; 3 minutes can cause a 20GB backlog
 for a single disk.

If we're talking isolated, or even clumped-but-relatively-few bad
sectors, then having a short timeout for writes and remapping
should be possible to do without running out of memory to cache
those writes.  But...

...writes to bad sectors will happen when txgs flush, and depending on
how bad sector remapping is done (say, by picking a new block address
and changing the blkptrs that referred to the old one) that might mean
redoing large chunks of the txg in the next one, which might mean that
fsync() could be delayed an additional 5 seconds or so.  And even if
that's not the case, writes to mirrors are supposed to be synchronous,
so one would think that bad block remapping should be synchronous also,
thus there must be a delay on writes to bad blocks no matter what --
though that delay could be tuned to be no more than a few seconds.

That points to a possibly decent heuristic on writes: vdev-level
timeouts that result in bad block remapping, but if the queue of
outstanding bad block remappings grows too large - treat the disk
as faulted and degrade the pool.
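
In outline the heuristic is trivial; all the names below are invented,
and the hard parts (the actual remapping, and the cross-vdev
coordination discussed next) are waved away:

/*
 * Sketch of the write-timeout heuristic: remap and queue on a timeout,
 * fault the device once too many remaps are outstanding.
 */
#define REMAP_QUEUE_LIMIT       64      /* arbitrary threshold */

typedef struct vdev_remap_state {
        unsigned int    vr_outstanding; /* remaps not yet resolved */
        int             vr_faulted;
} vdev_remap_state_t;

static void
write_timed_out(vdev_remap_state_t *vr)
{
        if (vr->vr_outstanding >= REMAP_QUEUE_LIMIT) {
                vr->vr_faulted = 1;     /* degrade the pool, bring in a spare */
                return;
        }
        vr->vr_outstanding++;           /* pick a new block address, fix up
                                           the blkptrs, reissue the write */
}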

Sounds simple, but it needs to be combined at a higher layer with
information from other vdevs.  Unplugging a whole jbod shouldn't
necessarily fault all the vdevs on it -- perhaps it should cause
pool operation to pause until the jbod is plugged back in... which
should then cause those outstanding bad block remappings to be
rolled back since they weren't bad blocks after all.

That's a lot of fault detection and handling logic across many layers.

Incidentally, cables do fall out, or, rather, get pulled out
accidentally.  What should be the failure mode of a jbod disappearing
due to a pulled cable (or power supply failure)?  A pause in operation
(hangs)?  Or faulting of all affected vdevs, and if you're mirrored
across different jbods, incurring the need to re-silver later, with
degraded operation for hours on end?  I bet answers will vary.  The best
answer is to provide enough redundancy (multiple power supplies,
multi-pathing, ...) to make such situations less likely, but that's not
a complete answer.

Nico


Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-25 Thread Ross Smith
I disagree Bob, I think this is a very different function to that
which FMA provides.

As far as I know, FMA doesn't have access to the big picture of pool
configuration that ZFS has, so why shouldn't ZFS use that information
to increase the reliability of the pool while still using FMA to
handle device failures?

The flip side of the argument is that ZFS already checks the data
returned by the hardware.  You might as well say that FMA should deal
with that too since it's responsible for all hardware failures.

The role of ZFS is to manage the pool, availability should be part and
parcel of that.


On Tue, Nov 25, 2008 at 3:57 PM, Bob Friesenhahn
[EMAIL PROTECTED] wrote:
 On Tue, 25 Nov 2008, Ross Smith wrote:

 Good to hear there's work going on to address this.

 What did you guys think to my idea of ZFS supporting a waiting for a
 response status for disks as an interim solution that allows the pool
 to continue operation while it's waiting for FMA or the driver to
 fault the drive?

 A stable and sane system never comes with two brains.  It is wrong to put
 this sort of logic into ZFS when ZFS is already depending on FMA to make the
 decisions and Solaris already has an infrastructure to handle faults.  The
 more appropriate solution is that this feature should be in FMA.

 Bob
 ==
 Bob Friesenhahn
 [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer,http://www.GraphicsMagick.org/




Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-25 Thread Bob Friesenhahn
On Tue, 25 Nov 2008, Ross Smith wrote:

 I disagree Bob, I think this is a very different function to that
 which FMA provides.

 As far as I know, FMA doesn't have access to the big picture of pool
 configuration that ZFS has, so why shouldn't ZFS use that information
 to increase the reliability of the pool while still using FMA to
 handle device failures?

If FMA does not currently have knowledge of the redundancy model but 
needs it to make well-informed decisions, then it should be updated to 
incorporate this information.

FMA sees all the hardware in the system, including devices used for 
UFS and other types of filesystems, and even tape devices.  It is able 
to see hardware at a much more detailed level than ZFS does.  ZFS only 
sees an abstracted level of the hardware.  If a HBA or part of the 
backplane fails, FMA should be able to determine the failing area (at 
least as far out as it can see based on available paths) whereas all 
ZFS knows is that it is having difficulty getting there from here.

 The flip side of the argument is that ZFS already checks the data
 returned by the hardware.  You might as well say that FMA should deal
 with that too since it's responsible for all hardware failures.

If bad data is returned, then I assume that there is a peg to FMA's 
error statistics counters.

 The role of ZFS is to manage the pool, availability should be part and
 parcel of that.

Too much complexity tends to clog up the works and keep other areas of 
ZFS from being enhanced expediently.  ZFS would soon become a chunk of 
source code that no mortal could understand, and as such it would be 
put under maintenance with no more hope of moving forward and no 
ability to address new requirements.

A rational system really does not want to have multiple brains. 
Otherwise some parts of the system will think that the device is fine 
while other parts believe that it has failed. None of us want to deal 
with an insane system like that.  There is also the matter of fault 
isolation.  If a drive can not be reached, is it because the drive 
failed, or because a HBA supporting multiple drives failed, or a cable 
got pulled?  This sort of information is extremely important for large 
reliable systems.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/



Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-25 Thread Eric Schrock
It's hard to tell exactly what you are asking for, but this sounds
similar to how ZFS already works.  If ZFS decides that a device is
pathologically broken (as evidenced by vdev_probe() failure), it knows
that FMA will come back and diagnose the drive as faulty (because we
generate a probe_failure ereport).  So ZFS pre-emptively short circuits
all I/O and treats the drive as faulted, even though the diagnosis
hasn't come back yet.  We can only do this for errors that have a 1:1
correspondence with faults.
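
In pseudo-C the short circuit is just the following; this is an
illustration, not the actual vdev.c code, and the helper names are
placeholders:

/* Placeholder helpers standing in for the real ereport/ZIO machinery. */
static void post_probe_failure_ereport(void) { /* tell FMA what happened */ }
static void fail_pending_io(void)            { /* error out queued I/O */ }

static void
handle_probe_failure(int *dev_faulted)
{
        /*
         * The ereport lets FMA do its diagnosis later; we don't wait
         * for it, because a probe failure maps 1:1 onto a fault.
         */
        post_probe_failure_ereport();

        *dev_faulted = 1;       /* treat the device as faulted right now */
        fail_pending_io();      /* stop issuing and retrying I/O against it */
}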

- Eric

On Tue, Nov 25, 2008 at 04:10:13PM +, Ross Smith wrote:
 I disagree Bob, I think this is a very different function to that
 which FMA provides.
 
 As far as I know, FMA doesn't have access to the big picture of pool
 configuration that ZFS has, so why shouldn't ZFS use that information
 to increase the reliability of the pool while still using FMA to
 handle device failures?
 
 The flip side of the argument is that ZFS already checks the data
 returned by the hardware.  You might as well say that FMA should deal
 with that too since it's responsible for all hardware failures.
 
 The role of ZFS is to manage the pool, availability should be part and
 parcel of that.
 
 
 On Tue, Nov 25, 2008 at 3:57 PM, Bob Friesenhahn
 [EMAIL PROTECTED] wrote:
  On Tue, 25 Nov 2008, Ross Smith wrote:
 
  Good to hear there's work going on to address this.
 
  What did you guys think to my idea of ZFS supporting a waiting for a
  response status for disks as an interim solution that allows the pool
  to continue operation while it's waiting for FMA or the driver to
  fault the drive?
 
  A stable and sane system never comes with two brains.  It is wrong to put
  this sort of logic into ZFS when ZFS is already depending on FMA to make the
  decisions and Solaris already has an infrastructure to handle faults.  The
  more appropriate solution is that this feature should be in FMA.
 
  Bob
  ==
  Bob Friesenhahn
  [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
  GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
 
 

--
Eric Schrock, Fishworks            http://blogs.sun.com/eschrock


Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-24 Thread Scara Maccai
 Why would it be assumed to be a bug in Solaris? Seems
 more likely on  
 balance to be a problem in the error reporting path
 or a controller/ 
 firmware weakness.

Weird: they would use a controller/firmware that doesn't work? Bad call...

 I'm pretty sure the first 2 versions of this demo I
 saw were executed  
 perfectly - and in a packed auditorium (Moscow? and
 Russians are the  
 toughest crowd). No smoke, no mirrors.

Still don't understand why even the one on http://www.opensolaris.com/, "ZFS – 
A Smashing Hit", doesn't show the app running at the moment the HD is 
smashed... weird...


Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-24 Thread Toby Thain

On 24-Nov-08, at 10:40 AM, Scara Maccai wrote:

 Why would it be assumed to be a bug in Solaris? Seems
 more likely on
 balance to be a problem in the error reporting path
 or a controller/
 firmware weakness.

 Weird: they would use a controller/firmware that doesn't work? Bad  
 call...


Seems to me, a sledgehammer would produce fairly random failure  
modes. How would you pre-test?!

--T


 I'm pretty sure the first 2 versions of this demo I
 saw were executed
 perfectly - and in a packed auditorium (Moscow? and
 Russians are the
 toughest crowd). No smoke, no mirrors.

 Still don't understand why even the one on http:// 
 www.opensolaris.com/, ZFS – A Smashing Hit, doesn't show the app  
 running in the moment the HD is smashed... weird...


Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-24 Thread Will Murnane
On Mon, Nov 24, 2008 at 10:40, Scara Maccai [EMAIL PROTECTED] wrote:
 Still don't understand why even the one on http://www.opensolaris.com/, ZFS 
 – A Smashing Hit, doesn't show the app running in the moment the HD is 
 smashed... weird...
ZFS is primarily about protecting your data: correctness, at the
expense of everything else if necessary.  It happens to be very fast
under most circumstances, but if a disk vanishes like a sledgehammer
hit it, ZFS will wait on the device driver to decide it's dead.
Device drivers are generally the same way, choosing correctness over
speed.  Thus, ZFS can take a while to notice that a disk is gone and
do something about it---but in the meantime, it won't make any
promises it can't keep.

This is to be regarded as a Good Thing.  If a disk fails and ZFS
throws away all of my data as a result I'm not going to be happy; if a
disk fails and ZFS takes 30 seconds to notice I'm still happy with
that.

That said, there have been several threads about wanting configurable
device timeouts handled at the ZFS level rather than the device driver
level.  Perhaps this will be implemented at some point... but in the
meantime I prefer correctness to availability.

Will


Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-24 Thread C. Bergström
Will Murnane wrote:
 On Mon, Nov 24, 2008 at 10:40, Scara Maccai [EMAIL PROTECTED] wrote:
   
 Still don't understand why even the one on http://www.opensolaris.com/, ZFS
 – A Smashing Hit, doesn't show the app running at the moment the HD is
 smashed... weird...
 
Sorry this is OT, but is it just me or does it only seem proper to have 
Gallagher do this? ;)

./C


Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-24 Thread Scara Maccai
 if a disk vanishes like a sledgehammer hit it, ZFS will wait on the
 device driver to decide it's dead.

OK I see it.

 That said, there have been several threads about wanting configurable
 device timeouts handled at the ZFS level rather than the device driver
 level.

Uh, so I can configure timeouts at the device level? I didn't know that.


Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-24 Thread Moore, Joe
C. Bergström wrote:
 Will Murnane wrote:
  On Mon, Nov 24, 2008 at 10:40, Scara Maccai [EMAIL PROTECTED] wrote:

  Still don't understand why even the one on http://www.opensolaris.com/,
  ZFS - A Smashing Hit, doesn't show the app running at the moment the HD
  is smashed... weird...

 Sorry this is OT, but is it just me or does it only seem proper to have
 Gallagher do this? ;)

Absolutely not.  Under no circumstances should you attempt to create a striped 
ZFS pool on a watermelon, nor on any other type of epigynous berry.

If you try, you will certainly rind up with a mess, if not a core dump.  And 
let me tell you, that's the pits.

--Joe


Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-24 Thread Miles Nordin
 tt == Toby Thain [EMAIL PROTECTED] writes:

tt Why would it be assumed to be a bug in Solaris? Seems more
tt likely on balance to be a problem in the error reporting path
tt or a controller/ firmware weakness.

It's not really an assumption.  It's been discussed in here a lot, and
we know why it's happening.  It's just a case of ``it's a feature not
a bug'' combined with ``somebody else's problem.''

The error-reporting path you mention is inside Solaris, so I have a
little trouble decoding your statement.

I wish drives had a failure-aware QoS with a split queue for
aggressive-retry cdb's and deadline cdb's.  This would make the
B_FAILFAST primitive the Solaris developers seem to believe in
actually mean something.

Solaris is supposed to have a B_FAILFAST option for block I/O that ZFS
could start using to capture vdev-level knowledge like ``don't try too
hard to read this block from one device, because we can get it faster
by asking another device.''  In the real world B_FAILFAST is IMO quite
silly and, while not exactly useless, at best deceptive to the
higher-layer developer, because even IF the drive could be told to
fail faster than 30 seconds by some future fancier sd driver, there
would still be some fail-slow cdbs hitting the drive, and the
two can't be parallelized.  Sending a fail-slow cdb to a drive
freezes the drive for up to 30 seconds * n, where n is the
multiplier of some cargo-cult state machine built into the host
adapter driver involving ``bus resets'' and other such stuff.  All the
B_FAILFAST cdbs queued behind the fail-slow may as well forget
the flag because the drive's busy with the slow cdb.  If you
have a very few of these retryable cdbs peppered into your
transaction stream, which are expected to take 10 - 100ms each but
actually take one or two MINUTES each, the drive will be so slow it'd
be more expressive to mark it dead.  What will probably happen in
$REALITY is, the sysadmin will declare his machine ``frozen without a
panic message'' and reboot it, losing any write-cached data which, if
not for this idiocy, could have been committed to other drives in a
redundant vdev, as well as rebooting the rest of the system unrelated
to this stuck zpool.
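
To put rough numbers on that head-of-line blocking (the timeout, the
reset multiplier, and the queue depth below are invented, not measured):

/* Back-of-the-envelope look at B_FAILFAST cdbs stuck behind one
 * fail-slow cdb.  Every figure here is made up for the example. */
#include <stdio.h>

int main(void)
{
    int timeout_s   = 30;   /* one retry pass on the stuck cdb        */
    int resets      = 4;    /* turns of the reset/retry state machine */
    int queued_fast = 8;    /* B_FAILFAST cdbs waiting behind it      */
    int expected_ms = 100;  /* what the caller thought each would take */

    int stall_s = timeout_s * resets;

    printf("stuck cdb holds the drive for ~%d s\n", stall_s);
    printf("each of the %d fail-fast cdbs expected ~%d ms but waits "
           "at least %d s\n", queued_fast, expected_ms, stall_s);
    printf("slowdown factor: ~%dx\n", stall_s * 1000 / expected_ms);
    return 0;
}

Even with modest numbers the ``fast'' cdbs come back three orders of
magnitude later than the caller expected, which is the point.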

However, it's inappropriate for a driver to actually report ``drive
dead'' in this scenario, because the drive is NOT dead.  The
drive-failure-statistic papers posted in here say that drives usually
fail with a bunch of contiguous or clumped-together unreadable
sectors.  You can still get most of the data off them with dd_rescue
or 'dd if=baddrive of=gooddrive bs=512 conv=noerror,sync', if you wait
about a week.  About four hours of that week is spent copying data and
the rest spent aggressively ``retrying''.

An instantaneous manual command, ``I insist this drive is failed.
Mark it failed instantly, without leaving me stuck in bogus state
machines for two minutes or two hours,'' would be a huge improvement,
but I think graceful automatic behavior is not too much to wish for
because this isn't some strange circumstance.  This is *the way drives
usually fail*.

SCSI drives have all kinds of retry-tuning in the ``mode pages'' in a
standardized format.  Even 5.25-inch 10MB SCSI drives had these pages.
One of NetApp's papers said they don't even let their SCSI/FC drives
do their own bad-block reallocation.  They do all that in host
software.  So there are a lot of secret tuning knobs, and they're AIUI
largely standardized across manufacturers and through the years.  ATA
drives, AIUI, don't have the pages, but some WD gamer drives have some
goofy DOS RAID-tuner tool.

But even what SCSI drives offer isn't enough to provide the
architecture ZFS seems to dream of.  What's really needed to provide
ZFS developers' expectations of B_FAILFAST is QoS inside the drive
firmware.  Drives need to have split queues, with an aggressive-retry
queue and a deadline-service queue.  While retrying a stuck
cdb in the aggressive queue, they must keep servicing the
deadline queue.  I've never heard of anything like this existing in a
real drive.  I think it's beyond the programming skill of an
electrical engineer, and it may be too constraining for them because
drives seem to do spastic head-seeks and sometimes partway spin
themselves down and back up during a retry cycle.
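
If it helps to picture the split-queue idea, here is a toy model of it
(no real firmware does this as far as I know, and the tick counts and
cdb numbers are invented):

/* Toy model of a drive with two command queues: a deadline queue that
 * keeps being serviced while one command from the aggressive-retry
 * queue is stuck retrying.  Purely illustrative. */
#include <stdio.h>

#define NCMDS 5

int main(void)
{
    int retry_budget = 3;             /* ticks spent retrying stuck cdb 99 */
    int deadline_queue[NCMDS] = { 11, 12, 13, 14, 15 };
    int next = 0;
    int tick;

    for (tick = 0; tick < 6; tick++) {
        if (retry_budget > 0)
            printf("tick %d: retry queue still chewing on cdb 99 "
                   "(%d retries left)\n", tick, retry_budget--);
        else
            printf("tick %d: retry queue idle, cdb 99 finally reported bad\n",
                   tick);

        /* The whole point: deadline work is not blocked behind the
         * retrying command. */
        if (next < NCMDS)
            printf("tick %d: deadline queue completes cdb %d\n",
                   tick, deadline_queue[next++]);
    }
    return 0;
}

That second queue draining on every tick is what B_FAILFAST would need
from the hardware to really mean what its name promises.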

ZFS still seems to have this taxonomic-arcania view of drives that
they are ``failing operations'' or the drive itself is ``failed''.  It
belongs to the driver's realm to decide whether it's the whole drive
or just the ``operation'' which is failing, because that's how the
square peg fits snugly into its square hole.

One of the NetApp papers mentions they have proprietary statistical
heuristics for when to ignore a drive for a little while and use
redundant drives instead, and when to fail a drive and call
autosupport.  And they log drive behavior really explicitly and
unambiguously separate from ``controller'' failure, which is why
they're able to write the paper at all.  I'm in 

Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-24 Thread Toby Thain

On 24-Nov-08, at 3:49 PM, Miles Nordin wrote:

 tt == Toby Thain [EMAIL PROTECTED] writes:

 tt Why would it be assumed to be a bug in Solaris? Seems more
 tt likely on balance to be a problem in the error reporting path
 tt or a controller/ firmware weakness.

 It's not really an assumption.  It's been discussed in here a lot, and
 we know why it's happening.  It's just a case of ``it's a feature not
 a bug'' combined with ``somebody else's problem.''

 The error-reporting path you mention is inside Solaris, so I have a
 little trouble decoding your statement.



Not all of it is!

I don't see how anyone could confidently correlate behaviour after  
sledgehammer impact with a specific fault in Solaris, without doing  
a lot more investigation than watching a YouTube video. Perhaps  
this has already been narrowed down to a specific root cause within  
Solaris - I just didn't see enough data in the OP's post to indicate  
that.

But I bow to your far more extensive experience...

--Toby



Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-24 Thread Richard Elling
Toby Thain wrote:
 On 24-Nov-08, at 3:49 PM, Miles Nordin wrote:

   
 tt == Toby Thain [EMAIL PROTECTED] writes:
   
 tt Why would it be assumed to be a bug in Solaris? Seems more
 tt likely on balance to be a problem in the error reporting path
 tt or a controller/ firmware weakness.

 It's not really an assumption.  It's been discussed in here a lot, and
 we know why it's happening.  It's just a case of ``it's a feature not
 a bug'' combined with ``somebody else's problem.''

 The error-reporting path you mention is inside Solaris, so I have a
 little trouble decoding your statement.

 


 Not all of it is!

 I don't see how anyone could confidently correlate behaviour after  
 sledgehammer impact with a specific fault in Solaris, without doing  
 a lot more investigation than watching a YouTube video. Perhaps  
 this has already been narrowed down to a specific root cause within  
 Solaris - I just didn't see enough data in the OP's post to indicate  
 that.
   

We could add strain sensors to disk drives which, when the strain
was suddenly too great, would register an ASC/ASCQ 75/00 DEVICE
WAS HIT BY A HAMMER and then we could add the e-report to sd
and then register with a io-hammer-event FMA diagnosis engine
which would be registered to ZFS to offline the device :-)

But seriously, it really does depend on the failure mode of the device
and I'm not sure people have studied the hammer case very closely.
In the worst case, the device would be selectable, but not responding
to data requests, which would lead through the device retry logic and can
take minutes.  If the (USB) device simply disappeared, it would be
indistinguishable from a hot-plug event and that logic would take over,
which results in a faster diagnosis.  I suppose it will depend on the
device and your aim.
 -- richard



Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-24 Thread Scara Maccai
 In the worst case, the device would be selectable, but not responding
 to data requests, which would lead through the device retry logic and
 can take minutes.

That's what I didn't know: that a driver could take minutes (hours???) to
decide that a device is not working anymore.
That raises another question: how can one assume that a drive failure won't
take one hour to be acknowledged by the driver? That is: what good is a
failover strategy if it takes one hour to start? I'm grateful that the system
doesn't write until it knows what is going on, but that can't take that long.


Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-24 Thread Richard Elling
Scara Maccai wrote:
 In the worst case, the device would be selectable, but not responding
 to data requests, which would lead through the device retry logic and
 can take minutes.
 

 that's what I didn't know: that a driver could take minutes (hours???) to 
 decide that a device is not working anymore.
   

For Solaris's sd driver there are, by default, 60-second timeouts with 5
retries; for the ssd driver, 3 retries.  But sometimes additional tests are
made to try to verify that the disk is really not working properly, which
will cause more of these.  Again, it depends on the failure mode.
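
So, very roughly, the worst case for a single command using those
defaults works out like this (just arithmetic on the numbers above; the
real sequence depends on the failure mode and any extra probing):

/* Rough worst-case wait for one command from the defaults quoted above:
 * 60 s timeout, 5 retries for sd, 3 for ssd.  Real behaviour varies. */
#include <stdio.h>

int main(void)
{
    int timeout_s = 60;

    printf("sd : %d s x (1 + %d retries) = %d s per command\n",
           timeout_s, 5, timeout_s * (1 + 5));
    printf("ssd: %d s x (1 + %d retries) = %d s per command\n",
           timeout_s, 3, timeout_s * (1 + 3));
    return 0;
}

That is on the order of 4 to 6 minutes for one command before anything
above the driver gets to react.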

 That raises another question: how can one assume that a drive failure won't 
 take one hour to be acknowledged by the driver? That is: what good is a 
 failover strategy if it takes one hour to start? I'm grateful that the system 
 doesn't write until it knows what is going on, but that can't take that long.
   

AFAIK, there are no cases where the timeouts would result in an hour
delay before making a decision.  Usually, the policy is made in advance,
as in the zpool failmode property.
 -- richard




Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-24 Thread Ross
But that's exactly the problem, Richard: AFAIK.

Can you state, absolutely and categorically, that there is no failure mode out
there (caused by hardware faults or bad drivers) that will lock a drive up
for hours?  You can't, obviously, which is why we keep saying that ZFS should
have this kind of timeout feature.

For once I agree with Miles; I think he's written a really good writeup of the 
problem here.  My simple view on it would be this:

Drives are only aware of themselves as individual entities.  Their job is to
save & restore data to themselves, and drivers are written to minimise any
chance of data loss.  So when a drive starts to fail, it makes complete sense
for the driver and hardware to be very, very thorough about trying to read or
write that data, and to only fail as a last resort.

I'm not at all surprised that drives take 30 seconds to time out, nor that they
could slow a pool for hours.  That's their job.  They know nothing else about
the storage; they just have to do their level best to do as they're told, and
will only fail if they absolutely can't store the data.

The RAID controller, on the other hand (NetApp / ZFS, etc.), knows all about the 
pool.  It knows if you have half a dozen good drives online, it knows if there 
are hot spares available, and it *should* also know how quickly the drives 
under its care usually respond to requests.

ZFS is perfectly placed to spot when a drive is starting to fail, and to take 
the appropriate action to safeguard your data.  It has far more information 
available than a single drive ever will, and should be designed accordingly.

Expecting the firmware and drivers of individual drives to control the failure 
modes of your redundant pool is just crazy imo.  You're throwing away some of 
the biggest benefits of using multiple drives in the first place.
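
As a straw man, the sort of check ZFS could run with information it
already has might look like this (the 10x threshold, the sample
latencies, and the device names are pulled out of thin air):

/* Straw-man latency-outlier check a pool could run over its own devices:
 * compare each device's recent average service time with the median of
 * its peers and flag anything grossly slower.  Threshold and sample
 * numbers are invented. */
#include <stdio.h>
#include <stdlib.h>

#define NDEV 4

static int cmp_double(const void *a, const void *b)
{
    double d = *(const double *)a - *(const double *)b;
    return (d > 0) - (d < 0);
}

int main(void)
{
    const char *dev[NDEV] = { "c1t0d0", "c1t1d0", "c1t2d0", "c1t3d0" };
    double avg_ms[NDEV]   = { 8.2, 9.1, 7.9, 31000.0 };  /* recent averages */
    double sorted[NDEV];
    double median, threshold = 10.0;
    int i;

    for (i = 0; i < NDEV; i++)
        sorted[i] = avg_ms[i];
    qsort(sorted, NDEV, sizeof (double), cmp_double);
    median = (sorted[NDEV / 2 - 1] + sorted[NDEV / 2]) / 2.0;

    for (i = 0; i < NDEV; i++) {
        if (avg_ms[i] > threshold * median)
            printf("%s: %.1f ms vs pool median %.1f ms: suspect; prefer "
                   "other mirrors, consider a spare\n",
                   dev[i], avg_ms[i], median);
        else
            printf("%s: %.1f ms: ok\n", dev[i], avg_ms[i]);
    }
    return 0;
}

Nothing clever, but it is exactly the cross-device view that no single
drive or driver can ever have.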