es == Eric Schrock [EMAIL PROTECTED] writes:
es Are you running your experiments on build 101 or later?
no.
Aside from that quick one for copies=2, I'm pretty bad about running
well-designed experiments, and I do have old builds. I need to buy
more hardware.
It's hard to know how to get
rs == Ross Smith [EMAIL PROTECTED] writes:
nw == Nicolas Williams [EMAIL PROTECTED] writes:
rs I disagree Bob, I think this is a very different function to
rs that which FMA provides.
I see two problems.
(1) FMA doesn't seem to work very well, and was used as an excuse to keep
On Wed, 26 Nov 2008, Miles Nordin wrote:
(2) The FMA model of collecting telemetry, taking it into
user-space, chin-strokingly contemplating it for a while, then
decreeing a diagnosis, is actually a rather limited one. I can
think of two kinds of limit:
(a) you're
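To make the point concrete, here is a toy simulation of that collect-then-diagnose model (not real FMA code; the threshold and counts are invented for the sketch). It shows the window during which I/O keeps being issued to a sick device while the user-space diagnosis is still making up its mind:

/*
 * Toy illustration (not real FMA code): error telemetry is collected, and
 * only after enough reports accumulate does the "diagnosis engine" decree
 * a fault.  Until then, the I/O path keeps sending requests to the sick
 * device.  All thresholds here are invented for the sketch.
 */
#include <stdio.h>

#define DIAGNOSIS_THRESHOLD 10   /* hypothetical: ereports needed to fault */

int main(void)
{
    int ereports = 0;
    int faulted = 0;

    for (int io = 1; io <= 20; io++) {
        if (faulted) {
            printf("I/O %2d: routed away from faulted device\n", io);
            continue;
        }
        /* every request to the sick device fails and emits telemetry */
        ereports++;
        printf("I/O %2d: sent to sick device, failed (ereports=%d)\n",
            io, ereports);

        /* the user-space diagnosis only reacts once the threshold is hit */
        if (ereports >= DIAGNOSIS_THRESHOLD) {
            faulted = 1;
            printf("        diagnosis engine decrees: device faulted\n");
        }
    }
    return 0;
}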
On Wed, Nov 26, 2008 at 07:02:11PM -0500, Miles Nordin wrote:
(2) The FMA model of collecting telemetry, taking it into
user-space, chin-strokingly contemplating it for a while, then
decreeing a diagnosis, is actually a rather limited one. I can
think of two kinds of limit:
I think we (the ZFS team) all generally agree with you. The current
Nevada code is much better at handling device failures than it was
just a few months ago. And there are additional changes that were
made for the FishWorks (a.k.a. Amber Road, a.k.a. Sun Storage 7000)
product line that will make
Hey Jeff,
Good to hear there's work going on to address this.
What did you guys think to my idea of ZFS supporting a "waiting for a
response" status for disks as an interim solution that allows the pool
to continue operation while it's waiting for FMA or the driver to
fault the drive?
I do
PS. I think this also gives you a chance at making the whole problem
much simpler. Instead of the hard question of "is this faulty?",
you're just trying to say "is it working right now?".
In fact, I'm now wondering if the "waiting for a response" flag
wouldn't be better as "possibly faulty". That way
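A minimal sketch of how such an interim state might behave, assuming a two-way mirror and an invented per-device flag (none of these names exist in ZFS; this is only the idea, not an implementation):

/*
 * Sketch only: invented names, not ZFS code.  A device that has not
 * answered within a short window is marked POSSIBLY_FAULTY; reads are
 * served from the other mirror half while FMA/the driver make the real
 * faulted-or-not decision in the background.
 */
#include <stdio.h>

typedef enum {
    DEV_HEALTHY,
    DEV_POSSIBLY_FAULTY,   /* "waiting for a response" */
    DEV_FAULTED            /* final verdict from FMA/the driver */
} dev_state_t;

typedef struct {
    const char *name;
    dev_state_t state;
    int         pending_ms;  /* how long the oldest request has waited */
} device_t;

#define SUSPECT_AFTER_MS 2000   /* hypothetical, would be tunable */

static void check_responsiveness(device_t *d)
{
    if (d->state == DEV_HEALTHY && d->pending_ms > SUSPECT_AFTER_MS) {
        d->state = DEV_POSSIBLY_FAULTY;
        printf("%s: no response for %d ms, marking possibly faulty\n",
            d->name, d->pending_ms);
    }
}

/* read from whichever mirror half is not under suspicion */
static void mirror_read(device_t *a, device_t *b, int block)
{
    device_t *pick = (a->state == DEV_HEALTHY) ? a : b;
    printf("read block %d from %s\n", block, pick->name);
}

int main(void)
{
    device_t d0 = { "disk0", DEV_HEALTHY, 0 };
    device_t d1 = { "disk1", DEV_HEALTHY, 5000 };  /* stopped answering */

    check_responsiveness(&d0);
    check_responsiveness(&d1);

    for (int blk = 0; blk < 3; blk++)
        mirror_read(&d1, &d0, blk);   /* the pool keeps working */

    return 0;
}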
No, I count that as "doesn't return data ok", but my post wasn't very
clear at all on that.
Even for a write, the disk will return something to indicate that the
action has completed, so that can also be covered by just those two
scenarios, and right now ZFS can lock the whole pool up if it's
My idea is simply to allow the pool to continue operation while
waiting for the drive to fault, even if that's a faulty write. It
just means that the rest of the operations (reads and writes) can keep
working for the minute (or three) it takes for FMA and the rest of the
chain to flag a device
My justification for this is that it seems to me that you can split
disk behavior into two states:
- returns data ok
- doesn't return data ok
I think you're missing "won't write".
There's clearly a difference between "get data from a different copy",
which you can fix, but retrying data to a
Hmm, true. The idea doesn't work so well if you have a lot of writes,
so there needs to be some thought as to how you handle that.
Just thinking aloud, could the missing writes be written to the log
file on the rest of the pool? Or temporarily stored somewhere else in
the pool? Would it be an
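On the "store the missing writes somewhere else" thought, here is a toy sketch of deferring writes aimed at a suspect device and replaying them once it is cleared or replaced (all names invented; whether this could ever be made safe with respect to transaction groups and crash consistency is exactly the open question, and the queue-full case shows where the idea breaks down under heavy writes):

/*
 * Toy sketch, invented names: writes aimed at a suspect device are queued
 * instead of blocking the whole pool, then replayed when the device comes
 * back (or handed to its replacement after resilver).  Ignores all the
 * hard parts: transaction groups, space accounting, crash consistency.
 */
#include <stdio.h>

#define MAX_DEFERRED 64

typedef struct {
    int block;
    int data;
} deferred_write_t;

static deferred_write_t queue[MAX_DEFERRED];
static int queued;

static void write_block(int block, int data, int device_suspect)
{
    if (device_suspect) {
        if (queued < MAX_DEFERRED) {
            queue[queued].block = block;
            queue[queued].data = data;
            queued++;
            printf("deferred write to block %d (queue depth %d)\n",
                block, queued);
        } else {
            printf("deferral queue full: now we must block anyway\n");
        }
        return;
    }
    printf("wrote block %d directly\n", block);
}

static void replay_deferred(void)
{
    for (int i = 0; i < queued; i++)
        printf("replaying block %d\n", queue[i].block);
    queued = 0;
}

int main(void)
{
    write_block(10, 1, 1);
    write_block(11, 2, 1);
    write_block(12, 3, 0);
    replay_deferred();   /* device cleared or replaced */
    return 0;
}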
On 25-Nov-08, at 5:10 AM, Ross Smith wrote:
Hey Jeff,
Good to hear there's work going on to address this.
What did you guys think to my idea of ZFS supporting a "waiting for a
response" status for disks as an interim solution that allows the pool
to continue operation while it's waiting for
The shortcomings of timeouts have been discussed on this list before. How do
you tell the difference between a drive that is dead and a path that is just
highly loaded?
A path that is dead is either returning bad data, or isn't returning
anything. A highly loaded path is by definition reading
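One way to picture the distinction being argued over: a loaded path still completes requests, just slowly, while a dead one errors or never answers. A toy heuristic (invented thresholds, not a proposal for real code) makes the difficulty visible, because the silent-dead and merely-slow cases end up in the same bucket:

/*
 * Toy heuristic, invented thresholds: a loaded path completes requests
 * (slowly), a dead one returns errors or nothing at all.  Latency alone
 * cannot separate "dead" from "very busy", which is the objection above.
 */
#include <stdio.h>

typedef enum { PATH_OK, PATH_SUSPECT, PATH_DEAD } path_state_t;

static path_state_t judge_path(int completed, int hard_errors,
    int worst_latency_ms)
{
    if (hard_errors > 0)
        return PATH_DEAD;       /* returned bad status: clear-cut */
    if (completed == 0)
        return PATH_SUSPECT;    /* silent: dead, or just very busy? */
    if (worst_latency_ms > 30000)
        return PATH_SUSPECT;    /* slow, but still answering */
    return PATH_OK;
}

int main(void)
{
    /* both the loaded path and the silent path come back SUSPECT (1) */
    printf("loaded path  : %d\n", judge_path(100, 0, 45000));
    printf("silent path  : %d\n", judge_path(0, 0, 0));
    printf("erroring path: %d\n", judge_path(5, 3, 100));  /* DEAD (2) */
    return 0;
}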
Oh, and regarding the original post -- as several readers correctly
surmised, we weren't faking anything, we just didn't want to wait for
all the device timeouts. Because the disks were on USB, which is a
hotplug-capable bus, unplugging the dead disk generated an interrupt
that bypassed
Ross Smith wrote:
My justification for this is that it seems to me that you can split
disk behavior into two states:
- returns data ok
- doesn't return data ok
And for the state where it's not returning data, you can again split
that in two:
- returns wrong data
- doesn't return data
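Written out as a classification, the split being proposed looks like the sketch below (illustration only; the "won't write" case raised elsewhere in the thread is added as a fourth state, since it is the one that does not fit the read-oriented split):

/*
 * The proposed classification, as a sketch.  "Won't write" (raised
 * elsewhere in the thread) is the awkward extra case.
 */
#include <stdio.h>

typedef enum {
    RETURNS_DATA_OK,        /* normal operation */
    RETURNS_WRONG_DATA,     /* answered, but the checksum says it lied */
    RETURNS_NO_DATA,        /* no answer (yet?) */
    WONT_WRITE              /* accepts nothing: writes have nowhere to go */
} disk_behavior_t;

static const char *describe(disk_behavior_t b)
{
    switch (b) {
    case RETURNS_DATA_OK:    return "use the data";
    case RETURNS_WRONG_DATA: return "read a good copy, repair this one";
    case RETURNS_NO_DATA:    return "read a good copy, keep the pool going";
    case WONT_WRITE:         return "writes must be redirected or deferred";
    }
    return "?";
}

int main(void)
{
    for (disk_behavior_t b = RETURNS_DATA_OK; b <= WONT_WRITE; b++)
        printf("%d: %s\n", b, describe(b));
    return 0;
}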
On Tue, 25 Nov 2008, Ross Smith wrote:
Good to hear there's work going on to address this.
What did you guys think to my idea of ZFS supporting a "waiting for a
response" status for disks as an interim solution that allows the pool
to continue operation while it's waiting for FMA or the driver
Scara Maccai wrote:
Oh, and regarding the original post -- as several readers correctly
surmised, we weren't faking anything, we just didn't want to wait for
all the device timeouts. Because the disks were on USB, which is a
hotplug-capable bus, unplugging the dead disk generated an
On Tue, Nov 25, 2008 at 11:55:17AM +0100, [EMAIL PROTECTED] wrote:
My idea is simply to allow the pool to continue operation while
waiting for the drive to fault, even if that's a faulty write. It
just means that the rest of the operations (reads and writes) can keep
working for the minute
I disagree Bob, I think this is a very different function to that
which FMA provides.
As far as I know, FMA doesn't have access to the big picture of pool
configuration that ZFS has, so why shouldn't ZFS use that information
to increase the reliability of the pool while still using FMA to
handle
On Tue, 25 Nov 2008, Ross Smith wrote:
I disagree Bob, I think this is a very different function to that
which FMA provides.
As far as I know, FMA doesn't have access to the big picture of pool
configuration that ZFS has, so why shouldn't ZFS use that information
to increase the reliability
It's hard to tell exactly what you are asking for, but this sounds
similar to how ZFS already works. If ZFS decides that a device is
pathologically broken (as evidenced by vdev_probe() failure), it knows
that FMA will come back and diagnose the drive as faulty (because we
generate a probe_failure
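As a rough picture of the flow being described (a simplified sketch, not the actual ZFS or FMA code paths): the probe fails, an ereport is posted, and the fault decision arrives asynchronously from the diagnosis side.

/*
 * Simplified sketch of the flow described above; not the real ZFS or FMA
 * code.  A failed probe posts an ereport, and the "faulted" verdict comes
 * back asynchronously from user-space diagnosis.
 */
#include <stdio.h>
#include <stdbool.h>

static bool probe_device(const char *dev, bool device_answers)
{
    printf("probing %s...\n", dev);
    return device_answers;
}

static void post_ereport(const char *dev, const char *class)
{
    /* in the real system this goes to FMA; here we just log it */
    printf("ereport posted for %s: %s\n", dev, class);
}

static void fma_diagnose(const char *dev)
{
    /* stands in for the asynchronous user-space diagnosis */
    printf("FMA (eventually): %s diagnosed faulty, retiring it\n", dev);
}

int main(void)
{
    const char *dev = "c1t2d0";   /* example device name */

    if (!probe_device(dev, false)) {
        post_ereport(dev, "probe_failure");
        fma_diagnose(dev);
    }
    return 0;
}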
Why would it be assumed to be a bug in Solaris? Seems more likely on
balance to be a problem in the error reporting path or a
controller/firmware weakness.
Weird: they would use a controller/firmware that doesn't work? Bad call...
I'm pretty sure the first 2 versions of this demo I
On 24-Nov-08, at 10:40 AM, Scara Maccai wrote:
Why would it be assumed to be a bug in Solaris? Seems more likely on
balance to be a problem in the error reporting path or a
controller/firmware weakness.
Weird: they would use a controller/firmware that doesn't work? Bad call...
Seems
On Mon, Nov 24, 2008 at 10:40, Scara Maccai [EMAIL PROTECTED] wrote:
Still don't understand why even the one on http://www.opensolaris.com/,
"ZFS – A Smashing Hit", doesn't show the app running at the moment the HD
is smashed... weird...
ZFS is primarily about protecting your data: correctness,
Will Murnane wrote:
On Mon, Nov 24, 2008 at 10:40, Scara Maccai [EMAIL PROTECTED] wrote:
Still don't understand why even the one on http://www.opensolaris.com/,
"ZFS – A Smashing Hit", doesn't show the app running at the moment the HD
is smashed... weird...
Sorry this is OT, but is
if a disk vanishes like a sledgehammer hit it, ZFS will wait on the
device driver to decide it's dead.
OK I see it.
That said, there have been several threads about wanting configurable
device timeouts handled at the ZFS level rather than the device driver
level.
Uh, so I can
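A sketch of what a ZFS-level timeout might look like (the tunable and the behaviour are invented for illustration; nothing like this existed at the time): after waiting the configured interval on one device, the read is satisfied from redundancy instead of blocking the pool while the driver's own, much longer, retry logic runs.

/*
 * Sketch of the "configurable timeout at the ZFS level" idea; the tunable
 * and its behaviour are invented for illustration only.
 */
#include <stdio.h>

static int zfs_device_timeout_ms = 3000;   /* hypothetical tunable */

/* returns -1 if the device has not answered within the deadline */
static int read_with_deadline(int device_latency_ms, int deadline_ms)
{
    return (device_latency_ms <= deadline_ms) ? 0 : -1;
}

int main(void)
{
    int slow_disk_latency_ms = 90000;   /* driver retry loop in progress */

    if (read_with_deadline(slow_disk_latency_ms, zfs_device_timeout_ms) != 0)
        printf("deadline hit after %d ms: serving read from another copy\n",
            zfs_device_timeout_ms);
    else
        printf("device answered in time\n");

    return 0;
}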
C. Bergström wrote:
Will Murnane wrote:
On Mon, Nov 24, 2008 at 10:40, Scara Maccai [EMAIL PROTECTED] wrote:
Still don't understand why even the one on http://www.opensolaris.com/,
"ZFS – A Smashing Hit", doesn't show the app running at the moment the HD
is smashed... weird...
tt == Toby Thain [EMAIL PROTECTED] writes:
tt Why would it be assumed to be a bug in Solaris? Seems more
tt likely on balance to be a problem in the error reporting path
tt or a controller/firmware weakness.
It's not really an assumption. It's been discussed in here a lot, and
we
On 24-Nov-08, at 3:49 PM, Miles Nordin wrote:
tt == Toby Thain [EMAIL PROTECTED] writes:
tt Why would it be assumed to be a bug in Solaris? Seems more
tt likely on balance to be a problem in the error reporting path
tt or a controller/firmware weakness.
It's not really an
Toby Thain wrote:
On 24-Nov-08, at 3:49 PM, Miles Nordin wrote:
tt == Toby Thain [EMAIL PROTECTED] writes:
tt Why would it be assumed to be a bug in Solaris? Seems more
tt likely on balance to be a problem in the error reporting path
tt or a controller/
In the worst case, the device would be selectable, but not responding
to data requests, which would lead through the device retry logic and
can take minutes.
That's what I didn't know: that a driver could take minutes (hours???)
to decide that a device is not working anymore.
Now it comes
Scara Maccai wrote:
In the worst case, the device would be selectable, but not responding
to data requests, which would lead through the device retry logic and
can take minutes.
That's what I didn't know: that a driver could take minutes (hours???)
to decide that a device is not
But that's exactly the problem, Richard: AFAIK.
Can you state that absolutely, categorically, there is no failure mode
out there (caused by hardware faults, or bad drivers) that will lock a
drive up for hours? You can't, obviously, which is why we keep saying
that ZFS should have this kind