On Nov 24, 2009, at 2:51 PM, Daniel Carosone wrote:
Those are great, but they're about testing the zfs software.
There's a small amount of overlap, in that these injections include
trying to simulate the hoped-for system response (e.g., EIO) to
various physical scenarios, so it's worth looking at for scenario
suggestions.
However, most of us rely on Sun's (generally acknowledged as
excellent) testing of the software stack.
I suspect the OP is more interested in verifying, on his own
hardware, that physical events and problems actually lead to the
error conditions that the software fault injection scenarios
simulate. The rest of us running on
random commodity hardware have largely the same interest, because
Sun hasn't qualified the hardware parts of the stack as well. We've
taken on that responsibility ourselves (both individually, and as a
community by sharing findings).
Agree 110%.
For example, for the various kinds of failures that might happen:
* Does my particular drive/controller/chipset/bios/etc combination
notice the problem and result in the appropriate error from the
driver upwards?
* How quickly does it notice? Do I have to wait for some long
timeout or other retry cycle, and is that a problem for my usage?
* Does the rest of the system keep working to allow zfs to
recover/react, or is there some kind of follow-on failure (bus
hangs/resets, etc) that will have wider impact?
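
The "does it notice, and how quickly" questions above can be measured
rather than guessed. Below is a minimal Python sketch, not a tested
tool: timed_read is a hypothetical helper name, and the path you pass
it is your own choice. It times one read and reports the errno name
the driver stack returned, if any. It assumes a Unix-like system
(os.pread), and note that a read served from cache never reaches the
disk, so a raw device path or a cold cache is probably what you want.

```python
import errno
import os
import time


def timed_read(path, offset=0, length=4096):
    """Read `length` bytes at `offset` from `path`.

    Returns (elapsed_seconds, errno_name), where errno_name is None on
    success or a symbolic name such as "EIO" on failure.
    """
    start = time.monotonic()
    err = None
    try:
        fd = os.open(path, os.O_RDONLY)
        try:
            os.pread(fd, length, offset)
        finally:
            os.close(fd)
    except OSError as e:
        # Translate the numeric errno into its symbolic name.
        err = errno.errorcode.get(e.errno, str(e.errno))
    return time.monotonic() - start, err
```

Calling this in a loop while you pull a cable gives a rough idea of how
long the stack takes to report the failure and which error it surfaces.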
Yanking disk controller and/or power cables is an easy and obvious
test. Testing scenarios that involve things like disk firmware
behaviour in response to bad reads is harder - though apparently
yelling at them might be worthwhile :-)
The problem is that yanking a disk tests the failure mode of yanking
a disk.
If this is the sort of failure you expect to see, then perhaps you
should look at a mechanical solution. If you wish to test the failure
modes you are likely to see, then you need a more sophisticated test
rig that will emulate a device and inject the sorts of faults you
expect.
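
To illustrate the idea of an emulated device that injects faults, here
is a small Python sketch. It is only an in-process toy under stated
assumptions, not a real block-device emulator: FaultyDevice and its
parameters (eio_rate, corrupt_rate, delay) are hypothetical names
invented for this example, and the backing store is any seekable
file-like object.

```python
import errno
import io
import random
import time


class FaultyDevice:
    """Wrap a seekable file-like object and inject faults into reads."""

    def __init__(self, backing, eio_rate=0.0, corrupt_rate=0.0,
                 delay=0.0, seed=None):
        self.backing = backing
        self.eio_rate = eio_rate          # probability of a hard error
        self.corrupt_rate = corrupt_rate  # probability of silent corruption
        self.delay = delay                # seconds of added latency per read
        self.rng = random.Random(seed)    # seedable for repeatable runs

    def read(self, offset, length):
        if self.delay:
            time.sleep(self.delay)        # emulate a slow or retrying device
        if self.rng.random() < self.eio_rate:
            raise OSError(errno.EIO, "injected I/O error")
        self.backing.seek(offset)
        data = bytearray(self.backing.read(length))
        if data and self.rng.random() < self.corrupt_rate:
            # Invert one byte without reporting any error.
            i = self.rng.randrange(len(data))
            data[i] ^= 0xFF
        return bytes(data)


if __name__ == "__main__":
    dev = FaultyDevice(io.BytesIO(b"A" * 512), eio_rate=0.1, seed=1)
    for n in range(5):
        try:
            dev.read(0, 16)
            print(n, "ok")
        except OSError as e:
            print(n, errno.errorcode[e.errno])
```

The point of separating the three knobs is that hard errors, silent
corruption and long delays exercise different recovery paths, and a
yanked cable only ever gives you the first.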
Finding ways to dial up the load on your PSU (or drop voltage/limit
current to a specific device with an inline filter) might be an
idea, since overloaded power supplies seem to be implicated in
various people's reports of trouble. Finding ways to generate EMF
or "cosmic rays" to induce other kinds of failure is left as an
exercise.
Many parts of the stack have software fault injection capabilities.
Whether you do this with something like zinject or the wansimulator,
the principle is the same. For example, you could easily add
wansimulator to an iSCSI rig to inject packet corruption in the
network. You can also roll your own with DTrace, which allows you to
change the return values of any function.
The COMSTAR project has a test suite that could be leveraged, but it
does not appear to be explicitly designed to perform system tests. I'm
reasonably confident that the driver teams have test code, too, but I
would also expect them to be oriented towards unit testing. A quick
search will turn up many fault injection programs geared towards unit
testing.
Finally, there are companies that provide system-level test services.
-- richard
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss