On Nov 24, 2009, at 2:51 PM, Daniel Carosone wrote:

Those are great, but they're about testing the zfs software. There's a small amount of overlap, in that these injections include trying to simulate the hoped-for system response (e.g., EIO) to various physical scenarios, so it's worth looking at for scenario suggestions.

However, for most of us, we generally rely on Sun's (generally acknowledged as excellent) testing of the software stack.

I suspect the OP is more interested in verifying, on his own hardware, that physical events and problems will be connected to the software fault injection test scenarios. The rest of us running on random commodity hardware have largely the same interest, because Sun hasn't qualified the hardware parts of the stack as well. We've taken on that responsibility ourselves (both individually, and as a community by sharing findings).

Agree 110%.

For example, for the various kinds of failures that might happen:
* Does my particular drive/controller/chipset/bios/etc combination notice the problem and result in the appropriate error from the driver upwards?
* How quickly does it notice? Do I have to wait for some long timeout or other retry cycle, and is that a problem for my usage?
* Does the rest of the system keep working to allow zfs to recover/react, or is there some kind of follow-on failure (bus hangs/resets, etc) that will have wider impact?
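
The "how quickly does it notice" question can be measured crudely from user space. The following is a minimal Python sketch, assuming a Unix-like system where os.pread is available; timed_read is a hypothetical helper name, and the path would be whatever device node you are testing. Reads that are served from cache never reach the drive, so treat the numbers as a rough indication only.

```python
import errno
import os
import time


def timed_read(path, offset=0, size=4096):
    """Read `size` bytes at `offset`; return (elapsed seconds, errno name or None)."""
    start = time.monotonic()
    err = None
    try:
        fd = os.open(path, os.O_RDONLY)
        try:
            os.pread(fd, size, offset)
        finally:
            os.close(fd)
    except OSError as exc:
        # Record the symbolic error name (e.g. EIO) rather than raising,
        # so the caller can log how long the failure took to surface.
        err = errno.errorcode.get(exc.errno, str(exc.errno))
    return time.monotonic() - start, err
```

Calling this in a loop while pulling a cable or otherwise provoking a fault would show whether the error comes back in milliseconds or only after a long driver timeout.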

Yanking disk controller and/or power cables is an easy and obvious test. Testing scenarios that involve things like disk firmware behaviour in response to bad reads is harder - though apparently yelling at them might be worthwhile :-)

The problem is that yanking a disk tests the failure mode of yanking a disk. If this is the sort of failure you expect to see, then perhaps you should look at a mechanical solution. If you wish to test the failure modes you are likely to see, then you need a more sophisticated test rig that will emulate a device
and inject the sorts of faults you expect.

Finding ways to dial up the load on your PSU (or drop voltage/limit current to a specific device with an inline filter) might be an idea, since overloaded power supplies seem to be implicated in various people's reports of trouble. Finding ways to generate EMF or "cosmic rays" to induce other kinds of failure is left as an exercise.

Many parts of the stack have software fault injection capabilities. Whether you do this with something like zinject or the wansimulator, the principle is
the same.  For example, you could easily add wansimulator to an iSCSI
rig to inject packet corruption in the network. You can also roll your own with
DTrace, which allows you to change the return values of any function.
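
As a rough illustration of the principle, here is a Python sketch of a fault-injecting wrapper. It is not how zinject, the wansimulator, or DTrace work internally; FaultInjector and read_with_retry are made-up names, and the retry policy is only an assumed stand-in for whatever code you actually want to test.

```python
import errno
import random


class FaultInjector:
    """Wrap a read function and make a fraction of calls fail with EIO."""

    def __init__(self, read_fn, error_rate, seed=0):
        self.read_fn = read_fn
        self.error_rate = error_rate
        self.rng = random.Random(seed)  # seeded so a test run is repeatable
        self.injected = 0

    def read(self, offset, size):
        # Fail this call with probability `error_rate`, else pass it through.
        if self.rng.random() < self.error_rate:
            self.injected += 1
            raise OSError(errno.EIO, "injected I/O error")
        return self.read_fn(offset, size)


def read_with_retry(injector, offset, size, attempts=3):
    """Hypothetical caller under test: retry on EIO, then give up with None."""
    for _ in range(attempts):
        try:
            return injector.read(offset, size)
        except OSError as exc:
            if exc.errno != errno.EIO:
                raise
    return None
```

With error_rate set to 1.0 every read fails, which exercises the give-up path; with 0.0 the wrapper is transparent. The point is that the code above the injection point cannot tell an injected EIO from a real one.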

The COMSTAR project has a test suite that could be leveraged, but it does not appear to be explicitly designed to perform system tests. I'm reasonably confident that the driver teams have test code, too, but I would also expect them to be oriented towards unit testing. A quick search will turn up many
fault injection software programs geared towards unit testing.

Finally, there are companies that provide system-level test services.
 -- richard

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
