Re: [zfs-discuss] zfs-raidz - simulate disk failure

Richard Elling Wed, 25 Nov 2009 07:58:07 -0800

On Nov 24, 2009, at 2:51 PM, Daniel Carosone wrote:

Those are great, but they're about testing the zfs software. There's a small amount of overlap, in that these injections include trying to simulate the hoped-for system response (e.g., EIO) to various physical scenarios, so it's worth looking at for scenario suggestions.
However, for most of us, we generally rely on Sun's (generally acknowledged as excellent) testing of the software stack.
I suspect the OP is more interested in verifying, on his own hardware, that physical events and problems will be connected to the software fault injection test scenarios. The rest of us running on random commodity hardware have largely the same interest, because Sun hasn't qualified the hardware parts of the stack as well. We've taken on that responsibility ourselves (both individually, and as a community by sharing findings).


Agree 110%.

For example, for the various kinds of failures that might happen:
* Does my particular drive/controller/chipset/bios/etc combination notice the problem and result in the appropriate error from the driver upwards?
* How quickly does it notice? Do I have to wait for some long timeout or other retry cycle, and is that a problem for my usage?
* Does the rest of the system keep working to allow zfs to recover/react, or is there some kind of follow-on failure (bus hangs/resets, etc) that will have wider impact?
Yanking disk controller and/or power cables is an easy and obvious test. Testing scenarios that involve things like disk firmware behaviour in response to bad reads is harder - though apparently yelling at them might be worthwhile :-)
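
The "how quickly does it notice?" question above can be measured rather than guessed. Below is a minimal Python sketch, not a tested tool: it polls a read until the first OSError surfaces and reports how long that took. The names time_to_first_error and make_file_reader are hypothetical, and the path you pass in is whatever file or device you choose to probe. Two caveats under these assumptions: a read that is satisfied from cache will not touch the disk at all, and a read that hangs in the driver will block the loop, so the elapsed time is only as good as the read you give it.

```python
import time


def time_to_first_error(read_once, interval=0.1, max_wait=60.0,
                        clock=time.monotonic, sleep=time.sleep):
    """Call read_once() repeatedly until it raises OSError or max_wait passes.

    Returns (elapsed_seconds, error); error is None if no error was seen
    before max_wait elapsed.
    """
    start = clock()
    while True:
        try:
            read_once()
        except OSError as exc:
            # The stack reported the fault; record how long that took.
            return clock() - start, exc
        elapsed = clock() - start
        if elapsed >= max_wait:
            return elapsed, None
        sleep(interval)


def make_file_reader(path, offset=0, length=4096):
    """Build a read_once() callable that reads `length` bytes at `offset`.

    `path` is a hypothetical file or device chosen by the person testing.
    """
    def read_once():
        with open(path, "rb", buffering=0) as f:
            f.seek(offset)
            f.read(length)
    return read_once
```

Start the loop, pull the cable, and note the elapsed time when it returns. Running it on a second, healthy device at the same time gives a rough view of whether the rest of the system keeps responding.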

The problem is that yanking a disk tests the failure mode of yanking a disk. If this is the sort of failure you expect to see, then perhaps you should look at a mechanical solution. If you wish to test the failure modes you are likely to see, then you need a more sophisticated test rig that will emulate a device and inject the sorts of faults you expect.

Finding ways to dial up the load on your PSU (or drop voltage/limit current to a specific device with an inline filter) might be an idea, since overloaded power supplies seem to be implicated in various people's reports of trouble. Finding ways to generate EMF or "cosmic rays" to induce other kinds of failure is left as an exercise.

Many parts of the stack have software fault injection capabilities. Whether you do this with something like zinject or the wansimulator, the principle is the same. For example, you could easily add wansimulator to an iSCSI rig to inject packet corruption in the network. You can also roll your own with DTrace, which allows you to change the return values of any function.
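
To illustrate the principle only, here is a minimal Python sketch of software fault injection at a function boundary: a wrapper makes a chosen fraction of reads fail with EIO, so you can see how the code above it reacts. It is not zinject, wansimulator, or DTrace, and every name in it (FaultInjectingReader, read_with_retry, the read_fn signature) is hypothetical and invented for the example.

```python
import errno
import random


class FaultInjectingReader:
    """Wrap a read callable and make a fraction of calls fail with EIO."""

    def __init__(self, read_fn, failure_rate, seed=None):
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0 and 1")
        self._read_fn = read_fn          # hypothetical: read_fn(offset, length) -> bytes
        self._failure_rate = failure_rate
        self._rng = random.Random(seed)  # seeded so a failing run can be repeated
        self.injected = 0                # number of faults injected so far

    def read(self, offset, length):
        # Decide per call whether to return the injected error
        # instead of calling the real function.
        if self._rng.random() < self._failure_rate:
            self.injected += 1
            raise OSError(errno.EIO, "injected I/O error")
        return self._read_fn(offset, length)


def read_with_retry(reader, offset, length, attempts=3):
    """Example code under test: retry on EIO, give up after `attempts` tries."""
    for attempt in range(attempts):
        try:
            return reader.read(offset, length)
        except OSError as exc:
            if exc.errno != errno.EIO or attempt == attempts - 1:
                raise
```

With failure_rate=0.0 the wrapper is transparent; with 1.0 every read fails and read_with_retry gives up after its last attempt. Rates in between, with a fixed seed, give a repeatable mix of good and bad reads.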

The COMSTAR project has a test suite that could be leveraged, but it does not appear to be explicitly designed to perform system tests. I'm reasonably confident that the driver teams have test code, too, but I would also expect them to be oriented towards unit testing. A quick search will turn up many fault injection software programs geared towards unit testing.

Finally, there are companies that provide system-level test services.
 -- richard

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
