Hi all,

I realize the subject is a bit incendiary, but we're running into what I view as a design omission in ZFS that is preventing us from building highly available storage infrastructure. I want to bring some attention (again) to this major issue:

Currently we have a set of iSCSI targets published from a storage host which are consumed by a ZFS host. If a _single_disk_ on the storage host goes bad, ZFS pauses for a full 180 seconds before allowing read/write operations to resume. This is an aeon, beyond TCP timeout, etc.

I've read the claims that ZFS is unconcerned with underlying infrastructure, and I agree with the basic sense of those claims (see [1]). However:

* If ZFS experiences _any_ behavior when interacting with a device which is not consistent with known historical performance norms
-and-
* ZFS knows the data it is attempting to fetch from that device is resident on another device

Why then would it not make a decision, dynamically, based on a reasonably small sample of recent device performance, to drop its current attempt and instead fetch the data from the other device?

I don't even think a configurable timeout is that useful - it should be based on a sample of performance from (say) a day - or, hey, for the moment, just to make it easy, a configurable timeout!

As it is, I can't put this in production. 180 seconds is not "highly available"; it's users seeing "The Connection has Timed Out".

Everything - and I mean every other tiny detail - of ZFS that I have seen and used is crystalline perfection. So, ZFS is (for us) a diamond with a little bit of volcanic crust remaining to be polished off.

Is there any intention of dealing with this problem in the (hopefully very) near future?

If you're in the bay area, I will personally deliver (2) cases of the cold beer of your choice (including trappist) if you solve this problem. If offering a bounty would have any effect, I'd offer one. We need this to work.

thanks,
_alex

Related:
[1] http://mail.opensolaris.org/pipermail/zfs-discuss/2008-August/thread.html#50609

--
alex black, founder
the turing studio, inc.
888.603.6023 / main
510.666.0074 / office
[EMAIL PROTECTED]
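
P.S. To make the request concrete, here is a rough Python sketch of the behavior described above: derive a read deadline from a rolling sample of recent latencies on the primary device, and if a read blows past it (or fails), fetch the same block from the redundant copy instead. This is not ZFS code and says nothing about how ZFS is implemented. Every name in it (AdaptiveMirrorReader, read_primary, read_mirror, the multiplier of 4, the window size, the floor) is hypothetical and chosen only for illustration.

```python
import concurrent.futures
import statistics
import time
from collections import deque


class AdaptiveMirrorReader:
    """Illustrative only: read from a primary device, fall back to a mirror."""

    def __init__(self, read_primary, read_mirror, window=1000, multiplier=4.0,
                 floor=0.05, default_timeout=1.0, min_samples=5):
        self.read_primary = read_primary      # hypothetical callable(block) -> data
        self.read_mirror = read_mirror        # hypothetical callable(block) -> data
        self.samples = deque(maxlen=window)   # recent successful primary latencies (s)
        self.multiplier = multiplier
        self.floor = floor                    # never wait less than this many seconds
        self.default_timeout = default_timeout
        self.min_samples = min_samples
        self.pool = concurrent.futures.ThreadPoolExecutor(max_workers=8)

    def timeout(self):
        # Until enough history exists, use the plain configurable timeout.
        if len(self.samples) < self.min_samples:
            return self.default_timeout
        # Otherwise allow a multiple of the recent median latency.
        return max(self.floor, self.multiplier * statistics.median(self.samples))

    def read(self, block):
        start = time.monotonic()
        future = self.pool.submit(self.read_primary, block)
        try:
            data = future.result(timeout=self.timeout())
        except (concurrent.futures.TimeoutError, OSError):
            # The primary is far slower than its own recent history, or it
            # failed outright: stop waiting and use the redundant copy.
            # Note: the abandoned read keeps running in its worker thread;
            # a real implementation would have to cancel or reap it.
            return self.read_mirror(block)
        self.samples.append(time.monotonic() - start)
        return data
```

The point of the sketch is only that the deadline follows what the device has recently been doing, with a fixed configurable timeout as the fallback when there is no history yet. A stalled device then costs one deadline per read, not 180 seconds.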