On Jan 6, 2012, at 3:33 PM, Jim Klimov wrote:
> Hello all,
> I have a new idea up for discussion.
> Several RAID systems have implemented "spread" spare drives
> in the sense that there is not an idling disk waiting to
> receive a burst of resilver data filling it up, but the
> capacity of the spare disk is spread among all drives in
> the array. As a result, the healthy array gets one more
> spindle and works a little faster, and rebuild times are
> often decreased since more spindles can participate in
> repairs at the same time.
Xiotech has a distributed, relocatable model, but the FRU is the whole ISE.
There have been other implementations of more distributed RAIDness in the
past (RAID-1E, etc.).
The big question is whether they are worth the effort. Spares solve a
problem and only impact availability in an indirect manner. For single-parity
solutions, spares can make a big difference in MTTDL, but have almost no impact
on MTTDL for double-parity solutions (e.g., raidz2).
> I don't think I've seen such an idea proposed for ZFS, and
> I do wonder if it is at all possible with variable-width
> stripes? Although if the disk is sliced into 200 metaslabs
> or so, implementing a spread-spare is a no-brainer as well.
Put some thoughts down on paper and work through the math. If it all works
out, let's implement it!
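
As a starting point for that math, here is a rough back-of-the-envelope
sketch in Python. It is only a model under stated assumptions, not a
description of how ZFS or GPFS actually schedule repairs: the function
names and all numbers are hypothetical, every disk is assumed to sustain
the same throughput shared between reads and writes, and there is no
competing user I/O.

```python
def rebuild_hours_dedicated(used_tb, disk_mb_s):
    """Dedicated spare: one spare disk must absorb every rebuilt byte,
    so its write throughput bounds the rebuild time."""
    return used_tb * 1e6 / disk_mb_s / 3600  # TB -> MB -> seconds -> hours


def rebuild_hours_declustered(used_tb, disk_mb_s, n_disks, stripe_width):
    """Declustered layout with spread spare space (simplified model).

    Each of the n_disks - 1 survivors reads (stripe_width - 1)/(n_disks - 1)
    of the lost data and writes 1/(n_disks - 1) of it into its own share
    of the spare capacity.
    """
    per_disk_io_tb = used_tb * stripe_width / (n_disks - 1)
    return per_disk_io_tb * 1e6 / disk_mb_s / 3600


if __name__ == "__main__":
    # Illustrative assumptions: 2 TB of data on the failed disk,
    # 100 MB/s per disk, 47 disks, stripes of 11 blocks (8 data + 3 parity).
    dedicated = rebuild_hours_dedicated(2, 100)
    declustered = rebuild_hours_declustered(2, 100, 47, 11)
    print(f"dedicated spare: {dedicated:.2f} h")
    print(f"declustered:     {declustered:.2f} h")
```

In this model the speedup is simply (n_disks - 1) / stripe_width, so any
benefit depends on the stripe being much narrower than the set of disks.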
> To be honest, I saw this a long time ago in (Falcon?)
> RAID controllers, and recently in a USENIX presentation
> of IBM GPFS on YouTube. In the latter the speaker goes
> into greater depth describing their "declustered RAID"
> approach (as they call it): all blocks (spare, redundancy
> and data) are intermixed evenly on all drives and not in
> a single "group" or a mid-level VDEV as would be for ZFS.
> GPFS with declustered RAID not only decreases rebuild
> times and/or impact of rebuilds on end-user operations,
> but it also happens to increase reliability - there is
> a smaller time window in case of multiple-disk failure
> in a large RAID-6 or RAID-7 array (in the example they
> use 47-disk sets) that the data is left in a "critical
> state" due to lack of redundancy, and there is less data
> overall in such state - so the system goes from critical
> to simply degraded (with some redundancy) in a few minutes.
> Another thing they have in GPFS is temporary offlining
> of disks so that they can catch up when reattached - only
> newer writes (bigger TXG numbers in ZFS terms) are added to
> reinserted disks. I am not sure this exists in ZFS today,
> either. This might simplify physical systems maintenance
> (as it does for IBM boxes - see presentation if interested)
> and quick recovery from temporarily unavailable disks, such
> as when a disk gets a bus reset and is unavailable for writes
> for a few seconds (or more) while the array keeps on writing.
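
The catch-up idea can be illustrated with a minimal Python sketch. It
assumes each block records the TXG in which it was written and that the
system remembers the last TXG a disk saw before it dropped out. `Block`
and `blocks_to_resync` are hypothetical names for illustration, not ZFS
or GPFS interfaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    block_id: int
    birth_txg: int  # transaction group in which the block was written


def blocks_to_resync(blocks, last_txg_seen):
    """Return only the blocks written after the disk dropped out."""
    return [b for b in blocks if b.birth_txg > last_txg_seen]


if __name__ == "__main__":
    pool_blocks = [Block(1, 100), Block(2, 105), Block(3, 110), Block(4, 111)]
    # The disk was last in sync at TXG 105, so only newer blocks are copied.
    missed = blocks_to_resync(pool_blocks, last_txg_seen=105)
    print([b.block_id for b in missed])  # prints [3, 4]
```

The amount of data to copy then scales with the length of the outage
rather than with the size of the disk.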
> I find these ideas cool. I do believe that IBM might get
> angry if ZFS development copy-pasted them "as is", but it
> might nonetheless get us inventing a similar wheel
> that would be a bit different ;)
> There are already several vendors doing this in some way,
> so perhaps there is no (patent) monopoly in place already...
> And I think all the magic of spread spares and/or "declustered
> RAID" would go into just making another write-block allocator
> in the same league as "raidz" or "mirror" are nowadays...
> BTW, are such allocators pluggable (as software modules)?
> What do you think - can and should such ideas find their
> way into ZFS? Or why not? Perhaps from theoretical or
> real-life experience with such storage approaches?
> //Jim Klimov
> zfs-discuss mailing list
ZFS and performance consulting
illumos meetup, Jan 10, 2012, Menlo Park, CA