I have a new idea up for discussion.
Several RAID systems have implemented "spread" spare drives
in the sense that there is not an idling disk waiting to
receive a burst of resilver data filling it up, but the
capacity of the spare disk is spread among all drives in
the array. As a result, the healthy array gets one more
spindle and works a little faster, and rebuild times are
often decreased since more spindles can participate in
repairs at the same time.
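
To put rough numbers on the rebuild-time argument, here is a
back-of-the-envelope sketch. The disk size, the write throughput
and the function name are made-up assumptions of mine, and the
model only looks at the write side (it ignores the reads needed
to reconstruct the data), so treat it as an illustration only:

```python
def rebuild_hours(lost_tb, write_mb_s, writers, rebuild_share=1.0):
    """Hours needed to rewrite lost_tb of data when `writers` disks
    absorb the writes in parallel, each giving `rebuild_share` of
    its write bandwidth to the rebuild (hypothetical model)."""
    total_mb = lost_tb * 1_000_000  # decimal TB -> MB
    seconds = total_mb / (write_mb_s * writers * rebuild_share)
    return seconds / 3600


# Assumed numbers: 2 TB disks, 100 MB/s sustained writes.
dedicated = rebuild_hours(2, 100, writers=1)  # one hot spare takes it all
spread = rebuild_hours(2, 100, writers=10)    # 10 survivors share the writes

print(f"dedicated spare: {dedicated:.1f} h, spread spare: {spread:.1f} h")
```

In this crude model the rebuild time simply scales down with the
number of disks that can accept the reconstructed data.
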
I don't think I've seen such an idea proposed for ZFS, and
I do wonder whether it is possible at all with variable-width
stripes. Then again, if each disk is sliced into 200 metaslabs
or so, implementing a spread spare is a no-brainer as well.
To be honest, I saw this a long time ago in (Falcon?)
RAID controllers, and recently in a USENIX presentation
of IBM GPFS on YouTube. In the latter the speaker goes
into greater depth describing their "declustered RAID"
approach (as they call it): all blocks - spare, redundancy
and data - are intermixed evenly on all drives and not kept
in a single "group" or a mid-level VDEV as they would be for ZFS.
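
A toy simulation of what "intermixed evenly on all drives" buys
during a rebuild might look like this. The random placement is
only a stand-in of mine (a real declustered layout would be
deterministic), the stripe width of 11 is an assumption, and
only the 47-disk set size comes from the talk:

```python
import random
from collections import Counter


def place_stripes(n_disks, width, n_stripes, seed=0):
    """Put each stripe on `width` distinct disks chosen at random
    (hypothetical stand-in for a real declustering layout)."""
    rng = random.Random(seed)
    return [rng.sample(range(n_disks), width) for _ in range(n_stripes)]


def rebuild_load(stripes, failed):
    """Per surviving disk, count the stripes it must help rebuild."""
    load = Counter()
    for disks in stripes:
        if failed in disks:
            load.update(d for d in disks if d != failed)
    return load


stripes = place_stripes(n_disks=47, width=11, n_stripes=20000)
load = rebuild_load(stripes, failed=0)

print("disks taking part in the rebuild:", len(load))
print("stripes per disk, min/max:", min(load.values()), max(load.values()))
```

With a classic fixed group of 11 disks only the 10 other members
of that group could help; here the work is shared by practically
every surviving disk in the set.
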
GPFS with declustered RAID not only decreases rebuild
times and/or impact of rebuilds on end-user operations,
but it also happens to increase reliability: after a
multiple-disk failure in a large RAID-6 or RAID-7 array
(in the example they use 47-disk sets) there is a smaller
time window during which data is left in a "critical
state" due to lack of redundancy, and there is less data
overall in such a state - so the system goes from critical
to simply degraded (with some redundancy) in a few minutes.
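
The "less data overall in such state" part can be estimated with
a bit of combinatorics. This sketch assumes stripes are placed
uniformly at random over the set and are 11 disks wide; both are
my assumptions, not figures from the presentation:

```python
from math import comb


def fraction_hit(n_disks, width, failed):
    """Fraction of randomly placed `width`-disk stripes that contain
    all `failed` given disks, i.e. that lost `failed` members."""
    if failed > width:
        return 0.0
    return comb(n_disks - failed, width - failed) / comb(n_disks, width)


for k in (1, 2, 3):
    share = fraction_hit(47, 11, k)
    print(f"{k} failed disk(s): {share:.2%} of stripes lost {k} member(s)")
```

In a traditional 11-disk group every stripe would have lost both
members after a double failure, while here only a small share of
the stripes is affected, and those can be repaired first.
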
Another thing they have in GPFS is temporary offlining
of disks so that they can catch up when reattached - only
newer writes (bigger TXG numbers in ZFS terms) are added to
reinserted disks. I am not sure this exists in ZFS today,
either. This might simplify physical systems maintenance
(as it does for IBM boxes - see presentation if interested)
and quick recovery from temporarily unavailable disks, such
as when a disk gets a bus reset and is unavailable for writes
for a few seconds (or more) while the array keeps on writing.
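
The catch-up idea, as I understand it, boils down to something
like the toy mirror below. All class and method names are
hypothetical and this has nothing to do with real ZFS or GPFS
internals; it only shows "copy what is newer than the last TXG
the disk has seen":

```python
from dataclasses import dataclass, field


@dataclass
class Disk:
    blocks: dict = field(default_factory=dict)  # block id -> (txg, data)
    last_txg: int = 0                           # newest txg this disk saw
    online: bool = True


class Mirror:
    """Toy N-way mirror (illustrative only)."""

    def __init__(self, n):
        self.disks = [Disk() for _ in range(n)]
        self.txg = 0

    def write(self, block_id, data):
        """Each write gets a new txg and lands on online disks only."""
        self.txg += 1
        for d in self.disks:
            if d.online:
                d.blocks[block_id] = (self.txg, data)
                d.last_txg = self.txg

    def reattach(self, idx):
        """Copy only blocks born after the returning disk's last txg."""
        stale = self.disks[idx]
        source = next(d for d in self.disks if d.online)
        copied = 0
        for block_id, (txg, data) in source.blocks.items():
            if txg > stale.last_txg:
                stale.blocks[block_id] = (txg, data)
                copied += 1
        stale.last_txg = source.last_txg
        stale.online = True
        return copied


m = Mirror(2)
m.write("a", "v1")
m.write("b", "v1")
m.disks[1].online = False   # e.g. a bus reset: disk 1 misses writes
m.write("a", "v2")
m.write("c", "v1")
copied = m.reattach(1)
print("blocks copied on reattach:", copied)
```

Only the two blocks written while the disk was away are copied,
instead of resilvering the whole disk.
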
I find these ideas cool. I do believe that IBM might get
angry if ZFS development copy-pasted them "as is", but it
might nonetheless get us inventing a similar wheel
that would be a bit different ;)
There are already several vendors doing this in some way,
so perhaps there is no (patent) monopoly in place already...
And I think all the magic of spread spares and/or "declustered
RAID" would go into just making another write-block allocator
in the same league as "raidz" or "mirror" are nowadays...
BTW, are such allocators pluggable (as software modules)?
What do you think - can and should such ideas find their
way into ZFS? Or why not? Perhaps from theoretical or
real-life experience with such storage approaches?
zfs-discuss mailing list