Hi Jim,

On Jan 6, 2012, at 3:33 PM, Jim Klimov wrote:
> Hello all,
>
> I have a new idea up for discussion.
>
> Several RAID systems have implemented "spread" spare drives, in the
> sense that there is no idle disk waiting to receive a burst of
> resilver data that fills it up; instead, the capacity of the spare
> disk is spread among all drives in the array. As a result, the
> healthy array gets one more spindle and works a little faster, and
> rebuild times are often decreased since more spindles can
> participate in repairs at the same time.

Xiotech has a distributed, relocatable model, but the FRU is the whole
ISE. There have been other implementations of more distributed RAIDness
in the past (RAID-1E, etc). The big question is whether they are worth
the effort. Spares solve a serviceability problem and only impact
availability in an indirect manner. For single-parity solutions, spares
can make a big difference in MTTDL, but they have almost no impact on
MTTDL for double-parity solutions (e.g. raidz2).

> I don't think I've seen such an idea proposed for ZFS, and I do
> wonder if it is at all possible with variable-width stripes?
> Although if the disk is sliced into 200 metaslabs or so,
> implementing a spread spare is a no-brainer as well.

Put some thoughts down on paper and work through the math (a couple of
rough sketches are appended at the end of this mail). If it all works
out, let's implement it!
 -- richard

> To be honest, I've seen this a long time ago in (Falcon?) RAID
> controllers, and recently in a USENIX presentation on IBM GPFS on
> YouTube. In the latter, the speaker goes into greater depth
> describing their "declustered RAID" approach (as they call it): all
> blocks - spare, redundancy and data - are intermixed evenly on all
> drives and not kept in a single "group" or mid-level VDEV as they
> would be for ZFS.
>
> http://www.youtube.com/watch?v=2g5rx4gP6yU&feature=related
>
> GPFS with declustered RAID not only decreases rebuild times and the
> impact of rebuilds on end-user operations, it also happens to
> increase reliability: in case of a multiple-disk failure in a large
> RAID-6 or RAID-7 array (in the example they use 47-disk sets), there
> is a smaller time window during which data is left in a "critical
> state" due to lack of redundancy, and there is less data overall in
> that state - so the system goes from critical to simply degraded
> (with some redundancy) in a few minutes.
>
> Another thing they have in GPFS is temporary offlining of disks so
> that they can catch up when reattached - only newer writes (bigger
> TXG numbers in ZFS terms) are added to reinserted disks. I am not
> sure this exists in ZFS today, either. This might simplify physical
> systems maintenance (as it does for IBM boxes - see the presentation
> if interested) and allow quick recovery from temporarily unavailable
> disks, such as when a disk gets a bus reset and is unavailable for
> writes for a few seconds (or more) while the array keeps on writing.
>
> I find these ideas cool. I do believe that IBM might get angry if
> ZFS development copy-pasted them "as is", but it might nonetheless
> get us inventing a similar wheel that would be a bit different ;)
> There are already several vendors doing this in some way, so perhaps
> there is no (patent) monopoly in place...
>
> And I think all the magic of spread spares and/or "declustered RAID"
> would go into just making another write-block allocator in the same
> league as "raidz" or "mirror" are nowadays... BTW, are such
> allocators pluggable (as software modules)?
>
> What do you think - can and should such ideas find their way into
> ZFS? Or why not? Perhaps from theoretical or real-life experience
> with such storage approaches?
>
> //Jim Klimov
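To illustrate why a spread spare helps rebuild times, here is a toy
sketch in Python. It is purely illustrative: the disk count, stripe
width, and random placement below are my own assumptions, not how GPFS
or any future ZFS allocator would actually lay things out.

#!/usr/bin/env python
# Toy illustration of a "spread spare" / declustered layout: each
# stripe's data+parity columns land on a rotating subset of all disks,
# and the disks left out of a given stripe act as that stripe's
# distributed spare capacity.

from collections import Counter
import random

DISKS        = 10        # total disks in the declustered group (assumption)
STRIPE_WIDTH = 4         # data+parity columns per stripe (assumption)
STRIPES      = 10000     # number of stripes to simulate

random.seed(42)

# Each stripe picks STRIPE_WIDTH distinct disks; the remaining disks
# hold no column of this stripe and can absorb its rebuilt column.
layout = [random.sample(range(DISKS), STRIPE_WIDTH) for _ in range(STRIPES)]

failed = 3               # pretend disk 3 dies

# Rebuild work: every stripe that had a column on the failed disk is
# read from its surviving columns and rewritten onto some surviving
# disk that does not already hold a column of that stripe.
reads, writes = Counter(), Counter()
for stripe in layout:
    if failed not in stripe:
        continue
    for d in stripe:
        if d != failed:
            reads[d] += 1
    candidates = [d for d in range(DISKS) if d != failed and d not in stripe]
    writes[random.choice(candidates)] += 1

print("rebuild reads per surviving disk :", dict(sorted(reads.items())))
print("rebuild writes per surviving disk:", dict(sorted(writes.items())))
# With a dedicated hot spare, all of these writes would land on one
# disk; here both reads and writes are spread roughly evenly across the
# survivors, which is why rebuild can go faster.

A real implementation would of course use a deterministic placement
(permutation tables or a combinatorial design) rather than random
choice, and would need to rebalance when the failed disk is replaced,
but the load-spreading effect is the same.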
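And to make the "work through the math" suggestion a bit more concrete,
here is a rough back-of-the-envelope pass at the MTTDL comparison using
the classic exponential-failure approximations. All the input numbers
(group width, per-disk MTBF, resilver time, service delay) are made-up
assumptions for illustration, not measurements, and the model ignores
the failure modes that usually dominate in practice.

#!/usr/bin/env python
# Back-of-the-envelope MTTDL: hot/spread spare vs. no spare, for
# single- and double-parity groups. Illustrative numbers only.

N        = 8         # disks in one raidz group (assumption)
MTBF     = 1.0e6     # per-disk MTBF, hours (assumption)
RESILVER = 12.0      # repair time when a spare kicks in immediately, hours
SERVICE  = 72.0      # extra logistics delay when there is no spare, hours

def mttdl_raidz1(n, mtbf, mttr):
    # data loss needs a 2nd failure while the 1st is being repaired
    return mtbf**2 / (n * (n - 1) * mttr)

def mttdl_raidz2(n, mtbf, mttr):
    # data loss needs a 3rd failure while two are being repaired
    return mtbf**3 / (n * (n - 1) * (n - 2) * mttr**2)

HOURS_PER_YEAR = 24 * 365.0

for label, mttr in (("hot/spread spare", RESILVER),
                    ("no spare", RESILVER + SERVICE)):
    for name, fn in (("raidz1", mttdl_raidz1), ("raidz2", mttdl_raidz2)):
        mttdl_years = fn(N, MTBF, mttr) / HOURS_PER_YEAR
        p_loss_year = 1.0 / mttdl_years   # rough annualized loss probability
        print("%-16s %s  MTTDL ~ %9.2e years   P(loss)/yr ~ %.1e"
              % (label, name, mttdl_years, p_loss_year))

With these particular assumptions the single-parity numbers change
noticeably depending on whether a spare is available, while the
double-parity numbers are astronomically large either way - which is
the sense in which spares have almost no practical impact on raidz2
MTTDL: long before you reach those figures, other failure modes
(controller, cabling, software, operator) dominate.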
--
ZFS and performance consulting
http://www.RichardElling.com
illumos meetup, Jan 10, 2012, Menlo Park, CA
http://www.meetup.com/illumos-User-Group/events/41665962/