Note: more analysis of the GPFS implementations is needed, but that will take
more time than I'll spend this evening :-) Quick hits below...
On Jan 7, 2012, at 7:15 PM, Tim Cook wrote:
> On Sat, Jan 7, 2012 at 7:37 PM, Richard Elling <richard.ell...@gmail.com> wrote:
> Hi Jim,
> On Jan 6, 2012, at 3:33 PM, Jim Klimov wrote:
> > Hello all,
> > I have a new idea up for discussion.
> > Several RAID systems have implemented "spread" spare drives
> > in the sense that there is not an idling disk waiting to
> > receive a burst of resilver data filling it up, but the
> > capacity of the spare disk is spread among all drives in
> > the array. As a result, the healthy array gets one more
> > spindle and works a little faster, and rebuild times are
> > often decreased since more spindles can participate in
> > repairs at the same time.
> Xiotech has a distributed, relocatable model, but the FRU is the whole ISE.
> There have been other implementations of more distributed RAIDness in the
> past (RAID-1E, etc).
> The big question is whether they are worth the effort. Spares solve a
> problem and only impact availability in an indirect manner. For single-parity
> solutions, spares can make a big difference in MTTDL, but have almost no
> effect on MTTDL for double-parity solutions (e.g. raidz2).
> I disagree. Dedicated spares impact far more than availability. During a
> rebuild performance is, in general, abysmal.
In ZFS, there is a resilver throttle that is designed to ensure that resilvering
does not impact interactive performance. Do you have data that suggests otherwise?
> ZIL and L2ARC will obviously help (L2ARC more than ZIL),
ZIL makes zero impact on resilver. I'll have to check to see if L2ARC is still
used, but due to the nature of the ARC design, read-once workloads like backup or
resilver do not tend to negatively impact frequently used data.
> but at the end of the day, if we've got a 12 hour rebuild (fairly
> conservative in the days of 2TB
> SATA drives), the performance degradation is going to be very real for
I'd like to see some data on this for modern ZFS implementations (post Summer
> With distributed parity and spares, you should in theory be able to cut this
> down an order of magnitude.
> I feel as though you're brushing this off as not a big deal when it's an
> EXTREMELY big deal (in my mind). In my opinion you can't just approach this
> from an MTTDL perspective, you also need to take into account user
> experience. Just because I haven't lost data, doesn't mean the system isn't
> (essentially) unavailable (sorry for the double negative and repeated
> parenthesis). If I can't use the system due to performance being a fraction
> of what it is during normal production, it might as well be an outage.
So we have a method to analyze the ability of a system to perform during
performability. This can be applied to computer systems and we've done some
specifically on RAID arrays. See also
Hence my comment about "doing some math" :-)
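
As a rough illustration of the kind of back-of-the-envelope math meant here,
the sketch below compares rebuild time for a dedicated spare against a spread
spare, then feeds that time into the common textbook MTTDL approximation.
Everything in it is an assumption for illustration only: the function names,
the 50 MB/s write rate, the 1,000,000 hour MTBF, the 10-disk group, and the
idea that rebuild writes scale linearly with the number of disks sharing them.
The model also assumes independent failures and ignores unrecoverable read
errors and the resilver throttle, so it is not a performability analysis and
not how ZFS or GPFS compute anything.

```python
HOURS_PER_YEAR = 24 * 365


def rebuild_hours(capacity_tb, write_mb_s, writers=1):
    """Hours to rewrite one disk's worth of data.

    writers=1 models a dedicated spare (one disk absorbs every write).
    writers>1 models a spread spare, assuming the writes divide evenly
    across that many disks (an idealization).
    """
    total_mb = capacity_tb * 1e6  # decimal TB to MB
    return total_mb / (write_mb_s * writers) / 3600


def mttdl_years(n_disks, parity, mtbf_hours, mttr_hours):
    """Simplified MTTDL for one redundancy group, in years.

    parity=1: MTBF^2 / (N * (N-1) * MTTR)
    parity=2: MTBF^3 / (N * (N-1) * (N-2) * MTTR^2)
    Assumes independent, exponentially distributed failures.
    """
    numerator = mtbf_hours ** (parity + 1)
    denominator = mttr_hours ** parity
    for k in range(parity + 1):
        denominator *= n_disks - k
    return numerator / denominator / HOURS_PER_YEAR


if __name__ == "__main__":
    mtbf = 1_000_000  # hours, illustrative only
    slow = rebuild_hours(2, 50)              # dedicated spare, 2 TB disk
    fast = rebuild_hours(2, 50, writers=10)  # spread over 10 disks
    print(f"rebuild: {slow:.1f} h dedicated, {fast:.1f} h spread")
    for parity in (1, 2):
        for label, mttr in (("dedicated", slow), ("spread", fast)):
            years = mttdl_years(10, parity, mtbf, mttr)
            print(f"parity={parity} {label}: MTTDL ~ {years:.3g} years")
```

In this model a shorter rebuild improves MTTDL linearly for single parity and
quadratically for double parity. Whether the double-parity gain matters in
practice is a separate question, since the starting value is already very
large and other failure modes are not modeled.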
> > I don't think I've seen such idea proposed for ZFS, and
> > I do wonder if it is at all possible with variable-width
> > stripes? Although if the disk is sliced in 200 metaslabs
> > or so, implementing a spread-spare is a no-brainer as well.
> Put some thoughts down on paper and work through the math. If it all works
> out, let's implement it!
> -- richard
> I realize it's not intentional Richard, but that response is more than a bit
> condescending. If he could just put it down on paper and code something up,
> I strongly doubt he would be posting his thoughts here. He would be posting
> results. The intention of his post, as far as I can tell, is to perhaps
> inspire someone who CAN just write down the math and write up the code to do
> so. Or at least to have them review his thoughts and give him a dev's
> perspective on how viable bringing something like this to ZFS is. I fear
> responses like "the code is there, figure it out" make the *aris community
> no better than the Linux one.
When I talk about spares in tutorials, we discuss various tradeoffs and how to
the systems. Interestingly, for the GPFS case, the mirrors example clearly
benefit of declustered RAID. However, the triple-parity example (similar to
not so persuasive. If you have raidz3 + spares, then why not go ahead and do
In the tutorial we work through a raidz2 + spare vs raidz2 case and the raidz2
is better in both performance and dependability without sacrificing space (an
It is not very difficult to add a raidz4 or indeed any number of additional
there is a point of diminishing returns, usually when some other system
becomes more critical than the RAID protection. So, raidz4 + spare is less
than raidz5, and so on.
> > To be honest, I've seen this a long time ago in (Falcon?)
> > RAID controllers, and recently - in a USENIX presentation
> > of IBM GPFS on YouTube. In the latter the speaker goes
> > into greater depth describing how their "declustered RAID"
> > approach (as they call it: all blocks - spare, redundancy
> > and data are intermixed evenly on all drives and not in
> > a single "group" or a mid-level VDEV as would be for ZFS).
> > http://www.youtube.com/watch?v=2g5rx4gP6yU&feature=related
> > GPFS with declustered RAID not only decreases rebuild
> > times and/or impact of rebuilds on end-user operations,
> > but it also happens to increase reliability - there is
> > a smaller time window in case of multiple-disk failure
> > in a large RAID-6 or RAID-7 array (in the example they
> > use 47-disk sets) that the data is left in a "critical
> > state" due to lack of redundancy, and there is less data
> > overall in such state - so the system goes from critical
> > to simply degraded (with some redundancy) in a few minutes.
> > Another thing they have in GPFS is temporary offlining
> > of disks so that they can catch up when reattached - only
> > newer writes (bigger TXG numbers in ZFS terms) are added to
> > reinserted disks. I am not sure this exists in ZFS today,
> > either. This might simplify physical systems maintenance
> > (as it does for IBM boxes - see presentation if interested)
> > and quick recovery from temporarily unavailable disks, such
> > as when a disk gets a bus reset and is unavailable for writes
> > for a few seconds (or more) while the array keeps on writing.
> > I find these ideas cool. I do believe that IBM might get
> > angry if ZFS development copy-pasted them "as is", but it
> > might nonetheless get us inventing a similar wheel
> > that would be a bit different ;)
> > There are already several vendors doing this in some way,
> > so perhaps there is no (patent) monopoly in place already...
> > And I think all the magic of spread spares and/or "declustered
> > RAID" would go into just making another write-block allocator
> > in the same league as "raidz" or "mirror" are nowadays...
> > BTW, are such allocators pluggable (as software modules)?
> > What do you think - can and should such ideas find their
> > way into ZFS? Or why not? Perhaps from theoretical or
> > real-life experience with such storage approaches?
> > //Jim Klimov
> > _______________________________________________
> > zfs-discuss mailing list
> > email@example.com
> > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
> ZFS and performance consulting
> illumos meetup, Jan 10, 2012, Menlo Park, CA
> As always, feel free to tell me why my rant is completely off base ;)