Re: [zfs-discuss] ZFS raidz on top of hardware raid0

2011-08-30 Thread Daniel Carosone
On Mon, Aug 29, 2011 at 11:40:34PM -0400, Edward Ned Harvey wrote:
  On Sat, Aug 27, 2011 at 08:44:13AM -0700, Richard Elling wrote:
   I'm getting a bit tired of people designing for fast resilvering.
  
  It is a design consideration, regardless, though your point is valid
  that it shouldn't be the overriding consideration.
 
 I disagree.  I think if you build a system that will literally never
 complete a resilver, or if the resilver requires weeks or months to
 complete, then you've fundamentally misconfigured your system.  Avoiding
 such situations should be a top priority.  Such a misconfiguration is
 sometimes the case with people building 21-disk raidz3 and similar
 configurations...

Ok, yes, for these extreme cases, any one of the considerations gets a
veto, because the pool is unserviceable.

Beyond that, though, Richard's point is that optimising for resilver
time to the exclusion of other requirements will produce bad designs.
In my extended example, I mentioned resilver and recovery times and
impacts, but only in amongst other factors.

Another way of putting it is that pool configs that will be pessimal for
resilver will likely also be pessimal for other considerations
(general iops performance being the obvious closely-linked case).

--
Dan.



Re: [zfs-discuss] ZFS raidz on top of hardware raid0

2011-08-29 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Daniel Carosone
 
 On Sat, Aug 27, 2011 at 08:44:13AM -0700, Richard Elling wrote:
  I'm getting a bit tired of people designing for fast resilvering.
 
 It is a design consideration, regardless, though your point is valid
 that it shouldn't be the overriding consideration.

I disagree.  I think if you build a system that will literally never
complete a resilver, or if the resilver requires weeks or months to
complete, then you've fundamentally misconfigured your system.  Avoiding
such situations should be a top priority.  Such a misconfiguration is
sometimes the case with people building 21-disk raidz3 and similar
configurations...



Re: [zfs-discuss] ZFS raidz on top of hardware raid0

2011-08-28 Thread Richard Elling
On Aug 26, 2011, at 4:02 PM, Brandon High bh...@freaks.com wrote:

 On Fri, Aug 12, 2011 at 6:34 PM, Tom Tang thomps...@supermicro.com wrote:
 Suppose I want to build a 100-drive storage system, wondering if there are
 any disadvantages to setting up 20 arrays of HW RAID0 (5 drives each), then
 setting up a ZFS file system on these 20 virtual drives and configuring them
 as RAIDZ?
 
 A 20-device wide raidz is a bad idea. Making those devices from
 stripes just compounds the issue.

Yes... you need to think in reverse. Instead of making highly dependable
solutions out of unreliable components, you need to make judicious use
of reliable components. In other words, RAID-10 is much better than RAID-01,
or in this case, RAID-z0 is much better than RAID-0z.

 The biggest problem is that resilvering would be a nightmare, and
 you're practically guaranteed to have additional failures or read
 errors while degraded.

I'm getting a bit tired of people designing for fast resilvering. This is
akin to buying a car based on how easy it is to change a flat tire. It
is a better idea to base your decision on cost, fuel economy, safety,
or even color.

 You would achieve better performance, error detection and recovery by
 using several top-level raidz. 20 x 5-disk raidz would give you very
 good read and write performance with decent resilver times and 20%
 overhead for redundancy. 10 x 10-disk raidz2 would give more
 protection, but a little less performance, and higher resilver times.

A 20 x 5-disk raidz (RAID-z0) is a superior design in every way. Using 
the simple Mean Time To Data Loss (MTTDL) model, for disks with 1 million
hours rated Mean Time Between Failure (MTBF) and a Mean Time To Repair 
(MTTR) of 10 days:
   a 5-disk RAID-0 has an MTTDL of 199,728 hours
   a 20-way raidz of those RAID-0 devices (RAID-0z) has an MTTDL of 437,124 hours
as compared to:
   a 20 x 5-disk raidz (RAID-z0) has an MTTDL of 12,395,400 hours

For MTTDL, 12,395,400 hours is better than 437,124 hours
QED
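
For anyone who wants to reproduce the comparison, here is a rough Python
sketch using the common textbook MTTDL approximations. The constants behind
the exact figures above may differ slightly, so expect the same order of
magnitude rather than identical numbers.

MTBF = 1_000_000.0      # per-disk MTBF, hours
MTTR = 10 * 24.0        # repair window, hours (10 days)

def mttdl_stripe(n, mtbf=MTBF):
    # RAID-0: any single failure among n devices loses the data
    return mtbf / n

def mttdl_single_parity(n, mtbf, mttr=MTTR):
    # single-parity group of n devices: data loss needs a second
    # failure inside the repair window of the first
    return mtbf ** 2 / (n * (n - 1) * mttr)

# RAID-0z: one 20-way raidz built from 5-disk RAID-0 LUNs
lun = mttdl_stripe(5)                        # ~200,000 h per LUN
raid_0z = mttdl_single_parity(20, lun)       # ~440,000 h

# RAID-z0: 20 top-level 5-disk raidz vdevs; losing any vdev loses the pool
raid_z0 = mttdl_single_parity(5, MTBF) / 20  # ~10,000,000 h

print(f"RAID-0z: {raid_0z:,.0f} h   RAID-z0: {raid_z0:,.0f} h")

The RAID-z0 figure wins because each parity group is narrow and made of raw
disks, rather than of LUNs that are themselves five times more likely to fail.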

  -- richard




Re: [zfs-discuss] ZFS raidz on top of hardware raid0

2011-08-28 Thread Daniel Carosone
On Sat, Aug 27, 2011 at 08:44:13AM -0700, Richard Elling wrote:
 I'm getting a bit tired of people designing for fast resilvering.

It is a design consideration, regardless, though your point is valid
that it shouldn't be the overriding consideration. 

To the original question and poster: 

This often arises out of another type of consideration, that of the
size of a failure unit.  When plugging together systems at almost
any scale beyond a handful of disks, there are many kinds of groupings
of disks whereby the whole group may disappear if a certain component
fails: controllers, power supplies, backplanes, cables, network/fabric
switches, etc.  The probabilities of each of these vary,
often greatly, but they can shape and constrain a design.

I'm going to choose a deliberately exaggerated example, to illustrate
the discussion and recommendations in the thread, using the OP's
numbers.

Let's say that I have 20 5-disk little NAS boxes, each with their own
single power supply and NIC.  Each is an iSCSI target, and can serve
up either 5 bare-disk LUNs, or a single LUN for the whole box, backed
by internal RAID. Internal RAID can be 0 or 5. 

Clearly, a-box-of-5-disks is an independent failure unit, with a
non-trivial failure probability from a variety of possible causes. I'd
better plan my pool accordingly.

The first option is to simplify the configuration, representing
the obvious failure unit as a single LUN, just a big disk.  There is
merit in simplicity, especially for the humans involved if they're not
sophisticated and experienced ZFS users (or else why would they be
asking these questions?). This may prevent confusion and possible
mistakes (at 3am under pressure, even experienced admins make those). 

This gives us 20 disks to make a pool, of whatever layout suits our
performance and resiliency needs.  Regardless of what disks are used,
a 20-way RAIDZ is unlikely to be a good answer.  2x 10-way raidz2, 4x
5-way raidz1, 2-way and 3-way mirrors, might all be useful depending
on circumstances. (As an aside, mirrors might be the layout of choice
if switch failures are also to be taken into consideration, for
practical network topologies.)
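
To put rough numbers on those candidates (Python sketch; the layouts are just
the ones named above, treating each box's LUN as one "disk"):

layouts = [
    # (description, vdev width, redundancy per vdev)
    ("2 x 10-way raidz2", 10, 2),
    ("4 x 5-way raidz1",   5, 1),
    ("10 x 2-way mirror",  2, 1),
]

for name, width, redundancy in layouts:
    usable = (width - redundancy) / width
    print(f"{name:20s} usable {usable:.0%}, "
          f"survives {redundancy} box failure(s) per vdev")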

The second option is to give ZFS all the disks individually. We will
embed our knowledge of the failure domains into the pool structure,
choosing which disks go in which vdev accordingly. 

The simplest expression of this is to take the same layout we chose
above for 20 big disks and make 5 copies of it, one for each of the 5
disk positions within a box, with each copy's vdevs becoming top-level
vdevs in the same pool. Think of making 5 separate pools with the same
layout as the previous case, and stacking them together into one. (As
another aside, in previous discussions I've also recommended
considering multiple pools vs multiple vdevs; that still applies, but I
won't reiterate it here.)

If our pool had enough redundancy for our needs before, we will now
have 5 times as many top-level vdevs, which will experience tolerable
failures in groups of 5 if a disk box dies, for the same overall
result.  
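
To make that mapping concrete, here is a hypothetical Python sketch of how
the disk-to-vdev assignment might be generated (box/slot names are invented
for illustration).  No vdev contains two disks from the same box, so a dead
box costs each affected vdev at most one member:

BOXES, SLOTS = 20, 5
RAIDZ_WIDTH = 5      # e.g. 4 x 5-way raidz1 per copy of the layout

for slot in range(SLOTS):                    # one copy of the layout per slot
    disks = [f"box{box:02d}-slot{slot}" for box in range(BOXES)]
    for i in range(0, BOXES, RAIDZ_WIDTH):   # carve the copy into vdevs
        print("raidz1", " ".join(disks[i:i + RAIDZ_WIDTH]))

The output is only a paper plan, but the point is that the vdev grouping
encodes the failure domains.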

ZFS generally does better this way.  We will have more direct
concurrency, because ZFS's device tree maps to spindles, rather than
to a more complex interaction of underlying components. Physical disk
failures can now be seen by ZFS as such, and don't get amplified to
whole LUN failures (RAID0) or performance degradations during internal
reconstruction (RAID5). ZFS will prefer not to allocate new data on a
degraded vdev until it is repaired, but needs to know about it in the
first place. Even before we talk about recovery, ZFS can likely report
errors better than the internal RAID, which may just hide an issue
long enough for it to become a real problem during another later event.

If we can (e.g.) assign the WWNs of the exported LUNs according to a
scheme that makes disk location obvious, we're less likely to get
confused by all the extra disks.  The structure is still
apparent.  

(There are more layouts we can now create using the extra disks, but
we lose the simplicity, and they don't really enhance this example for
the general case.  Very careful analysis will be required, and
errors under pressure might result in a situation where the system
works, but later resiliency is compromised.  This is especially true
if hot-spares are involved.) 

So, the ZFS preference is definitely for individual disks.  What might
override this preference, and cause us to use LUNs over the internal
raid, other than the perception of simplicity due to inexperience?
Some possibilities are below.

Because local reconstructions within a box may be much faster than
over the network.  Remember, though, that we trust ZFS more than
RAID5 (even before any specific implementation has a chance to add its
own bugs and wrinkles). So, effectively, after such a local RAID5
reconstruction, we'd want to run a scrub anyway - at which point we
might as well just have let ZFS resilver.  If we have more than one
top-level vdev, which we 

Re: [zfs-discuss] ZFS raidz on top of hardware raid0

2011-08-26 Thread Brandon High
On Fri, Aug 12, 2011 at 6:34 PM, Tom Tang thomps...@supermicro.com wrote:
 Suppose I want to build a 100-drive storage system, wondering if there are any
 disadvantages to setting up 20 arrays of HW RAID0 (5 drives each), then
 setting up a ZFS file system on these 20 virtual drives and configuring them as RAIDZ?

A 20-device wide raidz is a bad idea. Making those devices from
stripes just compounds the issue.

The biggest problem is that resilvering would be a nightmare, and
you're practically guaranteed to have additional failures or read
errors while degraded.

You would achieve better performance, error detection and recovery by
using several top-level raidz. 20 x 5-disk raidz would give you very
good read and write performance with decent resilver times and 20%
overhead for redundancy. 10 x 10-disk raidz2 would give more
protection, but a little less performance, and higher resilver times.
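
A quick back-of-envelope comparison of those two options (Python sketch, just
restating the numbers above in one place):

layouts = [
    # (description, vdev width, parity disks per vdev)
    ("20 x 5-disk raidz1",   5, 1),
    ("10 x 10-disk raidz2", 10, 2),
]

for name, width, parity in layouts:
    print(f"{name:22s} parity overhead {parity / width:.0%}, "
          f"tolerates {parity} failure(s) per vdev, "
          f"reads {width - 1} surviving disks to resilver one")

Both spend 20% of raw capacity on parity; the raidz2 layout buys a second
tolerated failure per vdev at the cost of wider resilver reads.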

-B

-- 
Brandon High : bh...@freaks.com


Re: [zfs-discuss] ZFS raidz on top of hardware raid0

2011-08-15 Thread Cindy Swearingen

D'oh. I shouldn't answer questions first thing Monday morning.

I think you should test this configuration with and without the
underlying hardware RAID.

If RAIDZ is the right redundancy level for your workload,
you might be pleasantly surprised with a RAIDZ configuration
built on the h/w raid array in JBOD mode.

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

cs

On 08/15/11 08:41, Cindy Swearingen wrote:


Hi Tom,

I think you should test this configuration with and without the
underlying hardware RAID.

If RAIDZ is the right redundancy level for your workload,
you might be pleasantly surprised.

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

Thanks,

Cindy

On 08/12/11 19:34, Tom Tang wrote:
Suppose I want to build a 100-drive storage system, wondering if there
are any disadvantages to setting up 20 arrays of HW RAID0 (5 drives
each), then setting up a ZFS file system on these 20 virtual drives and
configuring them as RAIDZ?


I understand people always say ZFS prefers not to run on top of HW RAID.
In this case, the HW RAID0 is only for striping (to allow a higher data
transfer rate), while the actual RAID5 (i.e. RAIDZ) is done via ZFS, which
takes care of all the checksum/error detection/auto-repair.  I guess this
will not affect any of the advantages of using ZFS, while I could still get
a higher data transfer rate.  Wondering if that's the case?
Any suggestion or comment?  Please kindly advise.  Thanks!





Re: [zfs-discuss] ZFS raidz on top of hardware raid0

2011-08-15 Thread Bob Friesenhahn

On Fri, 12 Aug 2011, Tom Tang wrote:

Suppose I want to build a 100-drive storage system, wondering if
there are any disadvantages to setting up 20 arrays of HW RAID0 (5
drives each), then setting up a ZFS file system on these 20 virtual
drives and configuring them as RAIDZ?


The main concern would be resilver times if a drive in one of the HW
RAID0s fails.  The resilver time would be similar to that of one huge
disk drive, since there would not be any useful concurrency.
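
As a rough illustration of the amount of data involved (Python; the disk size
and rebuild rate below are assumptions for illustration only):

disk_tb  = 2.0      # capacity per disk, TB (assumed)
rate_mbs = 200.0    # sustained rebuild rate, MB/s (assumed)

lun_tb = 5 * disk_tb                    # the whole striped LUN is rewritten
hours  = lun_tb * 1e6 / rate_mbs / 3600
print(f"~{hours:.0f} hours to rewrite a {lun_tb:.0f} TB LUN at {rate_mbs:.0f} MB/s")

Five disks' worth of data moves through what is effectively a single rebuild
stream.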


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] ZFS raidz on top of hardware raid0

2011-08-15 Thread LaoTsao
imho, not a good idea: if any two hdds in different raid0 groups fail, the zpool is dead
if possible just do one-hdd raid0 (pass each disk through), then use zfs to do mirroring
raidz or raidz2 would be the last choice
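
A rough sanity check on that, in Python, assuming a second and independent
disk failure while the pool is degraded:

# With 100 disks in 20 x 5-disk RAID0 groups and one disk already dead, the
# chance that a second failure lands in a *different* group -- killing a
# second LUN and the single-parity raidz -- is:
print(f"{95 / 99:.1%}")   # ~96.0%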

Sent from my iPad
Hung-Sheng Tsao ( LaoTsao) Ph.D

On Aug 12, 2011, at 21:34, Tom Tang thomps...@supermicro.com wrote:

 Suppose I want to build a 100-drive storage system, wondering if there are any
 disadvantages to setting up 20 arrays of HW RAID0 (5 drives each), then
 setting up a ZFS file system on these 20 virtual drives and configuring them as RAIDZ?
 
 I understand people always say ZFS prefers not to run on top of HW RAID.
 In this case, the HW RAID0 is only for striping (to allow a higher data
 transfer rate), while the actual RAID5 (i.e. RAIDZ) is done via ZFS, which
 takes care of all the checksum/error detection/auto-repair.  I guess this
 will not affect any of the advantages of using ZFS, while I could still get
 a higher data transfer rate.  Wondering if that's the case?
 
 Any suggestion or comment?  Please kindly advise.  Thanks!
 -- 
 This message posted from opensolaris.org