Re: [zfs-discuss] Recommendations

2010-11-29 Thread Paul Piscuc
Hi,

Thanks for the quick reply. Now that you have mentioned it, we have a
different question: what is the advantage of using spare disks instead of
including them in the raid-z array? If the system pool is on mirrored disks,
I think that would be enough (hopefully). When one disk fails, isn't it
better to have a spare disk on hold, instead of one more disk in the raid-z
and no spares (or just a few)? Or, rephrased: is it safer and faster to
replace a disk in a raid-z3 and rebuild the data from the other disks, or to
have a raid-z2 with a spare disk?

Thank you,

On Mon, Nov 29, 2010 at 6:03 AM, Erik Trimble erik.trim...@oracle.com wrote:

 On 11/28/2010 1:51 PM, Paul Piscuc wrote:

 Hi,

 We are a company that wants to replace our current storage layout with one
 that uses ZFS. We have been testing it for a month now, and everything looks
 promising. One element that we cannot determine is the optimum number of
 disks in a raid-z pool. In the ZFS best practice guide, 7, 9 and 11 disks are
 recommended to be used in a single raid-z2.  On the other hand, another user
 specifies that the most important part is the distribution of the default
 128k record size across all the disks. So, the recommended layout would be:

 4-disk RAID-Z2 = 128KiB / 2 = 64KiB = good
 5-disk RAID-Z2 = 128KiB / 3 = ~43KiB = not good
 6-disk RAID-Z2 = 128KiB / 4 = 32KiB = good
 10-disk RAID-Z2 = 128KiB / 8 = 16KiB = good

 What are your recommendations regarding the number of disks? We are
 planning to use 2 raid-z2 pools with 8+2 disks, 2 spares, 2 SSDs for L2ARC, 2
 SSDs for ZIL, 2 for syspool, and a similar machine for replication.

 Thanks in advance,


 You've hit on one of the hardest parts of using ZFS - optimization.   Truth
 of the matter is that there is NO one-size-fits-all best solution. It
 heavily depends on your workload type - access patterns, write patterns,
 type of I/O, and size of average I/O request.

 A couple of things here:

 (1) Unless you are using Zvols for raw disk partitions (for use with
 something like a database), the recordsize value is a MAXIMUM value, NOT an
 absolute value.  Thus, if you have a ZFS filesystem with a record size of
 128k, it will break up I/O into 128k chunks for writing, but it will also
 write smaller chunks.  I forget what the minimum size is (512b or 1k, IIRC),
 but what ZFS does is use a Variable block size, up to the maximum size
 specified in the recordsize property.   So, if recordsize=128k and you
 have a 190k write I/O op, it will write a 128k chunk and a 64k chunk (64k
 being the smallest power of two greater than the remaining 62k of data).
 It WON'T write two 128k chunks.

 (2) #1 comes up a bit when you have a mix of file sizes - for instance,
 home directories, where you have lots of small files (initialization files,
 source code, etc.) combined with some much larger files (images, mp3s,
 executable binaries, etc.).  Thus, such a filesystem will have a wide
 variety of chunk sizes, which makes optimization difficult, to say the
 least.

 (3) For *random* I/O, a raidZ of any number of disks performs roughly like
 a *single* disk in terms of IOPs and a little better than a single disk in
 terms of throughput.  So, if you have considerable amounts of random I/O,
 you should really either use small raidz configs (no more than 4 data
 disks), or switch to mirrors instead.

 (4) For *sequential* or large-size I/O, a raidZ performs roughly equivalent
 to a stripe of the same number of data disks. That is, a N-disk raidz2 will
 perform about the same as a (N-2) disk stripe in terms of throughput and
 IOPS.

 (5) As I mentioned in #1, *all* ZFS I/O is broken up into
 powers-of-two-sized chunks, even if the last chunk must have some padding in
 it to get to a power-of-two.   This has implications as to the best number
 of disks in a raidZ(n).


 I'd have to re-look at the ZFS Best Practices Guide, but I'm pretty sure
 the recommendation of 7, 9, or 11 disks was for a raidz1, NOT a raidz2.  Due
 to #5 above, best performance comes with an EVEN number of data disks in any
 raidZ, so a write to any disk is always a full portion of the chunk, rather
 than a partial one (that sounds funny, but trust me).  The best balance of
 size, IOPs, and throughput is found in the mid-size raidZ(n) configs, where
 there are 4, 6 or 8 data disks.


 Honestly, even with you describing a workload, it will be hard for us to
 give you an exact answer. My best suggestion is to do some testing with
 raidZ(n) of different sizes, to see the tradeoffs between size and
 performance.


 Also, in your sample config, unless you plan to use the spare disks for
 redundancy on the boot mirror, it would be better to configure 2 x 11-disk
 raidZ3 than 2 x 10-disk raidZ2 + 2 spares. Better reliability.


 --
 Erik Trimble
 Java System Support
 Mailstop:  usca22-123
 Phone:  x17195
 Santa Clara, CA


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org

Re: [zfs-discuss] Recommendations

2010-11-29 Thread taemun
On 29 November 2010 15:03, Erik Trimble erik.trim...@oracle.com wrote:

 I'd have to re-look at the ZFS Best Practices Guide, but I'm pretty sure
 the recommendation of 7, 9, or 11 disks was for a raidz1, NOT a raidz2.  Due
 to #5 above, best performance comes with an EVEN number of data disks in any
 raidZ, so a write to any disk is always a full portion of the chunk, rather
 than a partial one (that sounds funny, but trust me).  The best balance of
 size, IOPs, and throughput is found in the mid-size raidZ(n) configs, where
 there are 4, 6 or 8 data disks.


Let s = the maximum block size (128KiB).

Let n = the number of disks in a raidz vdev, p = the number of parity disks,
and d = the number of data disks.

Hence, n = d + p

So, for some given numbers of d:
 d   s/d (KiB)
 1   128
 2    64
 3    42.67
 4    32
 5    25.6
 6    21.33
 7    18.29
 8    16
 9    14.22
10    12.8

Hence, for a raidz vdev with a width of 7, d = 6; s/d = 21.33KiB. This isn't
an ideal block size by any stretch of the imagination. Same thing for a
width of 11, d = 10, s/d = 12.8KiB.

What you were aiming for: for ideal performance, one should keep the vdev
width to the form 2^x + p. So, for raidz: 2, 3, 5, 9, 17. raidz2: 3, 4, 6,
10, 18, etc.
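
A quick Python sketch of the same arithmetic (the 128KiB maximum block size
is the only real input; widths of the form 2^x + p come out as even splits):

S = 128.0  # maximum block size (s) in KiB

def chunk_per_disk(width, parity):
    d = width - parity               # data disks in the vdev
    per_disk = S / d                 # KiB sent to each data disk for a full block
    even_split = (d & (d - 1)) == 0  # power-of-two data disk counts divide 128KiB evenly
    return d, per_disk, even_split

for parity in (1, 2, 3):                          # raidz1, raidz2, raidz3
    for width in range(parity + 2, parity + 12):
        d, per_disk, even_split = chunk_per_disk(width, parity)
        print(f"raidz{parity} width {width:2d}: d={d:2d}, "
              f"{per_disk:6.2f} KiB/disk{' (even split)' if even_split else ''}")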

Cheers,
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Recommendations

2010-11-29 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Erik Trimble
 
 (1) Unless you are using Zvols for raw disk partitions (for use with
 something like a database), the recordsize value is a MAXIMUM value, NOT
 an absolute value.  Thus, if you have a ZFS filesystem with a record size
 of 128k, it will break up I/O into 128k chunks for writing, but it will
 also write smaller chunks.  I forget what the minimum size is (512b or 1k,
 IIRC), but what ZFS does is use a Variable block size, up to the maximum
 size specified in the recordsize property.  So, if recordsize=128k and you
 have a 190k write I/O op, it will write a 128k chunk, and a 64k chunk (64k
 being the smallest power of two greater than the remaining 62k of data).
 It WON'T write two 128k chunks.

So ... Suppose there is a raidz2 with 8+2 disks.  You write a 128K chunk,
which gets divided up into 8 parts, and each disk writes a 16K block, right?


It seems to me that limiting the max size of data a disk can write will
ultimately result in more random scattering of information across the
drives, and degrade performance.

We previously calculated (in some other thread) that in order for a drive to
be efficient, which we defined as 99% useful time and 1% wasted time seeking,
each disk would need to be reading/writing roughly 40MB blocks consistently.
(Of course, this depends on the specs of the drive, but typical consumer &
enterprise disks both came out around 40MB.)
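
For reference, a rough Python sketch of that sort of calculation; the ~8 ms
seek time and ~50 MB/s sustained transfer rate below are assumptions for
illustration, not numbers from that thread:

# Back-of-the-envelope sketch of the "~40MB for 99% efficiency" figure.
seek_time_s = 0.008          # assumed average seek + rotational latency
transfer_MB_per_s = 50.0     # assumed sustained sequential transfer rate
target_efficiency = 0.99     # 99% of time transferring, 1% seeking

# efficiency = transfer_time / (transfer_time + seek_time)
# => transfer_time = seek_time * efficiency / (1 - efficiency)
transfer_time_s = seek_time_s * target_efficiency / (1 - target_efficiency)
block_MB = transfer_time_s * transfer_MB_per_s

print(f"Block size for {target_efficiency:.0%} efficiency: ~{block_MB:.0f} MB")
# ~40 MB with these assumptions; faster transfer or shorter seeks shift it.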

So wouldn't it be useful to set the recordsize to something huge? Then if
you've got a large chunk of data to be written, it's actually *permitted* to
be written as a large chunk instead of being forcibly broken up?

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Recommendations

2010-11-29 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Paul Piscuc
 
 looks promising. One element that we cannot determine is the optimum
 number of disks in a raid-z pool. In the ZFS best practice guide, 7, 9 and 11

There are several important things to consider:

-1- Performance in usage.
-2- Cost to buy disks & slots to hold disks.
-3- Resilver / scrub time.

You're already on the right track to answer #1 and #2.  So I want to talk a
little bit about #3.

For typical usage on spindle hard disks, ZFS has a problem with resilver and
scrub time.  It will only resilver or scrub the used areas of the disk, which
seems like it should be faster than doing the whole disk.  But the used space
ends up being a whole bunch of small chunks scattered around the disk,
typically most of the disk is used, and the resilver/scrub does not proceed
in disk order, so you end up doing random seeks all over the disk in order to
read/write nearly the whole disk.  The end result is a resilver time that can
be 1-2 orders of magnitude larger than you expected:  a week or three if you
have a bad configuration (lots of disks in a vdev), or 12-24 hours in the
best case (mirrors and nothing else).

The problem is linearly related to the number of used chunks in the degraded
vdev, which is itself usually approximated as a fraction of the total pool.
So you minimize the problem if you use mirrors, and you maximize it if you
make your pool from one huge raidzN vdev.

On my disks, for a Sun server where this was an issue for me:  if I could
have resilvered the entire disk sequentially, including unused space, it
would have required 2 hrs.  I use ZFS mirrors, and it actually took 12 hrs.
If I had made the pool one big raidzN, it would have needed 20 days.
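
A crude Python sketch of that scaling; the per-disk IOPS, average record
size, and used capacities below are all assumptions for illustration, not
measurements from this pool:

# Crude resilver-time model: if resilver is dominated by random I/O, time is
# roughly (used records that must be walked) / (random IOPS available).
disk_iops = 150                 # assumed random IOPS of a single spindle
avg_record_KiB = 64             # assumed average on-disk record size

def resilver_hours(used_TiB):
    # One random I/O per used record, at disk_iops I/Os per second.
    records = used_TiB * 2**30 / avg_record_KiB   # TiB -> KiB -> record count
    return records / disk_iops / 3600

# A mirror resilver only walks the data on one mirror pair; a single wide
# raidzN vdev has to walk (almost) the whole pool's data.
print(f"mirror pair, 1 TiB used : ~{resilver_hours(1):.0f} h")
print(f"wide raidzN, 10 TiB used: ~{resilver_hours(10) / 24:.0f} days")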

Until this problem is fixed, I recommend using mirrors only, and staying
away from raidzN, unless you're going to build your whole pool out of SSDs.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Recommendations

2010-11-28 Thread Paul Piscuc
Hi,

We are a company that wants to replace our current storage layout with one
that uses ZFS. We have been testing it for a month now, and everything looks
promising. One element that we cannot determine is the optimum number of
disks in a raid-z pool. In the ZFS best practice guide, 7, 9 and 11 disks
are recommended to be used in a single raid-z2.  On the other hand, another
user specifies that the most important part is the distribution of the
default 128k record size across all the disks. So, the recommended layout
would be:

4-disk RAID-Z2 = 128KiB / 2 = 64KiB = good
5-disk RAID-Z2 = 128KiB / 3 = ~43KiB = not good
6-disk RAID-Z2 = 128KiB / 4 = 32KiB = good
10-disk RAID-Z2 = 128KiB / 8 = 16KiB = good

What are your recommendations regarding the number of disks? We are planning
to use 2 raid-z2 pools with 8+2 disks, 2 spares, 2 SSDs for L2ARC, 2 SSDs for
ZIL, 2 for syspool, and a similar machine for replication.

Thanks in advance,
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Recommendations

2010-11-28 Thread Erik Trimble

On 11/28/2010 1:51 PM, Paul Piscuc wrote:

Hi,

We are a company that wants to replace our current storage layout with 
one that uses ZFS. We have been testing it for a month now, and 
everything looks promising. One element that we cannot determine is 
the optimum number of disks in a raid-z pool. In the ZFS best practice 
guide, 7, 9 and 11 disks are recommended to be used in a single 
raid-z2.  On the other hand, another user specifies that the most 
important part is the distribution of the default 128k record size 
across all the disks. So, the recommended layout would be:


4-disk RAID-Z2 = 128KiB / 2 = 64KiB = good
5-disk RAID-Z2 = 128KiB / 3 = ~43KiB = not good
6-disk RAID-Z2 = 128KiB / 4 = 32KiB = good
10-disk RAID-Z2 = 128KiB / 8 = 16KiB = good

What are your recommendations regarding the number of disks? We are 
planning to use 2 raid-z2 pools with 8+2 disks, 2 spare, 2 SSDs for 
L2ARC, 2 SSDs for ZIL, 2 for syspool, and a similar machine for 
replication.


Thanks in advance,



You've hit on one of the hardest parts of using ZFS - optimization.   
Truth of the matter is that there is NO one-size-fits-all best 
solution. It heavily depends on your workload type - access patterns, 
write patterns, type of I/O, and size of average I/O request.


A couple of things here:

(1) Unless you are using Zvols for raw disk partitions (for use with 
something like a database), the recordsize value is a MAXIMUM value, NOT 
an absolute value.  Thus, if you have a ZFS filesystem with a record 
size of 128k, it will break up I/O into 128k chunks for writing, but it 
will also write smaller chunks.  I forget what the minimum size is (512b 
or 1k, IIRC), but what ZFS does is use a Variable block size, up to the 
maximum size specified in the recordsize property.   So, if 
recordsize=128k and you have a 190k write I/O op, it will write a 128k 
chunk and a 64k chunk (64k being the smallest power of two greater than 
the remaining 62k of data).  It WON'T write two 128k chunks.
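
A small Python sketch of the chunking arithmetic described in (1); it mirrors
the 190k = 128k + 64k example as stated, and is only an illustration of that
arithmetic (with an assumed 1k minimum), not ZFS's actual allocation code:

RECORDSIZE_KIB = 128
MIN_BLOCK_KIB = 1          # assumed floor, per the "512b or 1k" recollection

def split_write(size_kib):
    """Split a write into full recordsize chunks plus a power-of-two tail."""
    chunks = []
    while size_kib >= RECORDSIZE_KIB:
        chunks.append(RECORDSIZE_KIB)
        size_kib -= RECORDSIZE_KIB
    if size_kib > 0:
        block = MIN_BLOCK_KIB
        while block < size_kib:   # round the remainder up to a power of two
            block *= 2
        chunks.append(block)
    return chunks

print(split_write(190))   # [128, 64]
print(split_write(62))    # [64]
print(split_write(300))   # [128, 128, 64]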


(2) #1 comes up a bit when you have a mix of file sizes - for instance, 
home directories, where you have lots of small files (initialization 
files, source code, etc.) combined with some much larger files (images, 
mp3s, executable binaries, etc.).  Thus, such a filesystem will have a 
wide variety of chunk sizes, which makes optimization difficult, to say 
the least.


(3) For *random* I/O, a raidZ of any number of disks performs roughly 
like a *single* disk in terms of IOPs and a little better than a single 
disk in terms of throughput.  So, if you have considerable amounts of 
random I/O, you should really either use small raidz configs (no more 
than 4 data disks), or switch to mirrors instead.


(4) For *sequential* or large-size I/O, a raidZ performs roughly 
equivalent to a stripe of the same number of data disks. That is, a 
N-disk raidz2 will perform about the same as a (N-2) disk stripe in 
terms of throughput and IOPS.
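
As a rule of thumb, points (3) and (4) reduce to a couple of lines of
arithmetic; a sketch with assumed per-disk figures (150 random IOPS and
100 MB/s per spindle are illustrative numbers, not measurements):

disk_iops = 150          # assumed random IOPS of one spindle
disk_MBps = 100          # assumed sequential MB/s of one spindle

def raidz_estimate(width, parity):
    data_disks = width - parity
    return {"random_iops": disk_iops,             # roughly a single disk
            "seq_MBps": data_disks * disk_MBps}   # roughly an (N - parity)-disk stripe

def mirror_pool_estimate(pairs):
    # Each 2-way mirror pair adds roughly one disk of random write IOPS
    # (reads can do better) and one disk of sequential write throughput.
    return {"random_iops": pairs * disk_iops,
            "seq_MBps": pairs * disk_MBps}

print("10-disk raidz2 :", raidz_estimate(10, 2))
print("5 mirror pairs :", mirror_pool_estimate(5))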


(5) As I mentioned in #1, *all* ZFS I/O is broken up into 
powers-of-two-sized chunks, even if the last chunk must have some 
padding in it to get to a power-of-two.   This has implications as to 
the best number of disks in a raidZ(n).



I'd have to re-look at the ZFS Best Practices Guide, but I'm pretty sure 
the recommendation of 7, 9, or 11 disks was for a raidz1, NOT a raidz2.  
Due to #5 above, best performance comes with an EVEN number of data 
disks in any raidZ, so a write to any disk is always a full portion of 
the chunk, rather than a partial one (that sounds funny, but trust me).  
The best balance of size, IOPs, and throughput is found in the mid-size 
raidZ(n) configs, where there are 4, 6 or 8 data disks.



Honestly, even with you describing a workload, it will be hard for us to 
give you an exact answer. My best suggestion is to do some testing 
with raidZ(n) of different sizes, to see the tradeoffs between size and 
performance.



Also, in your sample config, unless you plan to use the spare disks for 
redundancy on the boot mirror, it would be better to configure 2 x 
11-disk raidZ3 than 2 x 10-disk raidZ2 + 2 spares. Better reliability.



--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss