Re: [zfs-discuss] raid-z - not even iops distribution

2010-06-24 Thread Robert Milkowski

On 24/06/2010 20:52, Arne Jansen wrote:

Ross Walker wrote:

Raidz is definitely made for sequential IO patterns not random. To 
get good random IO with raidz you need a zpool with X raidz vdevs 
where X = desired IOPS/IOPS of single drive.




I have seen statements like this repeated several times, though
I haven't been able to find an in-depth discussion of why this
is the case. From what I've gathered, every block (what is the
correct term for this? zio block?) written is spread across the
whole raid-z. But in what units? Will a 4k write be split into
512-byte writes? And in the opposite direction, does every block
need to be read fully, even if only parts of it are requested,
because the checksum needs to be verified? Will the parity be
read, too?
If this is all the case, I can see why raid-z effectively reduces
the random-read performance of an array to that of a single device.



http://blogs.sun.com/roch/entry/when_to_and_not_to
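
In short (a simplified sketch, not the actual implementation: it assumes 512-byte 
sectors and a single-parity 4-disk raidz and ignores padding and skip sectors; 
Roch's post and vdev_raidz.c have the real logic), each logical block is split at 
sector granularity across the data columns plus one parity column, so a single 
random read touches every data disk in the vdev:

# Simplified sketch of how raidz1 might lay out one logical block across a
# 4-disk vdev. Assumptions: 512-byte sectors, single parity, no padding or
# skip-sector handling; the real mapping lives in vdev_raidz_map_alloc().
SECTOR = 512

def raidz1_columns(block_bytes, ndisks):
    data_cols = ndisks - 1                    # one column is parity
    sectors = block_bytes // SECTOR           # 16 KB -> 32 sectors
    base, extra = divmod(sectors, data_cols)  # spread sectors over data columns
    data = [base + (1 if i < extra else 0) for i in range(data_cols)]
    parity = max(data)                        # parity column matches the widest data column
    return [parity] + data                    # sectors per column: [P, D0, D1, D2]

print(raidz1_columns(16 * 1024, 4))           # -> [11, 11, 11, 10]

Every disk in the vdev holds a slice of the block, so a random 16KB read has to 
touch all the data disks (and parity too, if verification or reconstruction is 
needed), which is why a raidz vdev delivers roughly the random-read IOPS of a 
single drive.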

--
Robert Milkowski
http://milek.blogspot.com



Re: [zfs-discuss] raid-z - not even iops distribution

2010-06-24 Thread Arne Jansen

Ross Walker wrote:


Raidz is definitely made for sequential IO patterns not random. To get good 
random IO with raidz you need a zpool with X raidz vdevs where X = desired 
IOPS/IOPS of single drive.



I have seen statements like this repeated several times, though
I haven't been able to find an in-depth discussion of why this
is the case. From what I've gathered, every block (what is the
correct term for this? zio block?) written is spread across the
whole raid-z. But in what units? Will a 4k write be split into
512-byte writes? And in the opposite direction, does every block
need to be read fully, even if only parts of it are requested,
because the checksum needs to be verified? Will the parity be
read, too?
If this is all the case, I can see why raid-z effectively reduces
the random-read performance of an array to that of a single device.

Thanks,
Arne


Re: [zfs-discuss] raid-z - not even iops distribution

2010-06-24 Thread Adam Leventhal
Hey Robert,

I've filed a bug to track this issue. We'll try to reproduce the problem and 
evaluate the cause. Thanks for bringing this to our attention.

Adam

On Jun 24, 2010, at 2:40 AM, Robert Milkowski wrote:

> On 23/06/2010 18:50, Adam Leventhal wrote:
>>> Does it mean that for a dataset used for databases and similar environments, 
>>> where basically all blocks have a fixed size and there is no other data, all 
>>> parity information will end up on one (z1) or two (z2) specific disks?
>>> 
>> No. There are always smaller writes to metadata that will distribute parity. 
>> What is the total width of your raidz1 stripe?
>> 
>>   
> 
> 4x disks, 16KB recordsize, 128GB file, random read with 16KB block.
> 
> -- 
> Robert Milkowski
> http://milek.blogspot.com
> 
> 


--
Adam Leventhal, Fishworks    http://blogs.sun.com/ahl



Re: [zfs-discuss] raid-z - not even iops distribution

2010-06-24 Thread Ross Walker
On Jun 24, 2010, at 10:42 AM, Robert Milkowski  wrote:

> On 24/06/2010 14:32, Ross Walker wrote:
>> On Jun 24, 2010, at 5:40 AM, Robert Milkowski  wrote:
>>> On 23/06/2010 18:50, Adam Leventhal wrote:
>>>>> Does it mean that for a dataset used for databases and similar environments, 
>>>>> where basically all blocks have a fixed size and there is no other data, all 
>>>>> parity information will end up on one (z1) or two (z2) specific disks?
>>>> No. There are always smaller writes to metadata that will distribute 
>>>> parity. What is the total width of your raidz1 stripe?
>>> 4x disks, 16KB recordsize, 128GB file, random read with 16KB block.
>> 
>> From what I gather each 16KB record (plus parity) is spread across the raidz 
>> disks. This causes the total random IOPS (write AND read) of the raidz to be 
>> that of the slowest disk in the raidz.
>> 
>> Raidz is definitely made for sequential IO patterns not random. To get good 
>> random IO with raidz you need a zpool with X raidz vdevs where X = desired 
>> IOPS/IOPS of single drive.
> 
> I know that and it wasn't my question.

Sorry, for the OP...




Re: [zfs-discuss] raid-z - not even iops distribution

2010-06-24 Thread Robert Milkowski

On 24/06/2010 15:54, Bob Friesenhahn wrote:

On Thu, 24 Jun 2010, Ross Walker wrote:


Raidz is definitely made for sequential IO patterns not random. To 
get good random IO with raidz you need a zpool with X raidz vdevs 
where X = desired IOPS/IOPS of single drive.


Remarkably, I have yet to see mention of someone testing a raidz that is 
composed entirely of FLASH SSDs.  This should help with the IOPS, 
particularly when reading.


I have.

Briefly:


  X4270 2x Quad-core 2.93GHz, 72GB RAM
  OpenSolaris 2009.06 (snv_111b)
  ARC limited to 4GB
  44x SSDs in an F5100.
  4x SAS HBAs, 4x physical SAS connections to the F5100 (16x SAS 
channels in total), each to a different domain.



1. RAID-10 pool

22x mirrors across domains
ZFS: 16KB recordsize, atime=off
randomread filebench benchmark with a 16KB block size and 1, 16, 
..., 128 threads, 128GB working set.


maximum performance at 128 threads: ~137,000 ops/s

2. RAID-Z pool

11x 4-way RAID-z, each raid-z vdev across domains
ZFS: recordsize=16k, atime=off
randomread filebench benchmark with a 16KB block size and 1, 16, 
..., 128 threads, 128GB working set.


maximum performance at 64-128 threads: ~34,000 ops/s

With a ZFS recordsize of 32KB it got up to ~41,000 ops/s.
Larger ZFS record sizes produced worse results.



RAID-Z delivered about 3.3x fewer ops/s than RAID-10 here.
SSDs do not make any fundamental change here; RAID-Z's characteristics 
are basically the same whether it is built from SSDs or HDDs.
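
As a rough sanity check of those numbers (the per-SSD IOPS figure below is an 
assumption, and it assumes a mirror serves reads from both sides while a raidz 
vdev behaves like roughly one device for small random reads):

# Back-of-the-envelope check of the RAID-10 vs RAID-Z gap above.
per_ssd_read_iops = 3200                       # assumed, not measured

raid10_estimate = 22 * 2 * per_ssd_read_iops   # 22 mirrors, both sides can serve reads
raidz_estimate = 11 * 1 * per_ssd_read_iops    # 11 raidz vdevs, roughly 1 device each
print(raid10_estimate, raidz_estimate, round(raid10_estimate / raidz_estimate, 1))
# -> 140800 35200 4.0, the same ballpark as the ~137,000 vs ~34,000 ops/s measured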


However, SSDs could of course provide good-enough performance even with 
RAID-Z; at the end of the day it is not about benchmarks but about your 
environment's requirements.


A given number of SSDs in a RAID-Z configuration can deliver the same 
performance as a much greater number of disk drives in a RAID-10 
configuration, so if you don't need much space it could make sense.



--
Robert Milkowski
http://milek.blogspot.com



Re: [zfs-discuss] raid-z - not even iops distribution

2010-06-24 Thread Bob Friesenhahn

On Thu, 24 Jun 2010, Ross Walker wrote:


Raidz is definitely made for sequential IO patterns not random. To 
get good random IO with raidz you need a zpool with X raidz vdevs 
where X = desired IOPS/IOPS of single drive.


Remarkably, I have yet to see mention of someone testing a raidz that is 
composed entirely of FLASH SSDs.  This should help with the IOPS, 
particularly when reading.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/


Re: [zfs-discuss] raid-z - not even iops distribution

2010-06-24 Thread Robert Milkowski

On 24/06/2010 14:32, Ross Walker wrote:

On Jun 24, 2010, at 5:40 AM, Robert Milkowski  wrote:

   

On 23/06/2010 18:50, Adam Leventhal wrote:
 

Does it mean that for a dataset used for databases and similar environments, where 
basically all blocks have a fixed size and there is no other data, all parity 
information will end up on one (z1) or two (z2) specific disks?

 

No. There are always smaller writes to metadata that will distribute parity. 
What is the total width of your raidz1 stripe?


   

4x disks, 16KB recordsize, 128GB file, random read with 16KB block.
 

 From what I gather each 16KB record (plus parity) is spread across the raidz 
disks. This causes the total random IOPS (write AND read) of the raidz to be 
that of the slowest disk in the raidz.

Raidz is definitely made for sequential IO patterns not random. To get good 
random IO with raidz you need a zpool with X raidz vdevs where X = desired 
IOPS/IOPS of single drive.
   


I know that and it wasn't my question.

--
Robert Milkowski
http://milek.blogspot.com



Re: [zfs-discuss] raid-z - not even iops distribution

2010-06-24 Thread Ross Walker
On Jun 24, 2010, at 5:40 AM, Robert Milkowski  wrote:

> On 23/06/2010 18:50, Adam Leventhal wrote:
>>> Does it mean that for a dataset used for databases and similar environments, 
>>> where basically all blocks have a fixed size and there is no other data, all 
>>> parity information will end up on one (z1) or two (z2) specific disks?
>>> 
>> No. There are always smaller writes to metadata that will distribute parity. 
>> What is the total width of your raidz1 stripe?
>> 
>>   
> 
> 4x disks, 16KB recordsize, 128GB file, random read with 16KB block.

From what I gather each 16KB record (plus parity) is spread across the raidz 
disks. This causes the total random IOPS (write AND read) of the raidz to be 
that of the slowest disk in the raidz.

Raidz is definitely made for sequential IO patterns not random. To get good 
random IO with raidz you need a zpool with X raidz vdevs where X = desired 
IOPS/IOPS of single drive.
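
As a back-of-the-envelope illustration of that rule (the per-drive IOPS number 
below is an assumed figure, not a measurement):

# Rough sizing sketch: each raidz vdev delivers roughly the random-read IOPS
# of a single drive, so you need about target_iops / per_drive_iops vdevs.
# The numbers here are illustrative assumptions only.
import math

per_drive_iops = 150      # assumed random-read IOPS of one 7200 rpm disk
target_iops = 5000        # assumed workload requirement
disks_per_vdev = 4        # e.g. 3+1 raidz1

vdevs_needed = math.ceil(target_iops / per_drive_iops)
print(f"~{vdevs_needed} raidz vdevs (~{vdevs_needed * disks_per_vdev} disks) "
      f"for ~{target_iops} random-read IOPS")
# -> ~34 raidz vdevs (~136 disks) for ~5000 random-read IOPS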

-Ross




Re: [zfs-discuss] raid-z - not even iops distribution

2010-06-24 Thread Robert Milkowski

On 23/06/2010 19:29, Ross Walker wrote:

On Jun 23, 2010, at 1:48 PM, Robert Milkowski  wrote:

   

128GB.

Does it mean that for a dataset used for databases and similar environments, where 
basically all blocks have a fixed size and there is no other data, all parity 
information will end up on one (z1) or two (z2) specific disks?
 

What's the record size on those datasets?

8k?

   


16K



Re: [zfs-discuss] raid-z - not even iops distribution

2010-06-24 Thread Robert Milkowski

On 23/06/2010 18:50, Adam Leventhal wrote:

Does it mean that for a dataset used for databases and similar environments, where 
basically all blocks have a fixed size and there is no other data, all parity 
information will end up on one (z1) or two (z2) specific disks?
 

No. There are always smaller writes to metadata that will distribute parity. 
What is the total width of your raidz1 stripe?

   


4x disks, 16KB recordsize, 128GB file, random read with 16KB block.

--
Robert Milkowski
http://milek.blogspot.com




Re: [zfs-discuss] raid-z - not even iops distribution

2010-06-23 Thread Ross Walker
On Jun 23, 2010, at 1:48 PM, Robert Milkowski  wrote:

> 
> 128GB.
> 
> Does it mean that for a dataset used for databases and similar environments, 
> where basically all blocks have a fixed size and there is no other data, all 
> parity information will end up on one (z1) or two (z2) specific disks?

What's the record size on those datasets?

8k?

-Ross



Re: [zfs-discuss] raid-z - not even iops distribution

2010-06-23 Thread Adam Leventhal
> Does it mean that for a dataset used for databases and similar environments, 
> where basically all blocks have a fixed size and there is no other data, all 
> parity information will end up on one (z1) or two (z2) specific disks?

No. There are always smaller writes to metadata that will distribute parity. 
What is the total width of your raidz1 stripe?

Adam

--
Adam Leventhal, Fishworks    http://blogs.sun.com/ahl



Re: [zfs-discuss] raid-z - not even iops distribution

2010-06-23 Thread Robert Milkowski


128GB.

Does it mean that for a dataset used for databases and similar 
environments, where basically all blocks have a fixed size and there is no 
other data, all parity information will end up on one (z1) or two (z2) 
specific disks?




On 23/06/2010 17:51, Adam Leventhal wrote:

Hey Robert,

How big of a file are you making? RAID-Z does not explicitly do the parity 
distribution that RAID-5 does. Instead, it relies on non-uniform stripe widths 
to distribute IOPS.

Adam

On Jun 18, 2010, at 7:26 AM, Robert Milkowski wrote:

   

Hi,


zpool create test raidz c0t0d0 c1t0d0 c2t0d0 c3t0d0 \
  raidz c0t1d0 c1t1d0 c2t1d0 c3t1d0 \
  raidz c0t2d0 c1t2d0 c2t2d0 c3t2d0 \
  raidz c0t3d0 c1t3d0 c2t3d0 c3t3d0 \
  [...]
  raidz c0t10d0 c1t10d0 c2t10d0 c3t10d0

zfs set atime=off test
zfs set recordsize=16k test
(I know...)

Now if I create one large file with filebench and simulate a randomread 
workload with 1 or more threads, then disks on the c2 and c3 controllers get 
about 80% more reads. This happens on both 111b and snv_134. I would rather 
expect all of them to get about the same number of iops.

Any idea why?


--
Robert Milkowski
http://milek.blogspot.com

 


--
Adam Leventhal, Fishworks    http://blogs.sun.com/ahl


   




Re: [zfs-discuss] raid-z - not even iops distribution

2010-06-23 Thread Adam Leventhal
Hey Robert,

How big of a file are you making? RAID-Z does not explicitly do the parity 
distribution that RAID-5 does. Instead, it relies on non-uniform stripe widths 
to distribute IOPS.
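
To make that second point concrete, here is a rough model (an approximation of 
the column mapping in vdev_raidz_map_alloc(), not the actual code; treat the 
details as assumptions) of why back-to-back fixed-size 16KB records can pin the 
parity column to one disk, while smaller writes such as metadata shift it around:

# Rough model: which disk gets the parity column when fixed-size 16 KB records
# are allocated back-to-back on a 4-disk raidz1 with 512-byte sectors.
SECTOR, NDISKS, NPARITY = 512, 4, 1

def alloc_sectors(block_bytes):
    data = block_bytes // SECTOR
    parity = -(-data // (NDISKS - NPARITY))     # ceil: parity matches widest data column
    total = data + parity
    return total + (-total) % (NPARITY + 1)     # pad to a multiple of (nparity + 1)

offset = 0
for i in range(6):
    parity_disk = (offset // SECTOR) % NDISKS   # parity lands on the starting column
    print(f"block {i}: parity on disk {parity_disk}")
    offset += alloc_sectors(16 * 1024) * SECTOR

# A 16 KB record allocates 44 sectors, a multiple of 4, so the starting column
# never rotates and parity keeps landing on the same disk; odd-sized writes
# (metadata, other recordsizes) are what move it onto the other disks.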

Adam

On Jun 18, 2010, at 7:26 AM, Robert Milkowski wrote:

> Hi,
> 
> 
> zpool create test raidz c0t0d0 c1t0d0 c2t0d0 c3t0d0 \
>  raidz c0t1d0 c1t1d0 c2t1d0 c3t1d0 \
>  raidz c0t2d0 c1t2d0 c2t2d0 c3t2d0 \
>  raidz c0t3d0 c1t3d0 c2t3d0 c3t3d0 \
>  [...]
>  raidz c0t10d0 c1t10d0 c2t10d0 c3t10d0
> 
> zfs set atime=off test
> zfs set recordsize=16k test
> (I know...)
> 
> Now if I create one large file with filebench and simulate a randomread 
> workload with 1 or more threads, then disks on the c2 and c3 controllers get 
> about 80% more reads. This happens on both 111b and snv_134. I would rather 
> expect all of them to get about the same number of iops.
> 
> Any idea why?
> 
> 
> -- 
> Robert Milkowski
> http://milek.blogspot.com
> 


--
Adam Leventhal, Fishworks    http://blogs.sun.com/ahl



Re: [zfs-discuss] raid-z - not even iops distribution

2010-06-23 Thread Scott Meilicke
Reaching into the dusty regions of my brain, I seem to recall that since RAIDz 
does not work like a traditional RAID 5, particularly because of its variably 
sized stripes, the data may not hit all of the disks, but it will always be 
redundant. 

I apologize for not having a reference for this assertion, so I may be 
completely wrong.

I assume your hardware is recent, the controllers are on PCIe x4 buses, etc.

-Scott
-- 
This message posted from opensolaris.org


[zfs-discuss] raid-z - not even iops distribution

2010-06-18 Thread Robert Milkowski

Hi,


zpool create test raidz c0t0d0 c1t0d0 c2t0d0 c3t0d0 \
  raidz c0t1d0 c1t1d0 c2t1d0 c3t1d0 \
  raidz c0t2d0 c1t2d0 c2t2d0 c3t2d0 \
  raidz c0t3d0 c1t3d0 c2t3d0 c3t3d0 \
  [...]
  raidz c0t10d0 c1t10d0 c2t10d0 c3t10d0

zfs set atime=off test
zfs set recordsize=16k test
(I know...)

Now if I create one large file with filebench and simulate a 
randomread workload with 1 or more threads, then disks on the c2 and c3 
controllers get about 80% more reads. This happens on both 111b 
and snv_134. I would rather expect all of them to get about the same 
number of iops.


Any idea why?


--
Robert Milkowski
http://milek.blogspot.com
