Re: [zfs-discuss] raidz stripe size (not stripe width)

2010-01-05 Thread Kjetil Torgrim Homme
Brad <bene...@yahoo.com> writes:

> Hi Adam,

I'm not Adam, but I'll take a stab at it anyway.

BTW, your crossposting is a bit confusing to follow, at least when using
gmane.org.  I think it is better to stick to one mailing list anyway.

> From your picture, it looks like the data is distributed evenly
> (with the exception of parity) across each spindle, then wrapping
> around again (final 4K) - is this one single write operation or two?

It is a single write operation per device.  Actually, it may be less
than one write operation per block, since the transaction group, which
probably contains many more updates, is written out as a whole.
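
To illustrate, here is a tiny Python sketch (hypothetical names, not
the actual ZFS I/O pipeline) of how adjacent chunks bound for the same
device can be merged into a single larger write:

  def coalesce(writes):
      """writes: list of (offset, size) tuples for one device, in bytes."""
      merged = []
      for off, size in sorted(writes):
          if merged and merged[-1][0] + merged[-1][1] == off:
              last_off, last_size = merged[-1]
              merged[-1] = (last_off, last_size + size)  # extend previous write
          else:
              merged.append((off, size))
      return merged

  # three logically separate 1K chunks landing back to back on one
  # disk go out as a single 3K write:
  print(coalesce([(0, 1024), (1024, 1024), (2048, 1024)]))  # [(0, 3072)]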

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] raidz stripe size (not stripe width)

2010-01-05 Thread Richard Elling


On Jan 4, 2010, at 7:08 PM, Brad wrote:


> Hi Adam,
>
> From your picture, it looks like the data is distributed evenly
> (with the exception of parity) across each spindle, then wrapping
> around again (final 4K) - is this one single write operation or two?
>
> | P | D00 | D01 | D02 | D03 | D04 | D05 | D06 | D07 |  - one write op??
> | P | D08 | D09 | D10 | D11 | D12 | D13 | D14 | D15 |  - one write op??


One physical write op per vdev because the columns will likely be
coalesced at the vdev.  Obviously, one physical write cannot span
multiple vdevs.


> For a stripe configuration, is this what it would look like for 8K?
>
> | D00 D01 D02 D03 D04 D05 D06 D07 D08 |
> | D09 D10 D11 D12 D13 D14 D15 D16 D17 |


No.  It is very likely the entire write will go to one vdev.  Again,
this is dynamic striping, not RAID-0.  RAID-0 is defined by SNIA as "a
disk array data mapping technique in which fixed-length sequences of
virtual disk data addresses are mapped to sequences of member disk
addresses in a regular rotating pattern."  In ZFS, there is no
fixed-length sequence.

The next column is chosen approximately every MB or so.  You get the
benefit of sequential access to the media, with stochastic spreading
across vdevs as well.
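
As a rough sketch of that policy (Python, purely illustrative -- the
real allocator is the metaslab code and is more subtle than this):

  STRIDE = 1 << 20   # rotate to the next top-level vdev roughly every 1 MB

  def choose_vdev(cursor, nvdevs):
      """cursor: running count of bytes allocated so far."""
      return (cursor // STRIDE) % nvdevs

  cursor, placement = 0, []
  for _ in range(24):                    # twenty-four 128K blocks
      placement.append(choose_vdev(cursor, nvdevs=4))
      cursor += 128 * 1024
  print(placement)  # eight 0s, eight 1s, eight 2s: ~1 MB per vdev, then rotate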

When you have multiple top-level vdevs, such as multiple mirrors or
multiple raidz sets, then you get the ~1 MB spread across the top level
and the normal allocations in the sets.  In other words, any given
record should be in one set.  Again, this limits hyperspreading and
allows you to scale to very large numbers of disks.  It seems to work
reasonably well in practice.  I attempted to describe this in pictures
for my ZFS tutorials.  You can be the judge, and suggestions are
always welcome.
See slide 27 at
http://www.slideshare.net/relling/zfs-tutorial-usenix-lisa09-conference

[for the alias, I've only today succeeded in uploading the slides to
slideshare... been trying off and on for more than a month :-(]
 -- richard



[zfs-discuss] raidz stripe size (not stripe width)

2010-01-04 Thread Brad
If an 8K file system block is written on a 9-disk raidz vdev, how is the data
distributed (written) among all the devices in the vdev, since a zfs write is
one continuous IO operation?

Is it distributed evenly (1.125 KB per device)?


Re: [zfs-discuss] raidz stripe size (not stripe width)

2010-01-04 Thread Adam Leventhal
Hi Brad,

RAID-Z will carve up the 8K blocks into chunks at the granularity of the sector 
size -- today 512 bytes but soon going to 4K. In this case a 9-disk RAID-Z vdev 
will look like this:

|  P  | D00 | D01 | D02 | D03 | D04 | D05 | D06 | D07 |
|  P  | D08 | D09 | D10 | D11 | D12 | D13 | D14 | D15 |

1K per device with an additional 1K for parity.
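
To make the arithmetic concrete, here is a minimal Python sketch
(hypothetical names, assuming 512-byte sectors and single parity) that
reproduces the layout above:

  SECTOR = 512

  def raidz1_layout(block_size, ndisks):
      data_cols = ndisks - 1              # one column per row goes to parity
      nsectors = block_size // SECTOR     # 8K -> 16 data sectors
      rows = []
      for r in range(0, nsectors, data_cols):
          cols = ["P"] + ["D%02d" % i
                          for i in range(r, min(r + data_cols, nsectors))]
          rows.append("| " + " | ".join(cols) + " |")
      return rows

  for row in raidz1_layout(8192, 9):
      print(row)
  # | P | D00 | D01 | D02 | D03 | D04 | D05 | D06 | D07 |
  # | P | D08 | D09 | D10 | D11 | D12 | D13 | D14 | D15 |
  # 16 data sectors + 2 parity sectors: 8K of data plus 1K of parity,
  # i.e. 1K of data per device, as described above.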

Adam

On Jan 4, 2010, at 3:17 PM, Brad wrote:

> If an 8K file system block is written on a 9-disk raidz vdev, how is the data
> distributed (written) among all the devices in the vdev, since a zfs write is
> one continuous IO operation?
>
> Is it distributed evenly (1.125 KB per device)?


--
Adam Leventhal, Fishworks    http://blogs.sun.com/ahl



Re: [zfs-discuss] raidz stripe size (not stripe width)

2010-01-04 Thread Brad
Hi Adam,

From your picture, it looks like the data is distributed evenly (with the
exception of parity) across each spindle, then wrapping around again (final
4K) - is this one single write operation or two?

| P | D00 | D01 | D02 | D03 | D04 | D05 | D06 | D07 |  - one write op??
| P | D08 | D09 | D10 | D11 | D12 | D13 | D14 | D15 |  - one write op??

For a stripe configuration, is this what it would look like for 8K?

| D00 D01 D02 D03 D04 D05 D06 D07 D08 |
| D09 D10 D11 D12 D13 D14 D15 D16 D17 |