Re: O_DIRECT to md raid 6 is slow

2012-08-21 Thread Stan Hoeppner
On 8/21/2012 9:51 AM, Miquel van Smoorenburg wrote:
> On 08/20/2012 01:34 AM, Stan Hoeppner wrote:
>> I'm glad you jumped in David.  You made a critical statement of fact
>> below which clears some things up.  If you had stated it early on,
>> before Miquel stole the thread and moved it to LKML proper, it would
>> have short circuited a lot of this discussion.  Which is:
> 
> I'm sorry about that; that's because of the software that I use to
> follow most mailing lists. I didn't notice that the discussion was cc'ed
> to both lkml and l-r. I should fix that.

Oh, my bad.  I thought it was intentional.

Don't feel too bad about it.  When I tried to copy lkml back in on the
one message I screwed up as well.  I thought Tbird had filled in the full
address but it didn't.

>> Thus my original statement was correct, or at least half correct[1], as
>> it pertained to md/RAID6.  Then Miquel switched the discussion to
>> md/RAID5 and stated I was all wet.  I wasn't, and neither was Dave
>> Chinner.  I was simply unaware of this md/RAID5 single block write RMW
>> shortcut
> 
> Well, all I tried to say is that a small write of, say, 4K, to a
> raid5/raid6 array does not need to re-write the whole stripe (i.e.
> chunksize * nr_disks) but just 4K * nr_disks, or the RMW variant of that.

And I'm glad you did.  Before that I didn't know about these efficiency
shortcuts and exactly how md does writeback on partial stripe updates.

Even with these optimizations, a default 512KB chunk is too big for the
reasons I stated, the big one being that you'll rarely fill a full
stripe, so nearly every write will incur an RMW cycle.
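
As a rough illustration (a hypothetical 6-drive RAID6, i.e. 4 data
disks; not numbers from this thread), the full stripe width a write has
to cover, and be aligned to, before it can bypass RMW is:

  data_disks = 4                            # hypothetical 6-drive RAID6
  for chunk_kib in (32, 64, 512):
      stripe_kib = chunk_kib * data_disks   # full stripe width
      print(f"{chunk_kib:3d} KiB chunk -> {stripe_kib:4d} KiB full stripe")
  # 32 KiB chunk ->  128 KiB full stripe
  # 64 KiB chunk ->  256 KiB full stripe
  # 512 KiB chunk -> 2048 KiB full stripe

At the 512KB default only aligned writes of 2MB or more escape the
read-modify-write path, which almost nothing but large streaming writes
will manage.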

-- 
Stan



Re: O_DIRECT to md raid 6 is slow

2012-08-21 Thread Miquel van Smoorenburg

On 08/20/2012 01:34 AM, Stan Hoeppner wrote:

I'm glad you jumped in David.  You made a critical statement of fact
below which clears some things up.  If you had stated it early on,
before Miquel stole the thread and moved it to LKML proper, it would
have short circuited a lot of this discussion.  Which is:


I'm sorry about that; that's because of the software that I use to
follow most mailing lists. I didn't notice that the discussion was cc'ed
to both lkml and l-r. I should fix that.



Thus my original statement was correct, or at least half correct[1], as
it pertained to md/RAID6.  Then Miquel switched the discussion to
md/RAID5 and stated I was all wet.  I wasn't, and neither was Dave
Chinner.  I was simply unaware of this md/RAID5 single block write RMW
shortcut


Well, all I tried to say is that a small write of, say, 4K, to a 
raid5/raid6 array does not need to re-write the whole stripe (i.e. 
chunksize * nr_disks) but just 4K * nr_disks, or the RMW variant of that.


Mike.


Re: O_DIRECT to md raid 6 is slow

2012-08-20 Thread David Brown

On 20/08/2012 02:01, NeilBrown wrote:

On Sun, 19 Aug 2012 18:34:28 -0500 Stan Hoeppner 
wrote:


Since we are trying to set the record straight


md/RAID6 must read all devices in a RMW cycle.


md/RAID6 must read all data devices (i.e. not parity devices) which it is not
going to write to, in an RMW cycle (which the code actually calls RCW -
reconstruct-write).



md/RAID5 takes a shortcut for single block writes, and must only read
one drive for the RMW cycle.


md/RAID5 uses an alternate mechanism when the number of data blocks that need
to be written is less than half the number of data blocks in a stripe.  In
this alternate mechanism (which the code calls RMW - read-modify-write),
md/RAID5 reads all the blocks that it is about to write to, plus the parity
block.  It then computes the new parity and writes it out along with the new
data.



I've learned something here too - I thought this mechanism was only used 
for a single block write.  Thanks for the correction, Neil.


If you (or anyone else) are ever interested in implementing the same 
thing in raid6, the maths is not actually too bad (now that I've thought 
about it).  (I understand the theory here, but I'm afraid I don't have 
the experience with kernel programming to do the implementation.)


To change a few data blocks, you need to read in the old data blocks 
(Da, Db, etc.) and the old parities (P, Q).


Calculate the xor differences Xa = Da + D'a, Xb = Db + D'b, etc.

The new P parity is P' = P + Xa + Xb +...

The new Q parity is Q' = Q + (g^a).Xa + (g^b).Xb + ...
The power series there is just the normal raid6 Q-parity calculation 
with most entries set to 0, and the Xa, Xb, etc. in the appropriate spots.


If the raid6 Q-parity function already has short-cuts for handling zero 
entries (I haven't looked, but the mechanism might be in place to 
slightly speed up dual-failure recovery), then all the pieces are in place.
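
For anyone who wants to play with that arithmetic, here's a rough
byte-level sketch of the incremental update (illustration only, using
the usual RAID6 generator g = 2 over GF(2^8) with polynomial 0x11d; this
is not the kernel code):

  def gf_mul(a, b, poly=0x11d):
      # carry-less multiply with reduction, one bit of b at a time
      r = 0
      while b:
          if b & 1:
              r ^= a
          a <<= 1
          if a & 0x100:
              a ^= poly
          b >>= 1
      return r

  def gf_pow(g, n):
      r = 1
      for _ in range(n):
          r = gf_mul(r, g)
      return r

  # one byte per data "disk", plus P and Q computed from scratch
  data = [0x11, 0x22, 0x33, 0x44]
  P = Q = 0
  for i, d in enumerate(data):
      P ^= d
      Q ^= gf_mul(gf_pow(2, i), d)

  # rewrite disk a = 1 and update P and Q incrementally
  a, new = 1, 0x99
  X = data[a] ^ new                     # Xa = Da + D'a
  P_new = P ^ X                         # P' = P + Xa
  Q_new = Q ^ gf_mul(gf_pow(2, a), X)   # Q' = Q + (g^a).Xa
  data[a] = new

  # cross-check against a full recomputation
  P_chk = Q_chk = 0
  for i, d in enumerate(data):
      P_chk ^= d
      Q_chk ^= gf_mul(gf_pow(2, i), d)
  assert (P_new, Q_new) == (P_chk, Q_chk)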





Re: O_DIRECT to md raid 6 is slow

2012-08-19 Thread NeilBrown
On Sun, 19 Aug 2012 18:34:28 -0500 Stan Hoeppner 
wrote:

> On 8/19/2012 9:01 AM, David Brown wrote:
> > I'm sort of jumping in to this thread, so my apologies if I repeat
> > things other people have said already.
> 
> I'm glad you jumped in David.  You made a critical statement of fact
> below which clears some things up.  If you had stated it early on,
> before Miquel stole the thread and moved it to LKML proper, it would
> have short circuited a lot of this discussion.  Which is:
> 
> > AFAIK, there is scope for a few performance optimisations in raid6.  One
> > is that for small writes which only need to change one block, raid5 uses
> > a "short-cut" RMW cycle (read the old data block, read the old parity
> > block, calculate the new parity block, write the new data and parity
> > blocks).  A similar short-cut could be implemented in raid6, though it
> > is not clear how much a difference it would really make.
> 
> Thus my original statement was correct, or at least half correct[1], as
> it pertained to md/RAID6.  Then Miquel switched the discussion to
> md/RAID5 and stated I was all wet.  I wasn't, and neither was Dave
> Chinner.  I was simply unaware of this md/RAID5 single block write RMW
> shortcut.  I'm copying lkml proper on this simply to set the record
> straight.  Not that anyone was paying attention, but it needs to be in
> the same thread in the archives.  The takeaway:
> 

Since we are trying to set the record straight

> md/RAID6 must read all devices in a RMW cycle.

md/RAID6 must read all data devices (i.e. not parity devices) which it is not
going to write to, in an RMW cycle (which the code actually calls RCW -
reconstruct-write).

> 
> md/RAID5 takes a shortcut for single block writes, and must only read
> one drive for the RMW cycle.

md/RAID5 uses an alternate mechanism when the number of data blocks that need
to be written is less than half the number of data blocks in a stripe.  In
this alternate mechanism (which the code calls RMW - read-modify-write),
md/RAID5 reads all the blocks that it is about to write to, plus the parity
block.  It then computes the new parity and writes it out along with the new
data.
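
To see where that "less than half" threshold comes from, here is a toy
read-count model (a rough sketch that ignores the stripe cache; it is
not the md code) for a stripe with k data blocks of which w are being
rewritten:

  def rmw_reads(k, w):
      return w + 1    # old copies of the w blocks being written, plus parity

  def rcw_reads(k, w):
      return k - w    # every data block we are not writing

  k = 8               # hypothetical stripe with 8 data blocks
  for w in range(1, k + 1):
      better = "rmw" if rmw_reads(k, w) < rcw_reads(k, w) else "rcw"
      print(f"writing {w}/{k} blocks -> {better}")
  # rmw wins while w + 1 < k - w, i.e. while w is (roughly) less than half of k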

> 
> [1]The only thing that's not clear at this point is if md/RAID6 also
> always writes back all chunks during RMW, or only the chunk that has
> changed.

Do you seriously imagine anyone would write code to write out data which it
is known has not changed?  Sad. :-)

NeilBrown





Re: O_DIRECT to md raid 6 is slow

2012-08-19 Thread Stan Hoeppner
On 8/19/2012 9:01 AM, David Brown wrote:
> I'm sort of jumping in to this thread, so my apologies if I repeat
> things other people have said already.

I'm glad you jumped in David.  You made a critical statement of fact
below which clears some things up.  If you had stated it early on,
before Miquel stole the thread and moved it to LKML proper, it would
have short circuited a lot of this discussion.  Which is:

> AFAIK, there is scope for a few performance optimisations in raid6.  One
> is that for small writes which only need to change one block, raid5 uses
> a "short-cut" RMW cycle (read the old data block, read the old parity
> block, calculate the new parity block, write the new data and parity
> blocks).  A similar short-cut could be implemented in raid6, though it
> is not clear how much a difference it would really make.

Thus my original statement was correct, or at least half correct[1], as
it pertained to md/RAID6.  Then Miquel switched the discussion to
md/RAID5 and stated I was all wet.  I wasn't, and neither was Dave
Chinner.  I was simply unaware of this md/RAID5 single block write RMW
shortcut.  I'm copying lkml proper on this simply to set the record
straight.  Not that anyone was paying attention, but it needs to be in
the same thread in the archives.  The takeaway:

md/RAID6 must read all devices in a RMW cycle.

md/RAID5 takes a shortcut for single block writes, and must only read
one drive for the RMW cycle.

[1]The only thing that's not clear at this point is if md/RAID6 also
always writes back all chunks during RMW, or only the chunk that has
changed.

-- 
Stan



Re: O_DIRECT to md raid 6 is slow

2012-08-17 Thread Miquel van Smoorenburg

On 08/17/2012 09:31 AM, Stan Hoeppner wrote:

On 8/16/2012 4:50 PM, Miquel van Smoorenburg wrote:

I did a simple test:

* created a 1G partition on 3 seperate disks
* created a md raid5 array with 512K chunksize:
   mdadm -C /dev/md0 -l 5 -c $((1024*512)) -n 3 /dev/sdb1 /dev/sdc1
/dev/sdd1
* ran disk monitoring using 'iostat -k 5 /dev/sdb1 /dev/sdc1 /dev/sdd1'
* wrote a single 4K block:
   dd if=/dev/zero bs=4K count=1 oflag=direct seek=30 of=/dev/md0

Output from iostat over the period in which the 4K write was done. Look
at kB read and kB written:

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdb1              0.60         0.00         1.60          0          8
sdc1              0.60         0.80         0.80          4          4
sdd1              0.60         0.00         1.60          0          8

As you can see, a single 4K read, and a few writes. You see a few blocks
more written than you'd expect because the superblock is updated too.


I'm no dd expert, but this looks like you're simply writing a 4KB block
to a new stripe, using an offset, but not to an existing stripe, as the
array is in a virgin state.  So it doesn't appear this test is going to
trigger RMW.  Don't you now need to do another write in the same
stripe to trigger RMW?  Maybe I'm just reading this wrong.


That shouldn't matter, but that is easily checked of course, by writing
some random data first, then doing the dd 4K write, also with
random data, somewhere in the same area:


# dd if=/dev/urandom bs=1M count=3 of=/dev/md0
3+0 records in
3+0 records out
3145728 bytes (3.1 MB) copied, 0.794494 s, 4.0 MB/s

Now the first 6 chunks are filled with random data, let's write 4K
somewhere in there:


# dd if=/dev/urandom bs=4k count=1 seek=25 of=/dev/md0
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.10149 s, 40.4 kB/s

Output from iostat over the period in which the 4K write was done:

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdb1              0.60         0.00         1.60          0          8
sdc1              0.60         0.80         0.80          4          4
sdd1              0.60         0.00         1.60          0          8

Mike.


Re: O_DIRECT to md raid 6 is slow

2012-08-17 Thread Stan Hoeppner
On 8/16/2012 4:50 PM, Miquel van Smoorenburg wrote:
> On 16-08-12 1:05 PM, Stan Hoeppner wrote:
>> On 8/15/2012 6:07 PM, Miquel van Smoorenburg wrote:
>>> Ehrm no. If you modify, say, a 4K block on a RAID5 array, you just have
>>> to read that 4K block, and the corresponding 4K block on the
>>> parity drive, recalculate parity, and write back 4K of data and 4K
>>> of parity. (read|read) modify (write|write). You do not have to
>>> do I/O in chunksize, ehm, chunks, and you do not have to rmw all disks.
>>
>> See:  http://www.spinics.net/lists/xfs/msg12627.html
>>
>> Dave usually knows what he's talking about, and I didn't see Neil nor
>> anyone else correcting him on his description of md RMW behavior.
> 
> Well he's wrong, or you're interpreting it incorrectly.
> 
> I did a simple test:
> 
> * created a 1G partition on 3 seperate disks
> * created a md raid5 array with 512K chunksize:
>   mdadm -C /dev/md0 -l 5 -c $((1024*512)) -n 3 /dev/sdb1 /dev/sdc1
> /dev/sdd1
> * ran disk monitoring using 'iostat -k 5 /dev/sdb1 /dev/sdc1 /dev/sdd1'
> * wrote a single 4K block:
>   dd if=/dev/zero bs=4K count=1 oflag=direct seek=30 of=/dev/md0
> 
> Output from iostat over the period in which the 4K write was done. Look
> at kB read and kB written:
> 
> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
> sdb1              0.60         0.00         1.60          0          8
> sdc1              0.60         0.80         0.80          4          4
> sdd1              0.60         0.00         1.60          0          8
> 
> As you can see, a single 4K read, and a few writes. You see a few blocks
> more written than you'd expect because the superblock is updated too.

I'm no dd expert, but this looks like you're simply writing a 4KB block
to a new stripe, using an offset, but not to an existing stripe, as the
array is in a virgin state.  So it doesn't appear this test is going to
trigger RMW.  Don't you now need to do another write in the same
stripe to trigger RMW?  Maybe I'm just reading this wrong.

-- 
Stan



Re: O_DIRECT to md raid 6 is slow

2012-08-16 Thread Miquel van Smoorenburg

On 16-08-12 1:05 PM, Stan Hoeppner wrote:

On 8/15/2012 6:07 PM, Miquel van Smoorenburg wrote:

Ehrm no. If you modify, say, a 4K block on a RAID5 array, you just have
to read that 4K block, and the corresponding 4K block on the
parity drive, recalculate parity, and write back 4K of data and 4K
of parity. (read|read) modify (write|write). You do not have to
do I/O in chunksize, ehm, chunks, and you do not have to rmw all disks.


See:  http://www.spinics.net/lists/xfs/msg12627.html

Dave usually knows what he's talking about, and I didn't see Neil nor
anyone else correcting him on his description of md RMW behavior.


Well he's wrong, or you're interpreting it incorrectly.

I did a simple test:

* created a 1G partition on 3 seperate disks
* created a md raid5 array with 512K chunksize:
  mdadm -C /dev/md0 -l 5 -c $((1024*512)) -n 3 /dev/sdb1 /dev/sdc1 
/dev/sdd1

* ran disk monitoring using 'iostat -k 5 /dev/sdb1 /dev/sdc1 /dev/sdd1'
* wrote a single 4K block:
  dd if=/dev/zero bs=4K count=1 oflag=direct seek=30 of=/dev/md0

Output from iostat over the period in which the 4K write was done. Look 
at kB read and kB written:


Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdb1              0.60         0.00         1.60          0          8
sdc1              0.60         0.80         0.80          4          4
sdd1              0.60         0.00         1.60          0          8

As you can see, a single 4K read, and a few writes. You see a few blocks 
more written than you'd expect because the superblock is updated too.


Mike.


Re: O_DIRECT to md raid 6 is slow

2012-08-16 Thread Stan Hoeppner
On 8/15/2012 6:07 PM, Miquel van Smoorenburg wrote:
> In article  you write:
>> It's time to blow away the array and start over.  You're already
>> misaligned, and a 512KB chunk is insanely unsuitable for parity RAID,
>> but for a handful of niche all streaming workloads with little/no
>> rewrite, such as video surveillance or DVR workloads.
>>
>> Yes, 512KB is the md 1.2 default.  And yes, it is insane.  Here's why:
>> Deleting a single file changes only a few bytes of directory metadata.
>> With your 6 drive md/RAID6 with 512KB chunk, you must read 3MB of data,
>> modify the directory block in question, calculate parity, then write out
>> 3MB of data to rust.  So you consume 6MB of bandwidth to write less than
>> a dozen bytes.  With a 12 drive RAID6 that's 12MB of bandwidth to modify
>> a few bytes of metadata.  Yes, insane.
> 
> Ehrm no. If you modify, say, a 4K block on a RAID5 array, you just have
> to read that 4K block, and the corresponding 4K block on the
> parity drive, recalculate parity, and write back 4K of data and 4K
> of parity. (read|read) modify (write|write). You do not have to
> do I/O in chunksize, ehm, chunks, and you do not have to rmw all disks.

See:  http://www.spinics.net/lists/xfs/msg12627.html

Dave usually knows what he's talking about, and I didn't see Neil nor
anyone else correcting him on his description of md RMW behavior.  What
I stated above is pretty much exactly what Dave stated, but for the fact
I got the RMW read bytes wrong--should be 2MB/3MB for a 6 drive md/RAID6
and 5MB/6MB for 12 drives.

>> Parity RAID sucks in general because of RMW, but it is orders of
>> magnitude worse when one chooses to use an insane chunk size to boot,
>> and especially so with a large drive count.
[snip]
> Also, 256K or 512K isn't all that big nowadays, there's not much
> latency difference between reading 32K or 512K..

You're forgetting 3 very important things:

1.  All filesystems have metadata
2.  All (worth using) filesystems have a metadata journal
3.  All workloads include some, if not major, metadata operations

When writing journal and directory metadata there is a huge difference
between a 32KB and a 512KB chunk, especially as the drive count in the
array increases.  Rarely does a filesystem pack enough journal
operations into a single writeout to fill a 512KB stripe, let alone a
4MB stripe.  With a 32KB chunk you see full stripe width journal writes
frequently, minimizing the number of RMW journal writes, even on arrays
with up to 16 data spindles (an 18-drive RAID6).  Using a 512KB chunk
causes most journal writes to be partial stripe writes, triggering RMW
on most of them.  The same is true for directory metadata writes.
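
To put rough numbers on that (a hypothetical 6-drive RAID6 with 4 data
spindles and a 256KB log writeout; these are illustrative figures, not
measurements from this thread):

  # split one journal writeout into full stripes plus a partial tail;
  # a non-zero tail means that stripe goes through an RMW/RCW cycle
  def split_writeout(writeout_kib, chunk_kib, data_disks):
      stripe_kib = chunk_kib * data_disks
      return divmod(writeout_kib, stripe_kib)

  for chunk_kib in (32, 512):
      print(chunk_kib, "KiB chunk:", split_writeout(256, chunk_kib, 4))
  # 32 KiB chunk: (2, 0)     -> two full-stripe writes, no RMW
  # 512 KiB chunk: (0, 256)  -> no full stripe; the whole writeout is an RMW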

Everyone knows that parity RAID sucks for anything but purely streaming
workloads with little metadata.  With most/all other workloads, using a
large chunk size, such as the md metadata 1.2 default of 512KB, with
parity RAID, simply makes it much worse, whether the RMW cycle affects
all disks or just one data disk and one parity disk.

>> Recreate your array, partition aligned, and manually specify a sane
>> chunk size of something like 32KB.  You'll be much happier with real
>> workloads.
> 
> Aligning is a good idea, 

Understatement of the century.  Just as critical, if not more so: FS
stripe alignment is mandatory with parity RAID, otherwise even full
stripe writeouts can/will trigger RMW.

> and on modern distributions partitions,
> LVM lv's etc are generally created with 1MB alignment. But using
> a small chunksize like 32K? That depends on the workload, but
> in most cases I'd advise against it.

People should ignore your advice in this regard.  A small chunk size is
optimal for nearly all workloads on a parity array, for the reasons I
stated above.  It's the large chunk that is extremely workload
dependent; as noted, it only fits well with low-metadata streaming
workloads.

-- 
Stan



Re: O_DIRECT to md raid 6 is slow

2012-08-16 Thread Roman Mamedov
On Wed, 15 Aug 2012 18:50:44 -0500
Stan Hoeppner  wrote:

> TTBOMK there are two, and only two, COW filesystems in existence:  ZFS and 
> BTRFS.

There is also NILFS2: http://www.nilfs.org/en/
And in general, any https://en.wikipedia.org/wiki/Log-structured_file_system
is COW by design, but afaik of those only NILFS is also in the mainline Linux
kernel AND is not aimed just for some niche like flash-based devices, but for
general-purpose usage.

-- 
With respect,
Roman

~~~
"Stallman had a printer,
with code he could not see.
So he began to tinker,
and set the software free."




Re: O_DIRECT to md raid 6 is slow

2012-08-15 Thread Andy Lutomirski
On Wed, Aug 15, 2012 at 4:50 PM, Stan Hoeppner  wrote:
> On 8/15/2012 5:10 PM, Andy Lutomirski wrote:
>> On Wed, Aug 15, 2012 at 3:00 PM, Stan Hoeppner  
>> wrote:
>>> On 8/15/2012 12:57 PM, Andy Lutomirski wrote:
 On Wed, Aug 15, 2012 at 4:50 AM, John Robinson
  wrote:
> On 15/08/2012 01:49, Andy Lutomirski wrote:
>>
>> If I do:
>> # dd if=/dev/zero of=/dev/md0p1 bs=8M
>
> [...]
>>
>> Grr.  I thought the bad old days of filesystem and related defaults
>> sucking were over.
>
> The previous md chunk default of 64KB wasn't horribly bad, though still
> maybe a bit high for a lot of common workloads.  I didn't have eyes/ears
> on the discussion and/or testing process that led to the 'new' 512KB
> default.  Obviously something went horribly wrong here.  512KB isn't a
> show stopper as a default for 0/1/10, but is 8-16 times too large for
> parity RAID.
>
>> cryptsetup aligns sanely these days, xfs is
>> sensible, etc.
>
> XFS won't align with the 512KB chunk default of metadata 1.2.  The
> largest XFS journal stripe unit (su, i.e. the chunk) is 256KB, and even that
> isn't recommended.  Thus mkfs.xfs throws an error due to the 512KB
> stripe.  See the md and xfs archives for more details, specifically Dave
> Chinner's colorful comments on the md 512KB default.

Heh -- that's why the math didn't make any sense :)

>
>> wtf?  Why is there no sensible filesystem for
>> huge disks?  zfs can't cp --reflink and has all kinds of source
>> availability and licensing issues, xfs can't dedupe at all, and btrfs
>> isn't nearly stable enough.
>
> Deduplication isn't a responsibility of a filesystem.  TTBOMK there are
> two, and only two, COW filesystems in existence:  ZFS and BTRFS.  And
> these are the only two to offer a native dedupe capability.  They did it
> because they could, with COW, not necessarily because they *should*.
> There are dozens of other single node, cluster, and distributed
> filesystems in use today and none of them support COW, and thus none
> support dedup.  So to *expect* a 'sensible' filesystem to include dedupe
> is wishful thinking at best.

I should clarify my rant for the record.  I don't care about in-fs
dedupe.  I want COW so userspace can dedupe and generally replace
hardlinks with sensible cowlinks.  I'm also working on some fun tools
that *require* reflinks for anything resembling decent performance.

--Andy


Re: O_DIRECT to md raid 6 is slow

2012-08-15 Thread Stan Hoeppner
On 8/15/2012 5:10 PM, Andy Lutomirski wrote:
> On Wed, Aug 15, 2012 at 3:00 PM, Stan Hoeppner  wrote:
>> On 8/15/2012 12:57 PM, Andy Lutomirski wrote:
>>> On Wed, Aug 15, 2012 at 4:50 AM, John Robinson
>>>  wrote:
 On 15/08/2012 01:49, Andy Lutomirski wrote:
>
> If I do:
> # dd if=/dev/zero of=/dev/md0p1 bs=8M

 [...]

> It looks like md isn't recognizing that I'm writing whole stripes when
> I'm in O_DIRECT mode.


 I see your md device is partitioned. Is the partition itself 
 stripe-aligned?
>>>
>>> Crud.
>>>
>>> md0 : active raid6 sdg1[5] sdf1[4] sde1[3] sdd1[2] sdc1[1] sdb1[0]
>>>   11720536064 blocks super 1.2 level 6, 512k chunk, algorithm 2
>>> [6/6] [UU]
>>>
>>> IIUC this means that I/O should be aligned on 2MB boundaries (512k
>>> chunk * 4 non-parity disks).  gdisk put my partition on a 2048 sector
>>> (i.e. 1MB) boundary.
>>
>> It's time to blow away the array and start over.  You're already
>> misaligned, and a 512KB chunk is insanely unsuitable for parity RAID,
>> but for a handful of niche all streaming workloads with little/no
>> rewrite, such as video surveillance or DVR workloads.
>>
>> Yes, 512KB is the md 1.2 default.  And yes, it is insane.  Here's why:
>> Deleting a single file changes only a few bytes of directory metadata.
>> With your 6 drive md/RAID6 with 512KB chunk, you must read 3MB of data,
>> modify the directory block in question, calculate parity, then write out
>> 3MB of data to rust.  So you consume 6MB of bandwidth to write less than
>> a dozen bytes.  With a 12 drive RAID6 that's 12MB of bandwidth to modify
>> a few bytes of metadata.  Yes, insane.
> 
> Grr.  I thought the bad old days of filesystem and related defaults
> sucking were over.  

The previous md chunk default of 64KB wasn't horribly bad, though still
maybe a bit high for a lot of common workloads.  I didn't have eyes/ears
on the discussion and/or testing process that led to the 'new' 512KB
default.  Obviously something went horribly wrong here.  512KB isn't a
show stopper as a default for 0/1/10, but is 8-16 times too large for
parity RAID.

> cryptsetup aligns sanely these days, xfs is
> sensible, etc.  

XFS won't align with the 512KB chunk default of metadata 1.2.  The
largest XFS journal stripe unit (su, i.e. the chunk) is 256KB, and even that
isn't recommended.  Thus mkfs.xfs throws an error due to the 512KB
stripe.  See the md and xfs archives for more details, specifically Dave
Chinner's colorful comments on the md 512KB default.

> wtf?  Why is there no sensible filesystem for
> huge disks?  zfs can't cp --reflink and has all kinds of source
> availability and licensing issues, xfs can't dedupe at all, and btrfs
> isn't nearly stable enough.

Deduplication isn't a responsibility of a filesystem.  TTBOMK there are
two, and only two, COW filesystems in existence:  ZFS and BTRFS.  And
these are the only two to offer a native dedupe capability.  They did it
because they could, with COW, not necessarily because they *should*.
There are dozens of other single node, cluster, and distributed
filesystems in use today and none of them support COW, and thus none
support dedup.  So to *expect* a 'sensible' filesystem to include dedupe
is wishful thinking at best.

> Anyhow, I'll try the patch from Wu Fengguang.  There's still a bug here...

Always one somewhere.

-- 
Stan



Re: O_DIRECT to md raid 6 is slow

2012-08-15 Thread Miquel van Smoorenburg
In article  you write:
>It's time to blow away the array and start over.  You're already
>misaligned, and a 512KB chunk is insanely unsuitable for parity RAID,
>but for a handful of niche all streaming workloads with little/no
>rewrite, such as video surveillance or DVR workloads.
>
>Yes, 512KB is the md 1.2 default.  And yes, it is insane.  Here's why:
>Deleting a single file changes only a few bytes of directory metadata.
>With your 6 drive md/RAID6 with 512KB chunk, you must read 3MB of data,
>modify the directory block in question, calculate parity, then write out
>3MB of data to rust.  So you consume 6MB of bandwidth to write less than
>a dozen bytes.  With a 12 drive RAID6 that's 12MB of bandwidth to modify
>a few bytes of metadata.  Yes, insane.

Ehrm no. If you modify, say, a 4K block on a RAID5 array, you just have
to read that 4K block, and the corresponding 4K block on the
parity drive, recalculate parity, and write back 4K of data and 4K
of parity. (read|read) modify (write|write). You do not have to
do I/O in chunksize, ehm, chunks, and you do not have to rmw all disks.

>Parity RAID sucks in general because of RMW, but it is orders of
>magnitude worse when one chooses to use an insane chunk size to boot,
>and especially so with a large drive count.

If you have a lot of parallel readers (readers >> disks) then
you want chunk sizes of about 2*mean_read_size, so that for each
read you just have 1 seek on 1 disk.

If you have just a few readers (readers << disks) that read
really large blocks then you want a small chunk size to keep
all disks busy.

If you have no readers and just writers and you write large
blocks, then you might want a small chunk size too, so that
you can write data+parity over the stripe in one go, bypassing rmw.
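
Spelled out as code, those rules of thumb look something like this (a
literal transcription of the advice above, with an arbitrary value
standing in for "small"; it is not anything md computes for you):

  def suggest_chunk_kib(readers, disks, mean_read_kib, mostly_large_io):
      if readers > disks and not mostly_large_io:
          return 2 * mean_read_kib   # each read costs one seek on one disk
      # few readers / large sequential I/O: keep all disks busy and let
      # full-stripe writes bypass RMW
      return 64                      # "small"; the exact value is workload-dependent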

Also, 256K or 512K isn't all that big nowadays, there's not much
latency difference between reading 32K or 512K..

>Recreate your array, partition aligned, and manually specify a sane
>chunk size of something like 32KB.  You'll be much happier with real
>workloads.

Aligning is a good idea, and on modern distributions partitions,
LVM lv's etc are generally created with 1MB alignment. But using
a small chunksize like 32K? That depends on the workload, but
in most cases I'd advise against it.

Mike.


Re: O_DIRECT to md raid 6 is slow

2012-08-15 Thread Andy Lutomirski
On Wed, Aug 15, 2012 at 3:00 PM, Stan Hoeppner  wrote:
> On 8/15/2012 12:57 PM, Andy Lutomirski wrote:
>> On Wed, Aug 15, 2012 at 4:50 AM, John Robinson
>>  wrote:
>>> On 15/08/2012 01:49, Andy Lutomirski wrote:

 If I do:
 # dd if=/dev/zero of=/dev/md0p1 bs=8M
>>>
>>> [...]
>>>
 It looks like md isn't recognizing that I'm writing whole stripes when
 I'm in O_DIRECT mode.
>>>
>>>
>>> I see your md device is partitioned. Is the partition itself stripe-aligned?
>>
>> Crud.
>>
>> md0 : active raid6 sdg1[5] sdf1[4] sde1[3] sdd1[2] sdc1[1] sdb1[0]
>>   11720536064 blocks super 1.2 level 6, 512k chunk, algorithm 2
>> [6/6] [UU]
>>
>> IIUC this means that I/O should be aligned on 2MB boundaries (512k
>> chunk * 4 non-parity disks).  gdisk put my partition on a 2048 sector
>> (i.e. 1MB) boundary.
>
> It's time to blow away the array and start over.  You're already
> misaligned, and a 512KB chunk is insanely unsuitable for parity RAID,
> but for a handful of niche all streaming workloads with little/no
> rewrite, such as video surveillance or DVR workloads.
>
> Yes, 512KB is the md 1.2 default.  And yes, it is insane.  Here's why:
> Deleting a single file changes only a few bytes of directory metadata.
> With your 6 drive md/RAID6 with 512KB chunk, you must read 3MB of data,
> modify the directory block in question, calculate parity, then write out
> 3MB of data to rust.  So you consume 6MB of bandwidth to write less than
> a dozen bytes.  With a 12 drive RAID6 that's 12MB of bandwidth to modify
> a few bytes of metadata.  Yes, insane.

Grr.  I thought the bad old days of filesystem and related defaults
sucking were over.  cryptsetup aligns sanely these days, xfs is
sensible, etc.  wtf?  Why is there no sensible filesystem for
huge disks?  zfs can't cp --reflink and has all kinds of source
availability and licensing issues, xfs can't dedupe at all, and btrfs
isn't nearly stable enough.

Anyhow, I'll try the patch from Wu Fengguang.  There's still a bug here...

--Andy


Re: O_DIRECT to md raid 6 is slow

2012-08-15 Thread Stan Hoeppner
On 8/15/2012 12:57 PM, Andy Lutomirski wrote:
> On Wed, Aug 15, 2012 at 4:50 AM, John Robinson
>  wrote:
>> On 15/08/2012 01:49, Andy Lutomirski wrote:
>>>
>>> If I do:
>>> # dd if=/dev/zero of=/dev/md0p1 bs=8M
>>
>> [...]
>>
>>> It looks like md isn't recognizing that I'm writing whole stripes when
>>> I'm in O_DIRECT mode.
>>
>>
>> I see your md device is partitioned. Is the partition itself stripe-aligned?
> 
> Crud.
> 
> md0 : active raid6 sdg1[5] sdf1[4] sde1[3] sdd1[2] sdc1[1] sdb1[0]
>   11720536064 blocks super 1.2 level 6, 512k chunk, algorithm 2
> [6/6] [UU]
> 
> IIUC this means that I/O should be aligned on 2MB boundaries (512k
> chunk * 4 non-parity disks).  gdisk put my partition on a 2048 sector
> (i.e. 1MB) boundary.

It's time to blow away the array and start over.  You're already
misaligned, and a 512KB chunk is insanely unsuitable for parity RAID,
but for a handful of niche all streaming workloads with little/no
rewrite, such as video surveillance or DVR workloads.

Yes, 512KB is the md 1.2 default.  And yes, it is insane.  Here's why:
Deleting a single file changes only a few bytes of directory metadata.
With your 6 drive md/RAID6 with 512KB chunk, you must read 3MB of data,
modify the directory block in question, calculate parity, then write out
3MB of data to rust.  So you consume 6MB of bandwidth to write less than
a dozen bytes.  With a 12 drive RAID6 that's 12MB of bandwidth to modify
a few bytes of metadata.  Yes, insane.

Parity RAID sucks in general because of RMW, but it is orders of
magnitude worse when one chooses to use an insane chunk size to boot,
and especially so with a large drive count.

It seems people tend to use large chunk sizes because array
initialization is a bit faster, and running block x-fer "tests" with dd
buffered sequential reads/writes makes their Levi's expand.  Then they
are confused when their actual workloads are horribly slow.

Recreate your array, partition aligned, and manually specify a sane
chunk size of something like 32KB.  You'll be much happier with real
workloads.

-- 
Stan




Re: O_DIRECT to md raid 6 is slow

2012-08-15 Thread Andy Lutomirski
On Wed, Aug 15, 2012 at 4:50 AM, John Robinson
 wrote:
> On 15/08/2012 01:49, Andy Lutomirski wrote:
>>
>> If I do:
>> # dd if=/dev/zero of=/dev/md0p1 bs=8M
>
> [...]
>
>> It looks like md isn't recognizing that I'm writing whole stripes when
>> I'm in O_DIRECT mode.
>
>
> I see your md device is partitioned. Is the partition itself stripe-aligned?

Crud.

md0 : active raid6 sdg1[5] sdf1[4] sde1[3] sdd1[2] sdc1[1] sdb1[0]
  11720536064 blocks super 1.2 level 6, 512k chunk, algorithm 2
[6/6] [UU]

IIUC this means that I/O should be aligned on 2MB boundaries (512k
chunk * 4 non-parity disks).  gdisk put my partition on a 2048 sector
(i.e. 1MB) boundary.
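
For what it's worth, a quick way to sanity-check that arithmetic (a
throwaway snippet with the offsets hard-coded rather than read from the
system):

  chunk_kib, data_disks = 512, 4
  stripe_sectors = chunk_kib * 2 * data_disks   # 4096 512-byte sectors = 2MB
  part_start_sectors = 2048                     # gdisk's default 1MB boundary
  print(part_start_sectors % stripe_sectors)    # 2048 -> starts mid-stripe, 1MB short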

Sadly, /sys/block/md0/md0p1/alignment_offset reports 0 (instead of 1MB).

Fixing this has no effect, though.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: O_DIRECT to md raid 6 is slow

2012-08-15 Thread John Robinson

On 15/08/2012 01:49, Andy Lutomirski wrote:

If I do:
# dd if=/dev/zero of=/dev/md0p1 bs=8M

[...]

It looks like md isn't recognizing that I'm writing whole stripes when
I'm in O_DIRECT mode.


I see your md device is partitioned. Is the partition itself stripe-aligned?

Cheers,

John.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: O_DIRECT to md raid 6 is slow

2012-08-15 Thread Andy Lutomirski
On Wed, Aug 15, 2012 at 3:00 PM, Stan Hoeppner s...@hardwarefreak.com wrote:
 On 8/15/2012 12:57 PM, Andy Lutomirski wrote:
 On Wed, Aug 15, 2012 at 4:50 AM, John Robinson
 john.robin...@anonymous.org.uk wrote:
 On 15/08/2012 01:49, Andy Lutomirski wrote:

 If I do:
 # dd if=/dev/zero of=/dev/md0p1 bs=8M

 [...]

 It looks like md isn't recognizing that I'm writing whole stripes when
 I'm in O_DIRECT mode.


 I see your md device is partitioned. Is the partition itself stripe-aligned?

 Crud.

 md0 : active raid6 sdg1[5] sdf1[4] sde1[3] sdd1[2] sdc1[1] sdb1[0]
   11720536064 blocks super 1.2 level 6, 512k chunk, algorithm 2
 [6/6] [UUUUUU]

 IIUC this means that I/O should be aligned on 2MB boundaries (512k
 chunk * 4 non-parity disks).  gdisk put my partition on a 2048 sector
 (i.e. 1MB) boundary.

 It's time to blow away the array and start over.  You're already
 misaligned, and a 512KB chunk is insanely unsuitable for parity RAID,
 but for a handful of niche all streaming workloads with little/no
 rewrite, such as video surveillance or DVR workloads.

 Yes, 512KB is the md 1.2 default.  And yes, it is insane.  Here's why:
 Deleting a single file changes only a few bytes of directory metadata.
 With your 6 drive md/RAID6 with 512KB chunk, you must read 3MB of data,
 modify the directory block in question, calculate parity, then write out
 3MB of data to rust.  So you consume 6MB of bandwidth to write less than
 a dozen bytes.  With a 12 drive RAID6 that's 12MB of bandwidth to modify
 a few bytes of metadata.  Yes, insane.

Grr.  I thought the bad old days of filesystem and related defaults
sucking were over.  cryptsetup aligns sanely these days, xfs is
 sensible, etc.  wtf?  <rant>Why is there no sensible filesystem for
 huge disks?  zfs can't cp --reflink and has all kinds of source
 availability and licensing issues, xfs can't dedupe at all, and btrfs
 isn't nearly stable enough.</rant>

Anyhow, I'll try the patch from Wu Fengguang.  There's still a bug here...

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: O_DIRECT to md raid 6 is slow

2012-08-15 Thread Miquel van Smoorenburg
In article xs4all.502c1c01.1040...@hardwarefreak.com you write:
It's time to blow away the array and start over.  You're already
misaligned, and a 512KB chunk is insanely unsuitable for parity RAID,
but for a handful of niche all streaming workloads with little/no
rewrite, such as video surveillance or DVR workloads.

Yes, 512KB is the md 1.2 default.  And yes, it is insane.  Here's why:
Deleting a single file changes only a few bytes of directory metadata.
With your 6 drive md/RAID6 with 512KB chunk, you must read 3MB of data,
modify the directory block in question, calculate parity, then write out
3MB of data to rust.  So you consume 6MB of bandwidth to write less than
a dozen bytes.  With a 12 drive RAID6 that's 12MB of bandwidth to modify
a few bytes of metadata.  Yes, insane.

Ehrm no. If you modify, say, a 4K block on a RAID5 array, you just have
to read that 4K block, and the corresponding 4K block on the
parity drive, recalculate parity, and write back 4K of data and 4K
of parity. (read|read) modify (write|write). You do not have to
do I/O in chunksize, ehm, chunks, and you do not have to rmw all disks.

Parity RAID sucks in general because of RMW, but it is orders of
magnitude worse when one chooses to use an insane chunk size to boot,
and especially so with a large drive count.

If you have a lot of parallel readers (readers > disks) then
you want chunk sizes of about 2*mean_read_size, so that for each
read you just have 1 seek on 1 disk.

If you have just a few readers (readers < disks) that read
really large blocks then you want a small chunk size to keep
all disks busy.

If you have no readers and just writers and you write large
blocks, then you might want a small chunk size too, so that
you can write data+parity over the stripe in one go, bypassing rmw.

Also, 256K or 512K isn't all that big nowadays, there's not much
latency difference between reading 32K or 512K..

Recreate your array, partition aligned, and manually specify a sane
chunk size of something like 32KB.  You'll be much happier with real
workloads.

Aligning is a good idea, and on modern distributions partitions,
LVM lv's etc are generally created with 1MB alignment. But using
a small chunksize like 32K? That depends on the workload, but
in most cases I'd advise against it.
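
For what it's worth, parted can check an existing partition against the
device's reported optimal I/O size (for md that's the full stripe); a
quick sanity check, assuming the array is /dev/md0:

# parted /dev/md0 align-check optimal 1

It should report whether partition 1 is aligned or not.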

Mike.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: O_DIRECT to md raid 6 is slow

2012-08-15 Thread Stan Hoeppner
On 8/15/2012 5:10 PM, Andy Lutomirski wrote:
 On Wed, Aug 15, 2012 at 3:00 PM, Stan Hoeppner s...@hardwarefreak.com wrote:
 On 8/15/2012 12:57 PM, Andy Lutomirski wrote:
 On Wed, Aug 15, 2012 at 4:50 AM, John Robinson
 john.robin...@anonymous.org.uk wrote:
 On 15/08/2012 01:49, Andy Lutomirski wrote:

 If I do:
 # dd if=/dev/zero of=/dev/md0p1 bs=8M

 [...]

 It looks like md isn't recognizing that I'm writing whole stripes when
 I'm in O_DIRECT mode.


 I see your md device is partitioned. Is the partition itself 
 stripe-aligned?

 Crud.

 md0 : active raid6 sdg1[5] sdf1[4] sde1[3] sdd1[2] sdc1[1] sdb1[0]
   11720536064 blocks super 1.2 level 6, 512k chunk, algorithm 2
 [6/6] [UUUUUU]

 IIUC this means that I/O should be aligned on 2MB boundaries (512k
 chunk * 4 non-parity disks).  gdisk put my partition on a 2048 sector
 (i.e. 1MB) boundary.

 It's time to blow away the array and start over.  You're already
 misaligned, and a 512KB chunk is insanely unsuitable for parity RAID,
 but for a handful of niche all streaming workloads with little/no
 rewrite, such as video surveillance or DVR workloads.

 Yes, 512KB is the md 1.2 default.  And yes, it is insane.  Here's why:
 Deleting a single file changes only a few bytes of directory metadata.
 With your 6 drive md/RAID6 with 512KB chunk, you must read 3MB of data,
 modify the directory block in question, calculate parity, then write out
 3MB of data to rust.  So you consume 6MB of bandwidth to write less than
 a dozen bytes.  With a 12 drive RAID6 that's 12MB of bandwidth to modify
 a few bytes of metadata.  Yes, insane.
 
 Grr.  I thought the bad old days of filesystem and related defaults
 sucking were over.  

The previous md chunk default of 64KB wasn't horribly bad, though still
maybe a bit high for a lot of common workloads.  I didn't have eyes/ears
on the discussion and/or testing process that led to the 'new' 512KB
default.  Obviously something went horribly wrong here.  512KB isn't a
show stopper as a default for 0/1/10, but is 8-16 times too large for
parity RAID.

 cryptsetup aligns sanely these days, xfs is
 sensible, etc.  

XFS won't align with the 512KB chunk default of metadata 1.2.  The
largest XFS journal stripe unit (su--chunk) is 256KB, and even that
isn't recommended.  Thus mkfs.xfs throws an error due to the 512KB
stripe.  See the md and xfs archives for more details, specifically Dave
Chinner's colorful comments on the md 512KB default.
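
With a sane chunk the numbers line up trivially.  E.g. for the 6-drive
RAID6 above rebuilt with a 32KB chunk (4 data spindles), something like:

# mkfs.xfs -d su=32k,sw=4 /dev/md0p1

mkfs.xfs will normally pick su/sw up from md on its own; spelling them
out just makes the intent explicit.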

 wtf?  <rant>Why is there no sensible filesystem for
 huge disks?  zfs can't cp --reflink and has all kinds of source
 availability and licensing issues, xfs can't dedupe at all, and btrfs
 isn't nearly stable enough.</rant>

Deduplication isn't a responsibility of a filesystem.  TTBOMK there are
two, and only two, COW filesystems in existence:  ZFS and BTRFS.  And
these are the only two to offer a native dedupe capability.  They did it
because they could, with COW, not necessarily because they *should*.
There are dozens of other single node, cluster, and distributed
filesystems in use today and none of them support COW, and thus none
support dedup.  So to *expect* a 'sensible' filesystem to include dedupe
is wishful thinking at best.

 Anyhow, I'll try the patch from Wu Fengguang.  There's still a bug here...

Always one somewhere.

-- 
Stan

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: O_DIRECT to md raid 6 is slow

2012-08-15 Thread Andy Lutomirski
On Wed, Aug 15, 2012 at 4:50 PM, Stan Hoeppner s...@hardwarefreak.com wrote:
 On 8/15/2012 5:10 PM, Andy Lutomirski wrote:
 On Wed, Aug 15, 2012 at 3:00 PM, Stan Hoeppner s...@hardwarefreak.com 
 wrote:
 On 8/15/2012 12:57 PM, Andy Lutomirski wrote:
 On Wed, Aug 15, 2012 at 4:50 AM, John Robinson
 john.robin...@anonymous.org.uk wrote:
 On 15/08/2012 01:49, Andy Lutomirski wrote:

 If I do:
 # dd if=/dev/zero of=/dev/md0p1 bs=8M

 [...]

 Grr.  I thought the bad old days of filesystem and related defaults
 sucking were over.

 The previous md chunk default of 64KB wasn't horribly bad, though still
 maybe a bit high for a lot of common workloads.  I didn't have eyes/ears
 on the discussion and/or testing process that led to the 'new' 512KB
 default.  Obviously something went horribly wrong here.  512KB isn't a
 show stopper as a default for 0/1/10, but is 8-16 times too large for
 parity RAID.

 cryptsetup aligns sanely these days, xfs is
 sensible, etc.

 XFS won't align with the 512KB chunk default of metadata 1.2.  The
 largest XFS journal stripe unit (su--chunk) is 256KB, and even that
 isn't recommended.  Thus mkfs.xfs throws an error due to the 512KB
 stripe.  See the md and xfs archives for more details, specifically Dave
 Chinner's colorful comments on the md 512KB default.

Heh -- that's why the math didn't make any sense :)


 wtf?  <rant>Why is there no sensible filesystem for
 huge disks?  zfs can't cp --reflink and has all kinds of source
 availability and licensing issues, xfs can't dedupe at all, and btrfs
 isn't nearly stable enough.</rant>

 Deduplication isn't a responsibility of a filesystem.  TTBOMK there are
 two, and only two, COW filesystems in existence:  ZFS and BTRFS.  And
 these are the only two to offer a native dedupe capability.  They did it
 because they could, with COW, not necessarily because they *should*.
 There are dozens of other single node, cluster, and distributed
 filesystems in use today and none of them support COW, and thus none
 support dedup.  So to *expect* a 'sensible' filesystem to include dedupe
 is wishful thinking at best.

I should clarify my rant for the record.  I don't care about in-fs
dedupe.  I want COW so userspace can dedupe and generally replace
hardlinks with sensible cowlinks.  I'm also working on some fun tools
that *require* reflinks for anything resembling decent performance.
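
(The userspace side already exists, it's the filesystem support that's
missing -- e.g., with made-up file names, this is instant on btrfs or
ocfs2 and simply fails elsewhere:)

# cp --reflink=always big-image.qcow2 clone.qcow2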

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Re: O_DIRECT to md raid 6 is slow

2012-08-14 Thread kedacomkernel
On 2012-08-15 09:12 Andy Lutomirski  Wrote:
>Ubuntu's 3.2.0-27-generic.  I can test on a newer kernel tomorrow.
I guess the blk_plug handling may be missing here.
Can you apply this patch and retest?
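
Something like this should do for a quick test (the patch file name
below is just a placeholder):

# patch -p1 < dio-plugging.patch
# make -j8 && make modules_install install

Then boot the new kernel and repeat the dd oflag=direct run.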

Move unplugging for direct I/O from around ->direct_IO() down to
do_blockdev_direct_IO(). This implicitly adds plugging for direct
writes.
 
CC: Li Shaohua 
Acked-by: Jeff Moyer 
Signed-off-by: Wu Fengguang 
---
 fs/direct-io.c |5 +
 mm/filemap.c   |4 
 2 files changed, 5 insertions(+), 4 deletions(-)
 
--- linux-next.orig/mm/filemap.c 2012-08-05 16:24:47.859465122 +0800
+++ linux-next/mm/filemap.c 2012-08-05 16:24:48.407465135 +0800
@@ -1412,12 +1412,8 @@ generic_file_aio_read(struct kiocb *iocb
  retval = filemap_write_and_wait_range(mapping, pos,
  pos + iov_length(iov, nr_segs) - 1);
  if (!retval) {
- struct blk_plug plug;
-
- blk_start_plug(&plug);
  retval = mapping->a_ops->direct_IO(READ, iocb,
  iov, pos, nr_segs);
- blk_finish_plug(&plug);
  }
  if (retval > 0) {
  *ppos = pos + retval;
--- linux-next.orig/fs/direct-io.c 2012-07-07 21:46:39.531508198 +0800
+++ linux-next/fs/direct-io.c 2012-08-05 16:24:48.411465136 +0800
@@ -1062,6 +1062,7 @@ do_blockdev_direct_IO(int rw, struct kio
  unsigned long user_addr;
  size_t bytes;
  struct buffer_head map_bh = { 0, };
+ struct blk_plug plug;
 
  if (rw & WRITE)
  rw = WRITE_ODIRECT;
@@ -1177,6 +1178,8 @@ do_blockdev_direct_IO(int rw, struct kio
  PAGE_SIZE - user_addr / PAGE_SIZE);
  }
 
+ blk_start_plug(&plug);
+
  for (seg = 0; seg < nr_segs; seg++) {
  user_addr = (unsigned long)iov[seg].iov_base;
  sdio.size += bytes = iov[seg].iov_len;
@@ -1235,6 +1238,8 @@ do_blockdev_direct_IO(int rw, struct kio
  if (sdio.bio)
  dio_bio_submit(dio, &sdio);
 
+ blk_finish_plug(&plug);
+
  /*
   * It is possible that, we return short IO due to end of file.
   * In that case, we need to release all the pages we got hold on.
 
 
--


Re: O_DIRECT to md raid 6 is slow

2012-08-14 Thread Andy Lutomirski
Ubuntu's 3.2.0-27-generic.  I can test on a newer kernel tomorrow.

--Andy

On Tue, Aug 14, 2012 at 6:07 PM, kedacomkernel  wrote:
> On 2012-08-15 08:49 Andy Lutomirski  Wrote:
>>If I do:
>># dd if=/dev/zero of=/dev/md0p1 bs=8M
>>then iostat -m 5 says:
>>
>>avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>           0.00    0.00   26.88   35.27    0.00   37.85
>>
>>Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
>>sdb             265.20         1.16        54.79          5        273
>>sdc             266.20         1.47        54.73          7        273
>>sdd             264.20         1.38        54.54          6        272
>>sdf             286.00         1.84        54.74          9        273
>>sde             266.60         1.04        54.75          5        273
>>sdg             265.00         1.02        54.74          5        273
>>md0           55808.00         0.00       218.00          0       1090
>>
>>If I do:
>># dd if=/dev/zero of=/dev/md0p1 bs=8M oflag=direct
>>then iostat -m 5 says:
>>avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>           0.00    0.00   11.70   12.94    0.00   75.36
>>
>>Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
>>sdb             831.00         8.58        30.42         42        152
>>sdc             832.80         8.05        29.99         40        149
>>sdd             832.00         9.10        29.78         45        148
>>sdf             838.40         9.11        29.72         45        148
>>sde             828.80         7.91        29.79         39        148
>>sdg             850.80         8.00        30.18         40        150
>>md0            1012.60         0.00       101.27          0        506
>>
>>It looks like md isn't recognizing that I'm writing whole stripes when
>>I'm in O_DIRECT mode.
>>
> kernel version?
>
>>--Andy
>>
>>--
>>Andy Lutomirski
>>AMA Capital Management, LLC
>>--
>>To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>>the body of a message to majord...@vger.kernel.org
>>More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: O_DIRECT to md raid 6 is slow

2012-08-14 Thread kedacomkernel
On 2012-08-15 08:49 Andy Lutomirski  Wrote:
>If I do:
># dd if=/dev/zero of=/dev/md0p1 bs=8M
>then iostat -m 5 says:
>
>avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           0.00    0.00   26.88   35.27    0.00   37.85
>
>Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
>sdb             265.20         1.16        54.79          5        273
>sdc             266.20         1.47        54.73          7        273
>sdd             264.20         1.38        54.54          6        272
>sdf             286.00         1.84        54.74          9        273
>sde             266.60         1.04        54.75          5        273
>sdg             265.00         1.02        54.74          5        273
>md0           55808.00         0.00       218.00          0       1090
>
>If I do:
># dd if=/dev/zero of=/dev/md0p1 bs=8M oflag=direct
>then iostat -m 5 says:
>avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           0.00    0.00   11.70   12.94    0.00   75.36
>
>Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
>sdb             831.00         8.58        30.42         42        152
>sdc             832.80         8.05        29.99         40        149
>sdd             832.00         9.10        29.78         45        148
>sdf             838.40         9.11        29.72         45        148
>sde             828.80         7.91        29.79         39        148
>sdg             850.80         8.00        30.18         40        150
>md0            1012.60         0.00       101.27          0        506
>
>It looks like md isn't recognizing that I'm writing whole stripes when
>I'm in O_DIRECT mode.
>
kernel version?

>--Andy
>
>-- 
>Andy Lutomirski
>AMA Capital Management, LLC
>--
>To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>the body of a message to majord...@vger.kernel.org
>More majordomo info at  
>http://vger.kernel.org/majordomo-info.html

O_DIRECT to md raid 6 is slow

2012-08-14 Thread Andy Lutomirski
If I do:
# dd if=/dev/zero of=/dev/md0p1 bs=8M
then iostat -m 5 says:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00   26.88   35.27    0.00   37.85

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb             265.20         1.16        54.79          5        273
sdc             266.20         1.47        54.73          7        273
sdd             264.20         1.38        54.54          6        272
sdf             286.00         1.84        54.74          9        273
sde             266.60         1.04        54.75          5        273
sdg             265.00         1.02        54.74          5        273
md0           55808.00         0.00       218.00          0       1090

If I do:
# dd if=/dev/zero of=/dev/md0p1 bs=8M oflag=direct
then iostat -m 5 says:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00   11.70   12.94    0.00   75.36

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb             831.00         8.58        30.42         42        152
sdc             832.80         8.05        29.99         40        149
sdd             832.00         9.10        29.78         45        148
sdf             838.40         9.11        29.72         45        148
sde             828.80         7.91        29.79         39        148
sdg             850.80         8.00        30.18         40        150
md0            1012.60         0.00       101.27          0        506

It looks like md isn't recognizing that I'm writing whole stripes when
I'm in O_DIRECT mode.
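
(The read columns above are the giveaway: in the O_DIRECT run each
member does ~8-9 MB/s of reads even though dd only writes, which looks
like read-modify-write.  Easy to watch with extended stats limited to
the devices in question:)

# iostat -mx /dev/sd[b-g] /dev/md0 5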

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

