Re: [Lustre-discuss] mkfs options/tuning for RAID based OSTs

2010-10-20 Thread Brian J. Murrell
On Tue, 2010-10-19 at 21:00 -0400, Edward Walter wrote:

Ed,

> That seems to validate how I'm interpreting the parameters. We have 10 data
> disks and 2 parity disks per array so it looks like we need to be at 64 KB
> or less.

I think you have been missing everyone's point in this thread.  The
magic value is not anything below 1MB, it's 1MB exactly.  No more, no
less (although I guess technically 256KB or 512KB would work).

The reason is that Lustre attempts to package up I/Os from the client to
the OST in 1MB chunks.  If the RAID stripe width matches that 1MB, then
when the OSS writes that 1MB to the OST, it's a single full-stripe write
to the RAID array underlying the OST: 1MB of data plus the parity.

Conversely, if the OSS receives 1MB of data for the OST and the RAID
stripe width under the OST is not 1MB, but less, then the first
raid_stripe_size bytes will be written as data+parity to the first
stripe, but the remaining portion of that 1MB of data from the client
will only partially fill the next RAID stripe, forcing the RAID layer to
first read that whole stripe, insert the new data, calculate a new
parity and then write that whole RAID stripe back out to the disk.

So as you can see, when your RAID stripe is not exactly 1MB, the RAID
code has to do a lot more I/O, which impacts performance, obviously.

This is why the recommendations in this thread have consistently been to
use a number of data disks that divides evenly into 1MB (i.e. powers
of 2: 2, 4, 8, etc.).  So for RAID6: 4+2 or 8+2, etc.
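
To make the arithmetic concrete (segment sizes here are illustrative):

  8 data disks  * 128KB segment = 1024KB = 1MB -> full-stripe writes, no RMW
  4 data disks  * 256KB segment = 1024KB = 1MB -> full-stripe writes, no RMW
  10 data disks *  64KB segment =  640KB       -> a 1MB write spans 1.6
                                                  stripes, so every write is
                                                  a read-modify-write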

b.





Re: [Lustre-discuss] mkfs options/tuning for RAID based OSTs

2010-10-20 Thread Edward Walter
Hi Brian,

Thanks for the clarification.  It didn't click that the optimal data 
size is exactly 1MB...  Everything you're saying makes sense though. 

Obviously with 12-disk arrays there's tension between maximizing space 
and maximizing performance.  I was hoping/trying to get the best of 
both.  The difference between doing 10 data and 2 parity vs 4+2 or 8+2 
works out to a difference of 2 data disks (4 TB) per shelf for us, or 24 
TB in total, which is why I was trying to figure out how to make this 
work with more data disks.
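
Spelling out the per-shelf capacity math (2TB drives, six shelves):

  10+2 RAID6:      10 data disks * 2TB = 20TB usable per shelf
  8+2 RAID6:        8 data disks * 2TB = 16TB usable per shelf
  2x (4+2) RAID6:   8 data disks * 2TB = 16TB usable per shelf
  difference:       4TB per shelf * 6 shelves = 24TB total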

Thanks to everyone for the input.  This has been very helpful.

-Ed

Brian J. Murrell wrote:

> [Brian's message above quoted in full; trimmed]


Re: [Lustre-discuss] mkfs options/tuning for RAID based OSTs

2010-10-20 Thread Charland, Denis

Brian J. Murrell wrote:

> This is why the recommendations in this thread have consistently been to
> use a number of data disks that divides evenly into 1MB (i.e. powers
> of 2: 2, 4, 8, etc.).  So for RAID6: 4+2 or 8+2, etc.

What about RAID5?

Denis


Re: [Lustre-discuss] mkfs options/tuning for RAID based OSTs

2010-10-20 Thread Edward Walter
Hi Denis,

Changing the number of parity disks (RAID5 = 1, RAID6 = 2) doesn't 
change the math on the data disks and segment size. You still need 
a power-of-2 number of data disks to ensure that the product of the RAID 
chunk size and the number of data disks is 1MB.
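
For example (chunk sizes illustrative):

  RAID5: 4+1 with 256KB chunks -> 4 * 256KB = 1MB
         8+1 with 128KB chunks -> 8 * 128KB = 1MB
  RAID6: 4+2 with 256KB chunks -> 4 * 256KB = 1MB
         8+2 with 128KB chunks -> 8 * 128KB = 1MB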

Aside from that, I wouldn't comfortably rely on RAID5 to protect my data 
at this point. We've seen too many dual-disk failures to trust it.

Thanks.

-Ed

Charland, Denis wrote:

> What about RAID5?


Re: [Lustre-discuss] mkfs options/tuning for RAID based OSTs

2010-10-20 Thread Bernd Schubert
On Wednesday, October 20, 2010, Charland, Denis wrote:

> What about RAID5?

Personally I don't like RAID5 too much, but with RAID5 it is obviously +1
instead of +2.


Cheers,
Bernd

-- 
Bernd Schubert
DataDirect Networks


Re: [Lustre-discuss] mkfs options/tuning for RAID based OSTs

2010-10-20 Thread Wojciech Turek
Hi Edward,

As Andreas mentioned earlier, the max OST size is 16TB if one uses ext4-based
ldiskfs, so creating a RAID group bigger than that will definitely hurt your
performance: you would have to split the large array into smaller logical
disks, which randomises IOs on the RAID controller. With 2TB disks, RAID6 is
the way to go, as the rebuild time of a failed disk is quite long, which
increases the chance of a double disk failure to an uncomfortable level.
Taking that into consideration, I think that 8+2 RAID6 with a 128KB segment
size is the right choice. The leftover disks can be used as hot spares or
for an external journal.
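
As a sketch of the external-journal setup (device names are placeholders;
check that your mke2fs and mkfs.lustre versions support these options):

  # create an external journal device on a spare disk,
  # using the same 4KB block size as the OST filesystem
  mke2fs -O journal_dev -b 4096 /dev/sdj

  # point the OST's ldiskfs at it when formatting
  mkfs.lustre --ost --fsname=lustre --mgsnode=mgs@tcp0 \
      --mkfsoptions="-J device=/dev/sdj" /dev/sdb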


On 20 October 2010 15:19, Edward Walter ewal...@cs.cmu.edu wrote:

> [Edward's earlier reply quoted in full; trimmed]


Re: [Lustre-discuss] mkfs options/tuning for RAID based OSTs

2010-10-19 Thread Paul Nowoczynski
Ed,
Does 'segment size' refer to the amount of data written to each disk 
before proceeding to the next disk (e.g. stride)?  This is my guess 
since these values are usually powers of two and therefore 51.2KB 
[512KB/(10 data disks)] is probably not the stride size.  In any event I 
think you'll get the most bang for your buck by creating a RAID stripe 
where n_data_disks * stride = 1MB.  My recent experience when dealing 
with our software RAID6 systems here is that eliminating 
read-modify-writes is key to achieving good performance.  I would 
recommend exploring configurations where the number of data disks is 
a power of 2 so that you can configure the stripe size to be 1MB.  I 
wouldn't be surprised if you see better performance by dividing the 12 
disks into 2x (4+2) RAID6 LUNs. 
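
For software RAID, a sketch of the 2x (4+2) layout (device names are
placeholders; mdadm's --chunk is in KB):

  # each LUN: 4 data + 2 parity, 256KB chunk -> 4 * 256KB = 1MB full stripe
  mdadm --create /dev/md0 --level=6 --raid-devices=6 --chunk=256 /dev/sd[b-g]
  mdadm --create /dev/md1 --level=6 --raid-devices=6 --chunk=256 /dev/sd[h-m]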
paul

Edward Walter wrote:
> Hello All,
>
> We're doing a fresh Lustre 1.8.4 install using Sun StorageTek 2540
> arrays for our OST targets.  We've configured these as RAID6 with no
> spares which means we have the equivalent of 10 data disks and 2 parity
> disks in play on each OST.
>
> We configured the Segment Size on these arrays at 512 KB.  I believe
> this is equivalent to the chunk size in the Lustre operations manual
> (section 10.1.1).  Based on the formulae in the manual, in order to have
> my stripe width fall below 1MB I need to reconfigure my Segment Size
> like this:
>
> Segment Size = 1024KB/(12-2) = 102.4 KB
> so 16KB, 32KB or 64KB are optimal values
> Does this seem right?
>
> Do I really need to do this (reinitialize the arrays/volumes) to get my
> Segment Size below 1MB?  What impact will/won't this have on performance?
>
> When I format the OST filesystem, I need to provide options for both
> stripe and stride.  The manual indicates that the units for these values
> are 4096-byte (4KB) blocks.  Given that, I should use something like:
>
> -E stride= (one of)
> 16KB/4KB = 4
> 32KB/4KB = 8
> 64KB/4KB = 16
>
> stripe= (one of)
> 16KB*10/4KB = 40
> 32KB*10/4KB = 80
> 64KB*10/4KB = 160
>
> so for example I would issue the following:
> mkfs.lustre --mountfsoptions="stripe=160" --mkfsoptions="-E stride=16 -m 1" ...
>
> Is it better to opt for the higher values or lower values here?
>
> Also, does anyone have recommendations for aligning the filesystem so
> that the fs blocks align with the RAID chunks?  We've done things like
> this for SSD drives.  We'd normally give Lustre the entire RAID device
> (without partitions) so this hasn't been an issue in the past.  For this
> installation though, we're creating multiple volumes (for size/space
> reasons) so partitioning is a necessary evil now.
>
> Thanks for any feedback!
>
> -Ed Walter
> Carnegie Mellon University



Re: [Lustre-discuss] mkfs options/tuning for RAID based OSTs

2010-10-19 Thread Dennis Nelson
Segment size should be 128 KB:

128 KB * 8 data drives = 1 MB.


On 10/19/10 3:42 PM, Edward Walter ewal...@cs.cmu.edu wrote:

> [Edward's original message quoted in full; trimmed]



Re: [Lustre-discuss] mkfs options/tuning for RAID based OSTs

2010-10-19 Thread Andreas Dilger
On 2010-10-19, at 14:42, Edward Walter wrote:
> We're doing a fresh Lustre 1.8.4 install using Sun StorageTek 2540
> arrays for our OST targets.  We've configured these as RAID6 with no
> spares which means we have the equivalent of 10 data disks and 2 parity
> disks in play on each OST.

As Paul mentioned, using something other than 8 data + N parity is bad for 
performance.  It is doubly bad if the stripe width (ndata * segment size) is 
> 1MB in size, because that means EVERY WRITE will be a read-modify-write, 
which will kill performance.

> Also, does anyone have recommendations for aligning the filesystem so
> that the fs blocks align with the RAID chunks?  We've done things like
> this for SSD drives.  We'd normally give Lustre the entire RAID device
> (without partitions) so this hasn't been an issue in the past.  For this
> installation though, we're creating multiple volumes (for size/space
> reasons) so partitioning is a necessary evil now.

Partitioning is doubly evil (unless done extremely carefully) because it will 
further mis-align the IO (due to the partition table and crazy MS-DOS 
odd-sector alignment), so that you will always partially modify extra blocks 
at the beginning/end of each write (possibly causing data corruption in case 
of incomplete writes/cache loss/etc).

If you stick with 8 data disks, and assuming 2TB drives or smaller, with 1.8.4 
you can use the ext4-based ldiskfs (in a separate ldiskfs RPM on the download 
site) to format up to 16TB LUNs for a single OST.  That is really the best 
configuration, and will probably double your write performance.
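
Concretely, a sketch for formatting an 8+2 RAID6 LUN with a 128KB segment
size (fsname, MGS NID and device are placeholders; assumes a mke2fs that
understands the stride/stripe_width extended options):

  # stride = 128KB segment / 4KB blocks = 32
  # stripe width = 32 * 8 data disks = 256 blocks = 1MB
  mkfs.lustre --ost --fsname=testfs --mgsnode=mgs@tcp0 \
      --mountfsoptions="stripe=256" \
      --mkfsoptions="-E stride=32,stripe_width=256" /dev/sdb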

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.



Re: [Lustre-discuss] mkfs options/tuning for RAID based OSTs

2010-10-19 Thread Edward Walter
Hi Dennis,

That seems to validate how I'm interpreting the parameters. We have 10 data 
disks and 2 parity disks per array so it looks like we need to be at 64 KB or 
less.

I'm guessing I'll just need to run some tests to see how performance changes as 
I adjust the segment size. 
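
If it helps anyone else, the kind of quick test I have in mind is just
sequential 1MB direct I/O with dd (path is a placeholder; a rough sanity
check, not a rigorous benchmark):

  # 4GB of sequential 1MB writes, bypassing the page cache
  dd if=/dev/zero of=/mnt/testfs/ddfile bs=1M count=4096 oflag=direct

  # read it back the same way
  dd if=/mnt/testfs/ddfile of=/dev/null bs=1M iflag=direct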

Thanks,

-Ed


On Oct 19, 2010, at 5:56 PM, Dennis Nelson dnel...@sgi.com wrote:

> Segment size should be 128 KB:
>
> 128 KB * 8 data drives = 1 MB.
>
> [Edward's original message quoted in full; trimmed]
