Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-29 Thread Michael Tokarev
Justin Piszcz wrote:
[]
> Good to know/have it confirmed by someone else, the alignment does not
> matter with Linux/SW RAID.

Alignment matters when one partitions a Linux/SW raid array.
If the inner partitions are not aligned on a stripe boundary,
especially in the worst case when filesystem blocks cross the
stripe boundary (I wonder if that's even possible... and I
think it is, if a partition starts at some odd 512-byte offset
and the filesystem block size is 4Kb), there's just no chance
for the inner filesystem to ever do full-stripe writes, so
(modulo stripe cache size) all writes will go the
read-modify-write or similar way.
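
A quick way to check for this on a running system is to compare a
partition's start offset against the array's full stripe. A minimal
sketch, assuming a partitionable md array visible as md_d0 with a first
partition md_d0p1 and a kernel that exposes the md sysfs attributes;
the device names are examples only:

chunk=$(cat /sys/block/md_d0/md/chunk_size)     # chunk size in bytes
disks=$(cat /sys/block/md_d0/md/raid_disks)     # total member disks
stripe=$(( chunk * (disks - 1) ))               # data per full RAID5 stripe
start=$(cat /sys/block/md_d0/md_d0p1/start)     # partition start in 512B sectors
if [ $(( start * 512 % stripe )) -eq 0 ]; then
        echo "md_d0p1 starts on a stripe boundary"
else
        echo "md_d0p1 does NOT start on a stripe boundary"
fi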

And that's what the original article is about, by the way.
It just happens that hardware raid arrays are more often split
into partitions (using native tools) than linux software raid
arrays.

And that's what has been pointed out in this thread, as well... ;)

/mjt


Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-29 Thread Justin Piszcz



On Sat, 29 Dec 2007, dean gaudet wrote:


On Tue, 25 Dec 2007, Bill Davidsen wrote:


The issue I'm thinking about is hardware sector size, which on modern drives
may be larger than 512b and therefore entail a read-alter-rewrite (RAR) cycle
when writing a 512b block.


i'm not sure any shipping SATA disks have larger than 512B sectors yet...
do you know of any?  (or is this thread about SCSI which i don't pay
attention to...)

on a brand new WDC WD7500AAKS-00RBA0 with this partition layout:

255 heads, 63 sectors/track, 91201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

so sda1 starts at a non-multiple of 4096 into the disk.

i ran some random seek+write experiments using randomio; here are the
results using 512 byte and 4096 byte writes (fsync after each write),
8 threads, on sda1:

# ./randomio /dev/sda1 8 1 1 512 10 6
  total |  read:         latency (ms)       |  write:        latency (ms)
   iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
--------+-----------------------------------+----------------------------------
  148.5 |    0.0   inf    nan    0.0    nan |  148.5   0.2   53.7   89.3   19.5
  129.2 |    0.0   inf    nan    0.0    nan |  129.2  37.2   61.9   96.7    9.3
  131.2 |    0.0   inf    nan    0.0    nan |  131.2  40.3   61.0   90.4    9.3
  132.0 |    0.0   inf    nan    0.0    nan |  132.0  39.6   60.6   89.3    9.1
  130.7 |    0.0   inf    nan    0.0    nan |  130.7  39.8   61.3   98.1    8.9
  131.4 |    0.0   inf    nan    0.0    nan |  131.4  40.0   60.8  101.0    9.6
# ./randomio /dev/sda1 8 1 1 4096 10 6
  total |  read:         latency (ms)       |  write:        latency (ms)
   iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
--------+-----------------------------------+----------------------------------
  141.7 |    0.0   inf    nan    0.0    nan |  141.7   0.3   56.3   99.3   21.1
  132.4 |    0.0   inf    nan    0.0    nan |  132.4  43.3   60.4   91.8    8.5
  131.6 |    0.0   inf    nan    0.0    nan |  131.6  41.4   60.9  111.0    9.6
  131.8 |    0.0   inf    nan    0.0    nan |  131.8  41.4   60.7   85.3    8.6
  130.6 |    0.0   inf    nan    0.0    nan |  130.6  41.7   61.3   95.0    9.4
  131.4 |    0.0   inf    nan    0.0    nan |  131.4  42.2   60.8   90.5    8.4


i think the anomalous results in the first 10s samples are perhaps the drive
coming out of a standby state.

and here are the results aligned using the sda raw device itself:

# ./randomio /dev/sda 8 1 1 512 10 6
  total |  read:         latency (ms)       |  write:        latency (ms)
   iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
--------+-----------------------------------+----------------------------------
  147.3 |    0.0   inf    nan    0.0    nan |  147.3   0.3   54.1   93.7   20.1
  132.4 |    0.0   inf    nan    0.0    nan |  132.4  37.4   60.6   91.8    9.2
  132.5 |    0.0   inf    nan    0.0    nan |  132.5  37.7   60.3   93.7    9.3
  131.8 |    0.0   inf    nan    0.0    nan |  131.8  39.4   60.7   92.7    9.0
  133.9 |    0.0   inf    nan    0.0    nan |  133.9  41.7   59.8   90.7    8.5
  130.2 |    0.0   inf    nan    0.0    nan |  130.2  40.8   61.5   88.6    8.9
# ./randomio /dev/sda 8 1 1 4096 10 6
  total |  read:         latency (ms)       |  write:        latency (ms)
   iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
--------+-----------------------------------+----------------------------------
  145.4 |    0.0   inf    nan    0.0    nan |  145.4   0.3   54.9   94.0   20.1
  130.3 |    0.0   inf    nan    0.0    nan |  130.3  36.0   61.4   92.7    9.6
  130.6 |    0.0   inf    nan    0.0    nan |  130.6  38.2   61.2   96.7    9.2
  132.1 |    0.0   inf    nan    0.0    nan |  132.1  39.0   60.5   93.5    9.2
  131.8 |    0.0   inf    nan    0.0    nan |  131.8  43.1   60.8   93.8    9.1
  129.0 |    0.0   inf    nan    0.0    nan |  129.0  40.2   62.0   96.4    8.8

it looks pretty much the same to me...

-dean



Good to know/have it confirmed by someone else, the alignment does not 
matter with Linux/SW RAID.


Justin.


Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-29 Thread dean gaudet
On Tue, 25 Dec 2007, Bill Davidsen wrote:

> The issue I'm thinking about is hardware sector size, which on modern drives
> may be larger than 512b and therefore entail a read-alter-rewrite (RAR) cycle
> when writing a 512b block.

i'm not sure any shipping SATA disks have larger than 512B sectors yet... 
do you know of any?  (or is this thread about SCSI which i don't pay 
attention to...)

on a brand new WDC WD7500AAKS-00RBA0 with this partition layout:

255 heads, 63 sectors/track, 91201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

so sda1 starts at a non-multiple of 4096 into the disk.
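
As a quick sanity check of that statement, restating the fdisk numbers
above in shell arithmetic (63 sectors is the usual DOS-style start):

echo $(( 63 * 512 ))          # 32256 bytes - where sda1 starts
echo $(( 63 * 512 % 4096 ))   # 3584 - so the start is not 4096-byte aligned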

i ran some random seek+write experiments using randomio; here are the
results using 512 byte and 4096 byte writes (fsync after each write),
8 threads, on sda1:

# ./randomio /dev/sda1 8 1 1 512 10 6
  total |  read:         latency (ms)       |  write:        latency (ms)
   iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
--------+-----------------------------------+----------------------------------
  148.5 |    0.0   inf    nan    0.0    nan |  148.5   0.2   53.7   89.3   19.5
  129.2 |    0.0   inf    nan    0.0    nan |  129.2  37.2   61.9   96.7    9.3
  131.2 |    0.0   inf    nan    0.0    nan |  131.2  40.3   61.0   90.4    9.3
  132.0 |    0.0   inf    nan    0.0    nan |  132.0  39.6   60.6   89.3    9.1
  130.7 |    0.0   inf    nan    0.0    nan |  130.7  39.8   61.3   98.1    8.9
  131.4 |    0.0   inf    nan    0.0    nan |  131.4  40.0   60.8  101.0    9.6
# ./randomio /dev/sda1 8 1 1 4096 10 6
  total |  read:         latency (ms)       |  write:        latency (ms)
   iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
--------+-----------------------------------+----------------------------------
  141.7 |    0.0   inf    nan    0.0    nan |  141.7   0.3   56.3   99.3   21.1
  132.4 |    0.0   inf    nan    0.0    nan |  132.4  43.3   60.4   91.8    8.5
  131.6 |    0.0   inf    nan    0.0    nan |  131.6  41.4   60.9  111.0    9.6
  131.8 |    0.0   inf    nan    0.0    nan |  131.8  41.4   60.7   85.3    8.6
  130.6 |    0.0   inf    nan    0.0    nan |  130.6  41.7   61.3   95.0    9.4
  131.4 |    0.0   inf    nan    0.0    nan |  131.4  42.2   60.8   90.5    8.4


i think the anomalous results in the first 10s samples are perhaps the drive
coming out of a standby state.

and here are the results aligned using the sda raw device itself:

# ./randomio /dev/sda 8 1 1 512 10 6
  total |  read:         latency (ms)       |  write:        latency (ms)
   iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
--------+-----------------------------------+----------------------------------
  147.3 |    0.0   inf    nan    0.0    nan |  147.3   0.3   54.1   93.7   20.1
  132.4 |    0.0   inf    nan    0.0    nan |  132.4  37.4   60.6   91.8    9.2
  132.5 |    0.0   inf    nan    0.0    nan |  132.5  37.7   60.3   93.7    9.3
  131.8 |    0.0   inf    nan    0.0    nan |  131.8  39.4   60.7   92.7    9.0
  133.9 |    0.0   inf    nan    0.0    nan |  133.9  41.7   59.8   90.7    8.5
  130.2 |    0.0   inf    nan    0.0    nan |  130.2  40.8   61.5   88.6    8.9
# ./randomio /dev/sda 8 1 1 4096 10 6
  total |  read:         latency (ms)       |  write:        latency (ms)
   iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
--------+-----------------------------------+----------------------------------
  145.4 |    0.0   inf    nan    0.0    nan |  145.4   0.3   54.9   94.0   20.1
  130.3 |    0.0   inf    nan    0.0    nan |  130.3  36.0   61.4   92.7    9.6
  130.6 |    0.0   inf    nan    0.0    nan |  130.6  38.2   61.2   96.7    9.2
  132.1 |    0.0   inf    nan    0.0    nan |  132.1  39.0   60.5   93.5    9.2
  131.8 |    0.0   inf    nan    0.0    nan |  131.8  43.1   60.8   93.8    9.1
  129.0 |    0.0   inf    nan    0.0    nan |  129.0  40.2   62.0   96.4    8.8

it looks pretty much the same to me...

-dean


Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-25 Thread Bill Davidsen

Robin Hill wrote:

On Wed Dec 19, 2007 at 09:50:16AM -0500, Justin Piszcz wrote:

  

The (up to) 30% percent figure is mentioned here:
http://insights.oetiker.ch/linux/raidoptimization.html



That looks to be referring to partitioning a RAID device - this'll only
apply to hardware RAID or partitionable software RAID, not to the normal
use case.  When you're creating an array out of standard partitions then
you know the array stripe size will align with the disks (there's no way
it cannot), and you can set the filesystem stripe size to align as well
(XFS will do this automatically).

I've actually done tests on this with hardware RAID to try to find the
correct partition offset, but wasn't able to see any difference (using
bonnie++ and moving the partition start by one sector at a time).

  

# fdisk -l /dev/sdc

Disk /dev/sdc: 150.0 GB, 150039945216 bytes
255 heads, 63 sectors/track, 18241 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x5667c24a

   Device Boot  Start End  Blocks   Id  System
/dev/sdc1   1   18241   146520801   fd  Linux raid 
autodetect




This looks to be a normal disk - the partition offsets shouldn't be
relevant here (barring any knowledge of the actual physical disk layout
anyway, and block remapping may well make that rather irrelevant).
  
The issue I'm thinking about is hardware sector size, which on modern 
drives may be larger than 512B and therefore entail a read-alter-rewrite 
(RAR) cycle when writing a 512B block. With larger writes, if the 
alignment is poor and the write size is some multiple of 512, it's 
possible to have an RAR at each end of the write. The only way to have a 
hope of controlling the alignment is to write to a raw device, or to use 
a filesystem which can be configured to have blocks which are a multiple 
of the sector size and to do all i/o in block-sized units, starting each 
file on a block boundary. That may be possible with ext[234] set up 
properly.
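
For ext3, the closest knobs are the block size and mke2fs's RAID extended
options; a hedged sketch for the 6-drive, 256 KiB-chunk RAID5 discussed
elsewhere in this thread (5 data disks; the md device name is a
placeholder, and -E stripe-width needs a fairly recent e2fsprogs):

# stride = chunk / block = 256k / 4k = 64; stripe-width = stride * 5 data disks
mkfs.ext3 -b 4096 -E stride=64,stripe-width=320 /dev/md0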


Why this is important: the physical layout of the drive is useful, but 
for a large write the drive will have to make some number of steps from 
one cylinder to another. By carefully choosing the starting point, the 
best improvement will be to eliminate two track-to-track seek times, one 
at the start and one at the end. If the writes are small, only one t2t 
saving is possible.


Now consider a RAR process. The drive is typically spinning at 7200 rpm, 
or 8.333 ms/rev. A read might take .5 rev on average, and a RAR will 
take 1.5 rev, because it takes a full revolution after the original data 
is read before the altered data can be rewritten. Larger sectors give 
more capacity, but reduced write performance. And doing small writes 
can result in paying the RAR penalty on every write. So there may be a 
measurable benefit to getting that alignment right at the drive level.
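
Putting rough numbers on that, using only the figures above (a
back-of-the-envelope sketch, not a measurement):

awk 'BEGIN {
    rev  = 60000 / 7200     # 8.333 ms per revolution at 7200 rpm
    read = 0.5 * rev        # ~4.2 ms average rotational delay for the read
    rar  = 1.5 * rev        # ~12.5 ms when a read-alter-rewrite is needed
    printf "rev %.3f ms, read %.1f ms, RAR %.1f ms (about 3x)\n", rev, read, rar
}'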


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 





Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-20 Thread Justin Piszcz



On Thu, 20 Dec 2007, Bill Davidsen wrote:


Justin Piszcz wrote:



On Wed, 19 Dec 2007, Bill Davidsen wrote:

I'm going to try another approach, I'll describe it when I get results (or 
not).


http://home.comcast.net/~jpiszcz/align_vs_noalign/

Hardly any difference at whatsoever, only on the per char for read/write is 
it any faster..?


Am I misreading what you are doing here... you have the underlying data on 
the actual hardware devices 64k aligned by using either the whole device or 
starting a partition on a 64k boundary? I'm dubious that you will see a 
difference any other way, after all the translations take place.


I'm trying creating a raid array using loop devices created with the "offset" 
parameter, but I suspect that I will wind up doing a test after just 
repartitioning the drives, painful as that will be.


Average of 3 runs taken:

$ cat align/*log|grep ,
p63,8G,57683,94,86479,13,55242,8,63495,98,147647,11,434.8,0,16:10:16/64,1334210,10,330,2,120,1,3978,10,312,2 
p63,8G,57973,95,76702,11,50830,7,62291,99,136477,10,388.3,0,16:10:16/64,1252548,6,296,1,115,1,7927,20,373,2 
p63,8G,57758,95,80847,12,52144,8,63874,98,144747,11,443.4,0,16:10:16/64,1242445,6,303,1,117,1,6767,17,359,2 


$ cat noalign/*log|grep ,
p63,8G,57641,94,85494,12,55669,8,63802,98,146925,11,434.8,0,16:10:16/64,1353180,8,314,1,117,1,8684,22,283,2 
p63,8G,57705,94,85929,12,56708,8,63855,99,143437,11,436.2,0,16:10:16/64,12211519,29,297,1,113,1,3218,8,325,2 
p63,8G,57783,94,78226,11,48580,7,63487,98,137721,10,438.7,0,16:10:16/64,1243229,8,307,1,120,1,4247,11,313,2 




--
Bill Davidsen <[EMAIL PROTECTED]>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismark 



1. For the first test I made partitions on each drive like I normally do.
2. For the second test I followed the EMC document on how to properly 
align the partitions and Microsoft's document on how to calculate the 
correct offset; I used 512 sectors for the 256k stripe.


Justin.


Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-20 Thread Bill Davidsen

Justin Piszcz wrote:



On Wed, 19 Dec 2007, Bill Davidsen wrote:

I'm going to try another approach, I'll describe it when I get 
results (or not).


http://home.comcast.net/~jpiszcz/align_vs_noalign/

Hardly any difference at whatsoever, only on the per char for 
read/write is it any faster..?


Am I misreading what you are doing here... you have the underlying data 
on the actual hardware devices 64k aligned by using either the whole 
device or starting a partition on a 64k boundary? I'm dubious that you 
will see a difference any other way, after all the translations take place.


I'm trying to create a raid array using loop devices created with the 
"offset" parameter, but I suspect that I will wind up doing a test after 
just repartitioning the drives, painful as that will be.


Average of 3 runs taken:

$ cat align/*log|grep ,
p63,8G,57683,94,86479,13,55242,8,63495,98,147647,11,434.8,0,16:10:16/64,1334210,10,330,2,120,1,3978,10,312,2 

p63,8G,57973,95,76702,11,50830,7,62291,99,136477,10,388.3,0,16:10:16/64,1252548,6,296,1,115,1,7927,20,373,2 

p63,8G,57758,95,80847,12,52144,8,63874,98,144747,11,443.4,0,16:10:16/64,1242445,6,303,1,117,1,6767,17,359,2 



$ cat noalign/*log|grep ,
p63,8G,57641,94,85494,12,55669,8,63802,98,146925,11,434.8,0,16:10:16/64,1353180,8,314,1,117,1,8684,22,283,2 

p63,8G,57705,94,85929,12,56708,8,63855,99,143437,11,436.2,0,16:10:16/64,12211519,29,297,1,113,1,3218,8,325,2 

p63,8G,57783,94,78226,11,48580,7,63487,98,137721,10,438.7,0,16:10:16/64,1243229,8,307,1,120,1,4247,11,313,2 






--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 





Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-20 Thread Michal Soltys

Jon Nelson wrote:


That, for me, is the next question - how can one educate LVM about the
underlying block device such that logical volumes carved out of that
space align properly - many of us have experienced 30% (or so)
performance losses for the convenience of LVM (and mighty convenient
it is).



When you do pvcreate you can specify --metadatasize; LVM will add padding 
just to hit the next 64K boundary.


So, for example, if you do

#pvcreate --metadatasize 256K /dev/sda

your metadata area will end up at 320K, but with

#pvcreate --metadatasize 255K /dev/sda

it will be 256K.

After that, the lvm extents follow, with the size specified during 
vgcreate. Also, I tend to use rather large extents (512M), as I don't 
really need the granularity offered by the default 4M.
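
To see where the data area actually starts after that, pvs can report
it; a small sketch, assuming an LVM2 pvs that knows the pe_start field
(the device name is just an example):

# print each PV's name and the offset of its first physical extent
pvs -o pv_name,pe_start --units k /dev/sda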




Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-20 Thread Gabor Gombas
On Wed, Dec 19, 2007 at 10:31:12AM -0500, Justin Piszcz wrote:

> Some nice graphs found here:
> http://sqlblog.com/blogs/linchi_shea/archive/2007/02/01/performance-impact-of-disk-misalignment.aspx

Again, this is a HW RAID, and the partitioning is done _on top of_ the
RAID.

Gabor

-- 
 -
 MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
 -


Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-20 Thread Gabor Gombas
On Wed, Dec 19, 2007 at 04:01:43PM +0100, Mattias Wadenstein wrote:

> From that setup it seems simple, scrap the partition table and use the disk 
> device for raid. This is what we do for all data storage disks (hw raid) 
> and sw raid members.

And _exactly_ that's when you run into the alignment problem. The common
SW RAID case (first partitioning, then building RAID arrays from
individual partitions) does not suffer from this issue.

Gabor

-- 
 -
 MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
 -


Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-20 Thread Gabor Gombas
On Wed, Dec 19, 2007 at 12:55:16PM -0500, Justin Piszcz wrote:

> unligned, just fdisk /dev/sdc, mkpartition, fd raid.
>  aligned, fdisk, expert, start at 512 as the off-set

No, that won't show any difference. You need to partition _the RAID
device_. If the partitioning is below the RAID level, then alignment does
not matter.

What is missing from your original quote is that the original reporter
used fake-HW RAID which can only handle full disks, and not individual
partitions. So if you want to experience the same performance drop, you
should also RAID full disks together and then put partitions on top of
the RAID array.

Gabor

-- 
 -
 MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
 -


Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-19 Thread Jon Nelson
On 12/19/07, Michal Soltys <[EMAIL PROTECTED]> wrote:
> Justin Piszcz wrote:
> >
> > Or is there a better way to do this, does parted handle this situation
> > better?
> >
> > What is the best (and correct) way to calculate stripe-alignment on the
> > RAID5 device itself?
> >
> >
> > Does this also apply to Linux/SW RAID5?  Or are there any caveats that
> > are not taken into account since it is based in SW vs. HW?
> >
> > ---
>
> In case of SW or HW raid, when you place raid aware filesystem directly on
> it, I don't see any potential poblems
>
> Also, if md's superblock version/placement actually mattered, it'd be pretty
> strange. The space available for actual use - be it partitions or filesystem
> directly - should be always nicely aligned. I don't know that for sure though.
>
> If you use SW partitionable raid, or HW raid with partitions, then you would
> have to align it on a chunk boundary manually. Any selfrespecting os
> shouldn't complain a partition doesn't start on cylinder boundary these
> days. LVM can complicate life a bit too - if you want it's volumes to be
> chunk-aligned.

That, for me, is the next question: how can one educate LVM about the
underlying block device so that logical volumes carved out of that
space align properly?  Many of us have experienced 30% (or so)
performance losses for the convenience of LVM (and mighty convenient
it is).


-- 
Jon


Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-19 Thread Michal Soltys

Justin Piszcz wrote:


Or is there a better way to do this, does parted handle this situation 
better?


What is the best (and correct) way to calculate stripe-alignment on the 
RAID5 device itself?



Does this also apply to Linux/SW RAID5?  Or are there any caveats that 
are not taken into account since it is based in SW vs. HW?


---


In the case of SW or HW raid, when you place a raid-aware filesystem 
directly on it, I don't see any potential problems.


Also, if md's superblock version/placement actually mattered, it'd be pretty 
strange. The space available for actual use - be it partitions or a filesystem 
directly - should always be nicely aligned. I don't know that for sure though.


If you use SW partitionable raid, or HW raid with partitions, then you would
have to align it on a chunk boundary manually. Any self-respecting OS 
shouldn't complain that a partition doesn't start on a cylinder boundary 
these days. LVM can complicate life a bit too, if you want its volumes to be 
chunk-aligned.
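
One way to do that manual alignment is to create the partition in
explicit sector units; a sketch, assuming a partitionable array
/dev/md_d0 with a 64 KiB chunk (128 sectors) and a placeholder end
sector:

# start the first partition at sector 128 (64 KiB) rather than the usual 63
parted -s /dev/md_d0 unit s mkpart primary 128 20971519
# verify the start sector
parted -s /dev/md_d0 unit s print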


With NTFS the problem is that it's not aware of the underlying raid in any 
way. It starts with a 16-sector-long boot sector, somewhat compatible with 
ancient FAT. My blind guess would be to try to align the very first sector 
of $Mft with your chunk. The mentioned boot sector is also referenced as 
$Boot, so I don't know whether a large cluster won't automatically extend 
it to the full cluster size. Experiment, YMMV :)





Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-19 Thread Justin Piszcz



On Wed, 19 Dec 2007, Robin Hill wrote:


On Wed Dec 19, 2007 at 09:50:16AM -0500, Justin Piszcz wrote:


The (up to) 30% percent figure is mentioned here:
http://insights.oetiker.ch/linux/raidoptimization.html


That looks to be referring to partitioning a RAID device - this'll only
apply to hardware RAID or partitionable software RAID, not to the normal
use case.  When you're creating an array out of standard partitions then
you know the array stripe size will align with the disks (there's no way
it cannot), and you can set the filesystem stripe size to align as well
(XFS will do this automatically).

I've actually done tests on this with hardware RAID to try to find the
correct partition offset, but wasn't able to see any difference (using
bonnie++ and moving the partition start by one sector at a time).


# fdisk -l /dev/sdc

Disk /dev/sdc: 150.0 GB, 150039945216 bytes
255 heads, 63 sectors/track, 18241 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x5667c24a

   Device Boot  Start End  Blocks   Id  System
/dev/sdc1   1   18241   146520801   fd  Linux raid
autodetect


This looks to be a normal disk - the partition offsets shouldn't be
relevant here (barring any knowledge of the actual physical disk layout
anyway, and block remapping may well make that rather irrelevant).

That's my take on this one anyway.

Cheers,
   Robin
--
___
   ( ' } |   Robin Hill<[EMAIL PROTECTED]> |
  / / )  | Little Jim says |
 // !!   |  "He fallen in de water !!" |



Interesting, yes, I am using XFS as well, thanks for the response.


Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-19 Thread Robin Hill
On Wed Dec 19, 2007 at 09:50:16AM -0500, Justin Piszcz wrote:

> The (up to) 30% percent figure is mentioned here:
> http://insights.oetiker.ch/linux/raidoptimization.html
>
That looks to be referring to partitioning a RAID device - this'll only
apply to hardware RAID or partitionable software RAID, not to the normal
use case.  When you're creating an array out of standard partitions then
you know the array stripe size will align with the disks (there's no way
it cannot), and you can set the filesystem stripe size to align as well
(XFS will do this automatically).
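
For reference, the same geometry can also be given to mkfs.xfs by hand
if it is not picked up automatically; a hedged example for a
hypothetical RAID5 with a 256 KiB chunk and 5 data disks (not Robin's
actual setup):

# su = chunk size, sw = number of data disks
mkfs.xfs -d su=256k,sw=5 /dev/md0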

I've actually done tests on this with hardware RAID to try to find the
correct partition offset, but wasn't able to see any difference (using
bonnie++ and moving the partition start by one sector at a time).

> # fdisk -l /dev/sdc
>
> Disk /dev/sdc: 150.0 GB, 150039945216 bytes
> 255 heads, 63 sectors/track, 18241 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
> Disk identifier: 0x5667c24a
>
>Device Boot  Start End  Blocks   Id  System
> /dev/sdc1   1   18241   146520801   fd  Linux raid 
> autodetect
>
This looks to be a normal disk - the partition offsets shouldn't be
relevant here (barring any knowledge of the actual physical disk layout
anyway, and block remapping may well make that rather irrelevant).

That's my take on this one anyway.

Cheers,
Robin
-- 
 ___
( ' } |   Robin Hill<[EMAIL PROTECTED]> |
   / / )  | Little Jim says |
  // !!   |  "He fallen in de water !!" |




Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-19 Thread Justin Piszcz



On Wed, 19 Dec 2007, Bill Davidsen wrote:

I'm going to try another approach, I'll describe it when I get results (or 
not).


http://home.comcast.net/~jpiszcz/align_vs_noalign/

Hardly any difference whatsoever; only on the per-char read/write 
is it any faster..?


Average of 3 runs taken:

$ cat align/*log|grep ,
p63,8G,57683,94,86479,13,55242,8,63495,98,147647,11,434.8,0,16:10:16/64,1334210,10,330,2,120,1,3978,10,312,2
p63,8G,57973,95,76702,11,50830,7,62291,99,136477,10,388.3,0,16:10:16/64,1252548,6,296,1,115,1,7927,20,373,2
p63,8G,57758,95,80847,12,52144,8,63874,98,144747,11,443.4,0,16:10:16/64,1242445,6,303,1,117,1,6767,17,359,2

$ cat noalign/*log|grep ,
p63,8G,57641,94,85494,12,55669,8,63802,98,146925,11,434.8,0,16:10:16/64,1353180,8,314,1,117,1,8684,22,283,2
p63,8G,57705,94,85929,12,56708,8,63855,99,143437,11,436.2,0,16:10:16/64,12211519,29,297,1,113,1,3218,8,325,2
p63,8G,57783,94,78226,11,48580,7,63487,98,137721,10,438.7,0,16:10:16/64,1243229,8,307,1,120,1,4247,11,313,2



Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-19 Thread Justin Piszcz



On Wed, 19 Dec 2007, Bill Davidsen wrote:


Justin Piszcz wrote:



On Wed, 19 Dec 2007, Bill Davidsen wrote:


Justin Piszcz wrote:



On Wed, 19 Dec 2007, Mattias Wadenstein wrote:


On Wed, 19 Dec 2007, Justin Piszcz wrote:


--

Now to my setup / question:

# fdisk -l /dev/sdc

Disk /dev/sdc: 150.0 GB, 150039945216 bytes
255 heads, 63 sectors/track, 18241 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x5667c24a

  Device Boot  Start End  Blocks   Id  System
/dev/sdc1   1   18241   146520801   fd  Linux raid 
autodetect


---

If I use 10-disk RAID5 with 1024 KiB stripe, what would be the correct 
start and end size if I wanted to make sure the RAID5 was stripe 
aligned?


Or is there a better way to do this, does parted handle this situation 
better?


From that setup it seems simple, scrap the partition table and use the 
disk device for raid. This is what we do for all data storage disks (hw 
raid) and sw raid members.


/Mattias Wadenstein



Is there any downside to doing that?  I remember when I had to take my 
machine apart for a BIOS downgrade when I plugged in the sata devices 
again I did not plug them back in the same order, everything worked of 
course but when I ran LILO it said it was not part of the RAID set, 
because /dev/sda had become /dev/sdg and overwrote the MBR on the disk, 
if I had not used partitions here, I'd have lost (or more of the drives) 
due to a bad LILO run?


As other posts have detailed, putting the partition on a 64k aligned 
boundary can address the performance problems. However, a poor choice of 
chunk size, cache_buffer size, or just random i/o in small sizes can eat 
up a lot of the benefit.


I don't think you need to give up your partitions to get the benefit of 
alignment.


--
Bill Davidsen <[EMAIL PROTECTED]>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismark


Hrmm..

I am doing a benchmark now with:

6 x 400GB (SATA) / 256 KiB stripe with unaligned vs. aligned raid setup.

unligned, just fdisk /dev/sdc, mkpartition, fd raid.
 aligned, fdisk, expert, start at 512 as the off-set

Per a Microsoft KB:

Example of alignment calculations in kilobytes for a 256-KB stripe unit 
size:

(63 * .5) / 256 = 0.123046875
(64 * .5) / 256 = 0.125
(128 * .5) / 256 = 0.25
(256 * .5) / 256 = 0.5
(512 * .5) / 256 = 1
These examples shows that the partition is not aligned correctly for a 
256-KB stripe unit size until the partition is created by using an offset 
of 512 sectors (512 bytes per sector).


So I should start at 512 for a 256k chunk size.

I ran bonnie++ three consecutive times and took the average for the 
unaligned, rebuilding the RAID5 now and then I will re-execute the test 3 
additional times and take the average of that.


I'm going to try another approach, I'll describe it when I get results (or 
not).


Waiting for the raid to rebuild then I will re-run thereafter.

  [=>...]  recovery = 86.7% (339104640/390708480) 
finish=30.8min speed=27835K/sec


...




Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-19 Thread Bill Davidsen

Justin Piszcz wrote:



On Wed, 19 Dec 2007, Bill Davidsen wrote:


Justin Piszcz wrote:



On Wed, 19 Dec 2007, Mattias Wadenstein wrote:


On Wed, 19 Dec 2007, Justin Piszcz wrote:


--

Now to my setup / question:

# fdisk -l /dev/sdc

Disk /dev/sdc: 150.0 GB, 150039945216 bytes
255 heads, 63 sectors/track, 18241 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x5667c24a

  Device Boot  Start End  Blocks   Id  System
/dev/sdc1   1   18241   146520801   fd  Linux raid 
autodetect


---

If I use 10-disk RAID5 with 1024 KiB stripe, what would be the 
correct start and end size if I wanted to make sure the RAID5 was 
stripe aligned?


Or is there a better way to do this, does parted handle this 
situation better?


From that setup it seems simple, scrap the partition table and use 
the 
disk device for raid. This is what we do for all data storage disks 
(hw raid) and sw raid members.


/Mattias Wadenstein



Is there any downside to doing that?  I remember when I had to take 
my machine apart for a BIOS downgrade when I plugged in the sata 
devices again I did not plug them back in the same order, everything 
worked of course but when I ran LILO it said it was not part of the 
RAID set, because /dev/sda had become /dev/sdg and overwrote the MBR 
on the disk, if I had not used partitions here, I'd have lost (or 
more of the drives) due to a bad LILO run?


As other posts have detailed, putting the partition on a 64k aligned 
boundary can address the performance problems. However, a poor choice 
of chunk size, cache_buffer size, or just random i/o in small sizes 
can eat up a lot of the benefit.


I don't think you need to give up your partitions to get the benefit 
of alignment.


--
Bill Davidsen <[EMAIL PROTECTED]>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismark


Hrmm..

I am doing a benchmark now with:

6 x 400GB (SATA) / 256 KiB stripe with unaligned vs. aligned raid setup.

unligned, just fdisk /dev/sdc, mkpartition, fd raid.
 aligned, fdisk, expert, start at 512 as the off-set

Per a Microsoft KB:

Example of alignment calculations in kilobytes for a 256-KB stripe 
unit size:

(63 * .5) / 256 = 0.123046875
(64 * .5) / 256 = 0.125
(128 * .5) / 256 = 0.25
(256 * .5) / 256 = 0.5
(512 * .5) / 256 = 1
These examples shows that the partition is not aligned correctly for a 
256-KB stripe unit size until the partition is created by using an 
offset of 512 sectors (512 bytes per sector).


So I should start at 512 for a 256k chunk size.

I ran bonnie++ three consecutive times and took the average for the 
unaligned, rebuilding the RAID5 now and then I will re-execute the 
test 3 additional times and take the average of that.


I'm going to try another approach, I'll describe it when I get results 
(or not).


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 





Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-19 Thread Justin Piszcz



On Wed, 19 Dec 2007, Bill Davidsen wrote:


Justin Piszcz wrote:



On Wed, 19 Dec 2007, Mattias Wadenstein wrote:


On Wed, 19 Dec 2007, Justin Piszcz wrote:


--

Now to my setup / question:

# fdisk -l /dev/sdc

Disk /dev/sdc: 150.0 GB, 150039945216 bytes
255 heads, 63 sectors/track, 18241 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x5667c24a

  Device Boot  Start End  Blocks   Id  System
/dev/sdc1   1   18241   146520801   fd  Linux raid 
autodetect


---

If I use 10-disk RAID5 with 1024 KiB stripe, what would be the correct 
start and end size if I wanted to make sure the RAID5 was stripe aligned?


Or is there a better way to do this, does parted handle this situation 
better?


From that setup it seems simple, scrap the partition table and use the 
disk device for raid. This is what we do for all data storage disks (hw 
raid) and sw raid members.


/Mattias Wadenstein



Is there any downside to doing that?  I remember when I had to take my 
machine apart for a BIOS downgrade when I plugged in the sata devices again 
I did not plug them back in the same order, everything worked of course but 
when I ran LILO it said it was not part of the RAID set, because /dev/sda 
had become /dev/sdg and overwrote the MBR on the disk, if I had not used 
partitions here, I'd have lost (or more of the drives) due to a bad LILO 
run?


As other posts have detailed, putting the partition on a 64k aligned boundary 
can address the performance problems. However, a poor choice of chunk size, 
cache_buffer size, or just random i/o in small sizes can eat up a lot of the 
benefit.


I don't think you need to give up your partitions to get the benefit of 
alignment.


--
Bill Davidsen <[EMAIL PROTECTED]>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismark 



Hrmm..

I am doing a benchmark now with:

6 x 400GB (SATA) / 256 KiB stripe with unaligned vs. aligned raid setup.

unaligned, just fdisk /dev/sdc, mkpartition, fd raid.
 aligned, fdisk, expert, start at 512 as the offset

Per a Microsoft KB:

Example of alignment calculations in kilobytes for a 256-KB stripe unit 
size:

(63 * .5) / 256 = 0.123046875
(64 * .5) / 256 = 0.125
(128 * .5) / 256 = 0.25
(256 * .5) / 256 = 0.5
(512 * .5) / 256 = 1
These examples show that the partition is not aligned correctly for a 
256-KB stripe unit size until the partition is created by using an offset 
of 512 sectors (512 bytes per sector).


So I should start at 512 for a 256k chunk size.
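
A sketch of one way to get that 512-sector start onto a member disk
with sfdisk, in sector units (the disk name is an example; fdisk's
expert-mode 'b' command does the same thing interactively):

# one partition covering the disk, starting at sector 512, type fd (raid autodetect)
echo "512,,fd" | sfdisk -uS /dev/sdc
# confirm the start sector
sfdisk -uS -l /dev/sdc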

I ran bonnie++ three consecutive times and took the average for the 
unaligned case; I am rebuilding the RAID5 now, and then I will re-execute 
the test 3 additional times and take the average of that.


Justin.


Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-19 Thread Jon Nelson
On 12/19/07, Bill Davidsen <[EMAIL PROTECTED]> wrote:
> As other posts have detailed, putting the partition on a 64k aligned
> boundary can address the performance problems. However, a poor choice of
> chunk size, cache_buffer size, or just random i/o in small sizes can eat
> up a lot of the benefit.
>
> I don't think you need to give up your partitions to get the benefit of
> alignment.

How might that benefit be realized?
Assume I have 3 disks, /dev/sd{b,c,d} all partitioned identically with
4 partitions, and I want to use /dev/sd{b,c,d}3 for a new SW raid.

What sequence of steps can I take to ensure that my raid is aligned on
a 64K boundary?
What effect do the different superblock formats have, if any, in this situation?
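
(Checking whether an existing partition start already falls on a 64 KiB
boundary is easy to script; a sketch using the sysfs start attribute,
with disk and partition names following the example above:

# 64 KiB = 128 sectors of 512 bytes
for d in sdb sdc sdd; do
    start=$(cat /sys/block/$d/${d}3/start)
    printf "%s3 starts at sector %s: " "$d" "$start"
    [ $(( start % 128 )) -eq 0 ] && echo "64K aligned" || echo "not aligned"
done
)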

-- 
Jon



Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-19 Thread Bill Davidsen

Justin Piszcz wrote:



On Wed, 19 Dec 2007, Mattias Wadenstein wrote:


On Wed, 19 Dec 2007, Justin Piszcz wrote:


--

Now to my setup / question:

# fdisk -l /dev/sdc

Disk /dev/sdc: 150.0 GB, 150039945216 bytes
255 heads, 63 sectors/track, 18241 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x5667c24a

  Device Boot  Start End  Blocks   Id  System
/dev/sdc1   1   18241   146520801   fd  Linux raid 
autodetect


---

If I use 10-disk RAID5 with 1024 KiB stripe, what would be the 
correct start and end size if I wanted to make sure the RAID5 was 
stripe aligned?


Or is there a better way to do this, does parted handle this 
situation better?


From that setup it seems simple, scrap the partition table and use the 
disk device for raid. This is what we do for all data storage disks 
(hw raid) and sw raid members.


/Mattias Wadenstein



Is there any downside to doing that?  I remember when I had to take my 
machine apart for a BIOS downgrade when I plugged in the sata devices 
again I did not plug them back in the same order, everything worked of 
course but when I ran LILO it said it was not part of the RAID set, 
because /dev/sda had become /dev/sdg and overwrote the MBR on the 
disk, if I had not used partitions here, I'd have lost (or more of the 
drives) due to a bad LILO run?


As other posts have detailed, putting the partition on a 64k aligned 
boundary can address the performance problems. However, a poor choice of 
chunk size, cache_buffer size, or just random i/o in small sizes can eat 
up a lot of the benefit.


I don't think you need to give up your partitions to get the benefit of 
alignment.


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 





Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-19 Thread Justin Piszcz



On Wed, 19 Dec 2007, Jon Nelson wrote:


On 12/19/07, Justin Piszcz <[EMAIL PROTECTED]> wrote:



On Wed, 19 Dec 2007, Mattias Wadenstein wrote:

From that setup it seems simple, scrap the partition table and use the

disk device for raid. This is what we do for all data storage disks (hw raid)
and sw raid members.

/Mattias Wadenstein



Is there any downside to doing that?  I remember when I had to take my


There is one (just pointed out to me yesterday): having the partition
and having it labeled as raid makes identification quite a bit easier
for humans and software, too.

--
Jon



Some nice graphs found here:
http://sqlblog.com/blogs/linchi_shea/archive/2007/02/01/performance-impact-of-disk-misalignment.aspx



Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-19 Thread Jon Nelson
On 12/19/07, Justin Piszcz <[EMAIL PROTECTED]> wrote:
>
>
> On Wed, 19 Dec 2007, Mattias Wadenstein wrote:
> >> From that setup it seems simple, scrap the partition table and use the
> > disk device for raid. This is what we do for all data storage disks (hw 
> > raid)
> > and sw raid members.
> >
> > /Mattias Wadenstein
> >
>
> Is there any downside to doing that?  I remember when I had to take my

There is one (just pointed out to me yesterday): having the partition
and having it labeled as raid makes identification quite a bit easier
for humans and software, too.

-- 
Jon


Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-19 Thread Justin Piszcz



On Wed, 19 Dec 2007, Mattias Wadenstein wrote:


On Wed, 19 Dec 2007, Justin Piszcz wrote:


--

Now to my setup / question:

# fdisk -l /dev/sdc

Disk /dev/sdc: 150.0 GB, 150039945216 bytes
255 heads, 63 sectors/track, 18241 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x5667c24a

  Device Boot  Start End  Blocks   Id  System
/dev/sdc1   1   18241   146520801   fd  Linux raid 
autodetect


---

If I use 10-disk RAID5 with 1024 KiB stripe, what would be the correct 
start and end size if I wanted to make sure the RAID5 was stripe aligned?


Or is there a better way to do this, does parted handle this situation 
better?


From that setup it seems simple, scrap the partition table and use the 
disk device for raid. This is what we do for all data storage disks (hw raid) 
and sw raid members.


/Mattias Wadenstein



Is there any downside to doing that?  I remember when I had to take my 
machine apart for a BIOS downgrade; when I plugged in the sata devices 
again I did not plug them back in the same order.  Everything worked of 
course, but when I ran LILO it said it was not part of the RAID set, 
because /dev/sda had become /dev/sdg, and it overwrote the MBR on the 
disk.  If I had not used partitions here, wouldn't I have lost one (or 
more) of the drives due to a bad LILO run?


Justin.


Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-19 Thread Mattias Wadenstein

On Wed, 19 Dec 2007, Justin Piszcz wrote:


--

Now to my setup / question:

# fdisk -l /dev/sdc

Disk /dev/sdc: 150.0 GB, 150039945216 bytes
255 heads, 63 sectors/track, 18241 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x5667c24a

  Device Boot  Start End  Blocks   Id  System
/dev/sdc1   1   18241   146520801   fd  Linux raid autodetect

---

If I use 10-disk RAID5 with 1024 KiB stripe, what would be the correct start 
and end size if I wanted to make sure the RAID5 was stripe aligned?


Or is there a better way to do this, does parted handle this situation 
better?


From that setup it seems simple, scrap the partition table and use the 
disk device for raid. This is what we do for all data storage disks (hw 
raid) and sw raid members.


/Mattias Wadenstein