mdadm: unable to add a disk to degraded raid1 array

2007-12-29 Thread Dan Williams
In case someone else happens upon this: I have found that mdadm >=
v2.6.2 cannot add a disk to a degraded raid1 array created with mdadm
< 2.6.2.

I bisected the problem down to mdadm git commit
2fb749d1b7588985b1834e43de4ec5685d0b8d26 which appears to make an
incompatible change to the super block's 'data_size' field.

--- sdb1-sb-good.hex    2007-12-12 14:31:42.000000000 +0000
+++ sdb1-sb-bad.hex     2007-12-12 14:31:36.000000000 +0000
@@ -6,12 +6,12 @@
 050 60d8 0077     0004 
 060        
 *
-080     60d8 0077  
+080     60d0 0077  

Which trips up the if (rdev->size < le64_to_cpu(sb->data_size)/2)
check in super_1_load [1], resulting in:

mdadm: add new device failed for /dev/sdb1 as 4: Invalid argument
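
For anyone trying to reproduce: a minimal sketch of the failure mode
(device names and metadata version are illustrative, not from my setup):

  # degraded raid1 created with an older (< 2.6.2) mdadm:
  mdadm --create /dev/md0 --metadata=1.0 --level=1 --raid-devices=2 \
      /dev/sda1 missing
  # the add with mdadm >= v2.6.2 then fails as above:
  mdadm /dev/md0 --add /dev/sdb1
  # compare the data size each mdadm version recorded:
  mdadm --examine /dev/sdb1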

--
Dan

[1] http://lxr.linux.no/linux/drivers/md/md.c#L1148


Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-29 Thread dean gaudet
hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on 
the same 64k chunk array and had raised the stripe_cache_size to 1024... 
and got a hang.  this time i grabbed stripe_cache_active before bumping 
the size again -- it was only 905 active.  as i recall the bug we were 
debugging a year+ ago the active was at the size when it would hang.  so 
this is probably something new.

anyhow raising it to 2048 got it unstuck, but i'm guessing i'll be able to 
hit that limit too if i try harder :)
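
(fwiw the knobs i'm poking are the md sysfs files -- a sketch, md device
name illustrative here:)

  cat /sys/block/md2/md/stripe_cache_active   # e.g. 905
  echo 2048 > /sys/block/md2/md/stripe_cache_size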

btw what units are stripe_cache_size/active in?  is the memory consumed 
equal to (chunk_size * raid_disks * stripe_cache_size) or (chunk_size * 
raid_disks * stripe_cache_active)?

-dean

On Thu, 27 Dec 2007, dean gaudet wrote:

 hmm this seems more serious... i just ran into it with chunksize 64KiB and 
 while just untarring a bunch of linux kernels in parallel... increasing 
 stripe_cache_size did the trick again.
 
 -dean
 
 On Thu, 27 Dec 2007, dean gaudet wrote:
 
  hey neil -- remember that raid5 hang which me and only one or two others 
  ever experienced and which was hard to reproduce?  we were debugging it 
  well over a year ago (that box has 400+ day uptime now so at least that 
  long ago :)  the workaround was to increase stripe_cache_size... i seem to 
  have a way to reproduce something which looks much the same.
  
  setup:
  
  - 2.6.24-rc6
  - system has 8GiB RAM but no swap
  - 8x750GB in a raid5 with one spare, chunksize 1024KiB.
  - mkfs.xfs default options
  - mount -o noatime
  - dd if=/dev/zero of=/mnt/foo bs=4k count=2621440
  
  that sequence hangs for me within 10 seconds... and i can unhang / rehang 
  it by toggling between stripe_cache_size 256 and 1024.  i detect the hang 
  by watching iostat -kx /dev/sd? 5.
  
  i've attached the kernel log where i dumped task and timer state while it 
  was hung... note that you'll see at some point i did an xfs mount with 
  external journal but it happens with internal journal as well.
  
  looks like it's using the raid456 module and async api.
  
  anyhow let me know if you need more info / have any suggestions.
  
  -dean


Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-29 Thread dean gaudet
On Tue, 25 Dec 2007, Bill Davidsen wrote:

 The issue I'm thinking about is hardware sector size, which on modern drives
 may be larger than 512b and therefore entail a read-alter-rewrite (RAR) cycle
 when writing a 512b block.

i'm not sure any shipping SATA disks have larger than 512B sectors yet... 
do you know of any?  (or is this thread about SCSI which i don't pay 
attention to...)

on a brand new WDC WD7500AAKS-00RBA0 with this partition layout:

255 heads, 63 sectors/track, 91201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

so sda1 starts at sector 63, i.e. 63 * 512 = 32256 bytes into the disk,
which is not a multiple of 4096.
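
(a quick way to check the alignment of any partition -- a sketch; the
start sector comes from fdisk:)

  fdisk -lu /dev/sda             # partition start in 512-byte sectors
  echo $(( 63 * 512 % 4096 ))    # 3584, i.e. sda1 is not 4KiB-aligned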

i ran some random seek+write experiments using
http://arctic.org/~dean/randomio/, here are the results using 512 byte
and 4096 byte writes (fsync after each write), 8 threads, on sda1:

# ./randomio /dev/sda1 8 1 1 512 10 6
  total |  read:         latency (ms)       |  write:        latency (ms)
   iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
--------+-----------------------------------+----------------------------------
  148.5 |    0.0   inf    nan    0.0    nan |  148.5   0.2   53.7   89.3   19.5
  129.2 |    0.0   inf    nan    0.0    nan |  129.2  37.2   61.9   96.7    9.3
  131.2 |    0.0   inf    nan    0.0    nan |  131.2  40.3   61.0   90.4    9.3
  132.0 |    0.0   inf    nan    0.0    nan |  132.0  39.6   60.6   89.3    9.1
  130.7 |    0.0   inf    nan    0.0    nan |  130.7  39.8   61.3   98.1    8.9
  131.4 |    0.0   inf    nan    0.0    nan |  131.4  40.0   60.8  101.0    9.6
# ./randomio /dev/sda1 8 1 1 4096 10 6
  total |  read:         latency (ms)       |  write:        latency (ms)
   iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
--------+-----------------------------------+----------------------------------
  141.7 |    0.0   inf    nan    0.0    nan |  141.7   0.3   56.3   99.3   21.1
  132.4 |    0.0   inf    nan    0.0    nan |  132.4  43.3   60.4   91.8    8.5
  131.6 |    0.0   inf    nan    0.0    nan |  131.6  41.4   60.9  111.0    9.6
  131.8 |    0.0   inf    nan    0.0    nan |  131.8  41.4   60.7   85.3    8.6
  130.6 |    0.0   inf    nan    0.0    nan |  130.6  41.7   61.3   95.0    9.4
  131.4 |    0.0   inf    nan    0.0    nan |  131.4  42.2   60.8   90.5    8.4


i think the anomalous results in the first 10s samples are perhaps the drive
coming out of a standby state.

and here are the results aligned using the sda raw device itself:

# ./randomio /dev/sda 8 1 1 512 10 6
  total |  read:         latency (ms)       |  write:        latency (ms)
   iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
--------+-----------------------------------+----------------------------------
  147.3 |    0.0   inf    nan    0.0    nan |  147.3   0.3   54.1   93.7   20.1
  132.4 |    0.0   inf    nan    0.0    nan |  132.4  37.4   60.6   91.8    9.2
  132.5 |    0.0   inf    nan    0.0    nan |  132.5  37.7   60.3   93.7    9.3
  131.8 |    0.0   inf    nan    0.0    nan |  131.8  39.4   60.7   92.7    9.0
  133.9 |    0.0   inf    nan    0.0    nan |  133.9  41.7   59.8   90.7    8.5
  130.2 |    0.0   inf    nan    0.0    nan |  130.2  40.8   61.5   88.6    8.9
# ./randomio /dev/sda 8 1 1 4096 10 6
  total |  read:         latency (ms)       |  write:        latency (ms)
   iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
--------+-----------------------------------+----------------------------------
  145.4 |    0.0   inf    nan    0.0    nan |  145.4   0.3   54.9   94.0   20.1
  130.3 |    0.0   inf    nan    0.0    nan |  130.3  36.0   61.4   92.7    9.6
  130.6 |    0.0   inf    nan    0.0    nan |  130.6  38.2   61.2   96.7    9.2
  132.1 |    0.0   inf    nan    0.0    nan |  132.1  39.0   60.5   93.5    9.2
  131.8 |    0.0   inf    nan    0.0    nan |  131.8  43.1   60.8   93.8    9.1
  129.0 |    0.0   inf    nan    0.0    nan |  129.0  40.2   62.0   96.4    8.8

it looks pretty much the same to me...

-dean


Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-29 Thread Justin Piszcz



On Sat, 29 Dec 2007, dean gaudet wrote:

 i'm not sure any shipping SATA disks have larger than 512B sectors yet...
 do you know of any?  (or is this thread about SCSI which i don't pay
 attention to...)

 [... randomio benchmark results snipped; see dean's message above ...]

 it looks pretty much the same to me...

 -dean

Good to know/have it confirmed by someone else: the alignment does not
matter with Linux/SW RAID.


Justin.


Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-29 Thread Dan Williams
On Dec 29, 2007 9:48 AM, dean gaudet [EMAIL PROTECTED] wrote:
 hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on
 the same 64k chunk array and had raised the stripe_cache_size to 1024...
 and got a hang.  this time i grabbed stripe_cache_active before bumping
 the size again -- it was only 905 active.  as i recall the bug we were
 debugging a year+ ago the active was at the size when it would hang.  so
 this is probably something new.

I believe I am seeing the same issue and am trying to track down
whether XFS is doing something unexpected, i.e. I have not been able
to reproduce the problem with EXT3.  MD tries to increase throughput
by letting some stripe work build up in batches.  It looks like every
time your system has hung it has been in the 'inactive_blocked' state
i.e. > 3/4 of stripes active.  This state should automatically
clear...


 anyhow raising it to 2048 got it unstuck, but i'm guessing i'll be able to
 hit that limit too if i try harder :)

Once you hang, if 'stripe_cache_size' is increased such that
stripe_cache_active < 3/4 * stripe_cache_size, things will start
flowing again.
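
In sysfs terms (a sketch; md device name illustrative):

  active=$(cat /sys/block/md2/md/stripe_cache_active)
  size=$(cat /sys/block/md2/md/stripe_cache_size)
  echo $(( active * 4 < size * 3 ))   # prints 1 once I/O can flow again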


 btw what units are stripe_cache_size/active in?  is the memory consumed
 equal to (chunk_size * raid_disks * stripe_cache_size) or (chunk_size *
 raid_disks * stripe_cache_active)?


memory_consumed = PAGE_SIZE * raid_disks * stripe_cache_size
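
For example, with 4096-byte pages, 7 raid disks, and a stripe_cache_size
of 1024 (numbers illustrative):

  echo $(( 4096 * 7 * 1024 ))   # 29360128 bytes, i.e. 28 MiB locked down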


 -dean


--
Dan


Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-29 Thread dean gaudet
On Sat, 29 Dec 2007, Dan Williams wrote:

 On Dec 29, 2007 9:48 AM, dean gaudet [EMAIL PROTECTED] wrote:
  hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on
  the same 64k chunk array and had raised the stripe_cache_size to 1024...
  and got a hang.  this time i grabbed stripe_cache_active before bumping
  the size again -- it was only 905 active.  as i recall the bug we were
  debugging a year+ ago the active was at the size when it would hang.  so
  this is probably something new.
 
 I believe I am seeing the same issue and am trying to track down
 whether XFS is doing something unexpected, i.e. I have not been able
 to reproduce the problem with EXT3.  MD tries to increase throughput
 by letting some stripe work build up in batches.  It looks like every
 time your system has hung it has been in the 'inactive_blocked' state
 i.e. > 3/4 of stripes active.  This state should automatically
 clear...

cool, glad you can reproduce it :)

i have a bit more data... i'm seeing the same problem on debian's 
2.6.22-3-amd64 kernel, so it's not new in 2.6.24.

i'm doing some more isolation but just grabbing kernels i have precompiled 
so far -- a 2.6.19.7 kernel doesn't show the problem, and early 
indications are a 2.6.21.7 kernel also doesn't have the problem but i'm 
giving it longer to show its head.

i'll try a stock 2.6.22 next depending on how the 2.6.21 test goes, just 
so we get the debian patches out of the way.

i was tempted to blame async api because it's newish :)  but according to 
the dmesg output it doesn't appear the 2.6.22-3-amd64 kernel used async 
API, and it still hung, so async is probably not to blame.

anyhow the test case i'm using is the dma_thrasher script i attached... it 
takes about an hour to give me confidence there's no problems so this will 
take a while.

-dean


[patch] improve stripe_cache_size documentation

2007-12-29 Thread dean gaudet
Document the amount of memory used by the stripe cache and the fact that 
it's tied down and unavailable for other purposes (right?).  thanks to Dan 
Williams for the formula.

-dean

Signed-off-by: dean gaudet [EMAIL PROTECTED]

Index: linux/Documentation/md.txt
===================================================================
--- linux.orig/Documentation/md.txt 2007-12-29 13:01:25.0 -0800
+++ linux/Documentation/md.txt  2007-12-29 13:04:17.0 -0800
@@ -438,5 +438,11 @@
   stripe_cache_size  (currently raid5 only)
   number of entries in the stripe cache.  This is writable, but
   there are upper and lower limits (32768, 16).  Default is 128.
+
+  The stripe cache memory is locked down and not available for other uses.
+  The total size of the stripe cache is determined by this formula:
+
+PAGE_SIZE * raid_disks * stripe_cache_size
+
   strip_cache_active (currently raid5 only)
   number of active entries in the stripe cache


Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-29 Thread Justin Piszcz



On Sat, 29 Dec 2007, dean gaudet wrote:

 i have a bit more data... i'm seeing the same problem on debian's
 2.6.22-3-amd64 kernel, so it's not new in 2.6.24.

 [... kernel bisection details snipped; see dean's message above ...]

 anyhow the test case i'm using is the dma_thrasher script i attached... it
 takes about an hour to give me confidence there's no problems so this will
 take a while.

 -dean

Dean,

Curious, btw: what filesystem size / raid type (5, but defaults I assume,
nothing special, right? right-symmetric vs. left-symmetric, etc.) / cache
size / chunk size(s) are you using/testing with?

The script you sent out earlier -- you are able to reproduce it easily with
31 or so kernel tar decompressions?


Justin.


Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-29 Thread Dan Williams
On Dec 29, 2007 1:58 PM, dean gaudet [EMAIL PROTECTED] wrote:
 On Sat, 29 Dec 2007, Dan Williams wrote:

  []

 cool, glad you can reproduce it :)

 i have a bit more data... i'm seeing the same problem on debian's
 2.6.22-3-amd64 kernel, so it's not new in 2.6.24.


This is just brainstorming at this point, but it looks like xfs can
submit more requests in the bi_end_io path such that it can lock
itself out of the RAID array.  The sequence that concerns me is:

return_io->xfs_buf_end_io->xfs_buf_io_end->xfs_buf_iodone_work->xfs_buf_iorequest->make_request->hang

I need verify whether this path is actually triggering, but if we are
in an inactive_blocked condition this new request will be put on a
wait queue and we'll never get to the release_stripe() call after
return_io().  It would be interesting to see if this is new XFS
behavior in recent kernels.

--
Dan


Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-29 Thread dean gaudet
On Sat, 29 Dec 2007, Justin Piszcz wrote:

 Curious, btw: what filesystem size / raid type (5, but defaults I assume,
 nothing special, right? right-symmetric vs. left-symmetric, etc.) / cache
 size / chunk size(s) are you using/testing with?

mdadm --create --level=5 --chunk=64 -n7 -x1 /dev/md2 /dev/sd[a-h]1
mkfs.xfs -f /dev/md2

otherwise defaults

 The script you sent out earlier, you are able to reproduce it easily with 31
 or so kernel tar decompressions?

not sure, the point of the script is to untar more than there is RAM.  it
happened with a single rsync running though -- 3.5M inodes from a remote
box.  it also happens with the single 10GB dd write... although i've been
using the tar method for testing different kernel revs.
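
(for those without the attachment, the gist is something like this -- a
sketch, not the actual dma_thrasher script:)

  # untar more kernel trees than fit in RAM, in parallel
  for i in $(seq 1 31); do
      mkdir -p /mnt/new/$i && tar xzf linux.tar.gz -C /mnt/new/$i &
  done
  wait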

-dean


Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-29 Thread dean gaudet

On Sat, 29 Dec 2007, dean gaudet wrote:

 On Sat, 29 Dec 2007, Justin Piszcz wrote:
 
  Curious, btw: what filesystem size / raid type (5, but defaults I assume,
  nothing special, right? right-symmetric vs. left-symmetric, etc.) / cache
  size / chunk size(s) are you using/testing with?
 
 mdadm --create --level=5 --chunk=64 -n7 -x1 /dev/md2 /dev/sd[a-h]1
 mkfs.xfs -f /dev/md2
 
 otherwise defaults

hmm i missed a few things, here's exactly how i created the array:

mdadm --create --level=5 --chunk=64 -n7 -x1 --assume-clean /dev/md2 /dev/sd[a-h]1

it's reassembled automagically each reboot, but i do this each reboot:

mkfs.xfs -f /dev/md2
mount -o noatime /dev/md2 /mnt/new
./dma_thrasher linux.tar.gz /mnt/new

the --assume-clean and noatime probably make no difference though...

on the bisection front it looks like it's new behaviour between 2.6.21.7 
and 2.6.22.15 (stock kernels now, not debian).

i've got to step out for a while, but i'll go at it again later, probably 
with git bisect unless someone has some cherry picked changes to suggest.

-dean


Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-29 Thread Michael Tokarev
Justin Piszcz wrote:
[]
 Good to know/have it confirmed by someone else, the alignment does not
 matter with Linux/SW RAID.

Alignment matters when one partitions a Linux/SW raid array.
If the inside partitions are not aligned on a stripe boundary --
especially in the worst case, when filesystem blocks cross the
stripe boundary (I wonder if that's even possible... and I think
it is: if a partition starts at some odd 512-byte boundary and
the filesystem block size is 4Kb) -- then there's just no chance
for the inside filesystem to do full-stripe writes, ever, so
(modulo the stripe cache) all writes will
go read-modify-write or a similar way.
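
A worked example of that arithmetic (numbers illustrative): a 7-drive
raid5 with 64KiB chunks has a full stripe of 6 data chunks:

  echo $(( 6 * 64 * 1024 ))     # 393216 bytes per full stripe
  echo $(( 63 * 512 % 4096 ))   # 3584 -- a partition starting at sector
                                # 63 is not even 4KiB-aligned, so the
                                # inner fs can never do full-stripe writes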

And that's what the original article is about, by the way.
It just happens that a hardware raid array is more often split
into partitions (using native tools) than linux software raid
arrays are.

And that's what has been pointed out in this thread, as well... ;)

/mjt