Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-25 Thread Bill Davidsen

Robin Hill wrote:

On Wed Dec 19, 2007 at 09:50:16AM -0500, Justin Piszcz wrote:

  

The (up to) 30% figure is mentioned here:
http://insights.oetiker.ch/linux/raidoptimization.html



That looks to be referring to partitioning a RAID device - this'll only
apply to hardware RAID or partitionable software RAID, not to the normal
use case.  When you're creating an array out of standard partitions then
you know the array stripe size will align with the disks (there's no way
it cannot), and you can set the filesystem stripe size to align as well
(XFS will do this automatically).

I've actually done tests on this with hardware RAID to try to find the
correct partition offset, but wasn't able to see any difference (using
bonnie++ and moving the partition start by one sector at a time).

  

# fdisk -l /dev/sdc

Disk /dev/sdc: 150.0 GB, 150039945216 bytes
255 heads, 63 sectors/track, 18241 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x5667c24a

   Device Boot  Start End  Blocks   Id  System
/dev/sdc1   1   18241   146520801   fd  Linux raid autodetect




This looks to be a normal disk - the partition offsets shouldn't be
relevant here (barring any knowledge of the actual physical disk layout
anyway, and block remapping may well make that rather irrelevant).
  
The issue I'm thinking about is hardware sector size, which on modern 
drives may be larger than 512 bytes and therefore entail a read-alter-rewrite 
(RAR) cycle when writing a 512-byte block. With larger writes, if the 
alignment is poor and the write size is some multiple of 512, it's 
possible to have a RAR at each end of the write. The only way to have a 
hope of controlling the alignment is to write to a raw device, or to use 
a filesystem which can be configured to use blocks that are a multiple 
of the sector size, do all i/o in block-sized units, and start each file 
on a block boundary. That may be possible with ext[234] set up properly.
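
To make the alignment arithmetic concrete, here is a rough sketch
(the 4096-byte physical sector is an assumed example, not a claim about
any particular drive) of counting the RAR cycles a write would incur:

/* rar_count.c - rough sketch: count read-alter-rewrite (RAR) cycles a
 * write would incur on a drive whose physical sector is larger than the
 * 512-byte logical one.  The 4096-byte figure is an assumed example. */
#include <stdio.h>

#define PHYS_SECTOR 4096ULL	/* assumed physical sector size in bytes */

static int rar_cycles(unsigned long long offset, unsigned long long len)
{
	unsigned long long first = offset / PHYS_SECTOR;
	unsigned long long last  = (offset + len - 1) / PHYS_SECTOR;
	int head = (offset % PHYS_SECTOR) != 0;		/* ragged start */
	int tail = ((offset + len) % PHYS_SECTOR) != 0;	/* ragged end   */

	if (first == last)	/* whole write inside one physical sector */
		return (head || tail) ? 1 : 0;
	return head + tail;	/* possibly one RAR at each end */
}

int main(void)
{
	/* partition data starting at the traditional 63-sector offset:
	 * 63 * 512 = 32256 bytes, which is not a multiple of 4096 */
	printf("4 KiB write at sector 63: %d RAR cycle(s)\n",
	       rar_cycles(63ULL * 512, 4096));
	/* the same write with the partition start moved to sector 64 */
	printf("4 KiB write at sector 64: %d RAR cycle(s)\n",
	       rar_cycles(64ULL * 512, 4096));
	return 0;
}

With those example numbers, every 4 KiB write on the 63-sector start pays 
two RAR cycles, while the aligned start pays none.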


Why this is important: knowing the physical layout of the drive is useful, 
but for a large write the drive will have to make some number of steps from 
one cylinder to another. By carefully choosing the starting point, the 
best improvement possible is to eliminate two track-to-track seek times, one 
at the start and one at the end. If the writes are small, only one t2t 
saving is possible.


Now consider a RAR process. The drive is typically spinning at 7200 rpm, 
or 8.333 ms/rev. A read might take 0.5 rev on average, while a RAR will 
take about 1.5 rev, because it takes a full revolution after the original 
data is read before the altered data can be rewritten. Larger sectors give 
more capacity but reduced write performance, and doing small writes 
can mean paying the RAR penalty on every write. So there may be a 
measurable benefit to getting that alignment right at the drive level.
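
Putting rough numbers on that, using the figures above:

   7200 rpm            -> 60000 ms / 7200 = 8.333 ms per revolution
   plain write,  ~0.5 rev rotational latency  = ~4.2 ms
   read-alter-rewrite, ~1.5 rev               = ~12.5 ms

so a misaligned small write can spend roughly three times as long waiting 
on rotation as an aligned one, before any seek time is even counted.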


--
Bill Davidsen [EMAIL PROTECTED]
 Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over... Otto von Bismarck 





Re: raid10 performance question

2007-12-25 Thread Peter Grandi
 On Sun, 23 Dec 2007 08:26:55 -0600, Jon Nelson
 [EMAIL PROTECTED] said:

 I've found in some tests that raid10,f2 gives me the best I/O
 of any raid5 or raid10 format.

Mostly, depending on type of workload. Anyhow in general most
forms of RAID10 are cool, and handle disk losses better and so
on.

 However, the performance of raid10,o2 and raid10,n2 in
 degraded mode is nearly identical to the non-degraded mode
 performance (for me, this hovers around 100MB/s).

You don't say how many drives you've got, but that may suggest that your
array transfers are limited by the PCI host bus speed.
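
(Back-of-the-envelope, assuming a classic 32-bit/33 MHz PCI bus:

   4 bytes/transfer * 33.3 MHz = ~133 MB/s theoretical peak

so a whole-array plateau around 100 MB/s sustained is quite consistent
with being host-bus limited rather than disk limited.)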

 raid10,f2 has degraded mode performance, writing, that is
 indistinguishable from its non-degraded mode performance

 It's the raid10,f2 *read* performance in degraded mode that is
 strange - I get almost exactly 50% of the non-degraded mode
 read performance. Why is that?

Well, the best description I found of the odd Linux RAID10 modes
is here:

  http://en.Wikipedia.org/wiki/Non-standard_RAID_levels#Linux_MD_RAID_10

The key here is:

   The driver also supports a far layout where all the drives
are divided into f sections.

Now when there are two sections, as in 'f2', each block is written
to the first half of one disk, and its mirror copy to the second
half of the next disk.

Consider this layout for the first 4 blocks on a 4-drive 'f2' array,
compared to the standard 'n2' layout:

     far 'f2'         near 'n2'
       DISK             DISK
    A  B  C  D       A  B  C  D

    1  2  3  4       1  1  2  2
    .  .  .  .       3  3  4  4
    .  .  .  .
    .  .  .  .
    ------------
    4  1  2  3
    .  .  .  .
    .  .  .  .
    .  .  .  .


This means that with the far layout one can read blocks 1,2,3,4
at the same speed as a RAID0 on the outer cylinders of each
disk; but if one of the disks fails, the mirror blocks have to
be read from the inner cylinders of the next disk, which are
usually a lot slower than the outer ones.
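
A small illustrative sketch of that mapping (it follows the simplified
diagram above, with an assumed rotate-by-one rule for the mirror copies,
not the actual md driver code):

/* far2_map.c - illustrative sketch of the block placement described
 * above for a raid10,f2-style array: the primary copy of each block is
 * striped across the outer half of the drives, and the mirror copy goes
 * to the inner half of the *next* drive. */
#include <stdio.h>

struct copy { int disk; long row; };

static void far2_map(long block, int ndisks, struct copy *primary,
                     struct copy *mirror)
{
	long row = block / ndisks;	/* row within each half        */
	int  col = block % ndisks;	/* which drive holds the block */

	primary->disk = col;			/* outer half */
	primary->row  = row;
	mirror->disk  = (col + 1) % ndisks;	/* inner half of next drive */
	mirror->row   = row;
}

int main(void)
{
	struct copy p, m;
	long b;

	for (b = 0; b < 4; b++) {
		far2_map(b, 4, &p, &m);
		printf("block %ld: primary on disk %c (outer), mirror on disk %c (inner)\n",
		       b + 1, 'A' + p.disk, 'A' + m.disk);
	}
	return 0;
}

Running it for the first four blocks reproduces the diagram: block 1 on A
outer with its mirror on B inner, and so on around to block 4 on D outer
with its mirror on A inner.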

Now, there is a very interesting detail here: one idea about
getting a fast array is to make it out of large high density
drives and just use the outer cylinders of each drive, thus at
the same time having a much smaller range of arm travel and
higher transfer rates.

The 'f2' layout means that (until a drive fails) for all reads
and for short writes MD is effectively using just the outer
half of each drive, *as well as* what is effectively a RAID0
layout.

  Note that the sustained writing speed of 'f2' is going to be the
  same *across the whole capacity* of the RAID, while the
  sustained write speed of an 'n2' layout will be higher at the
  beginning and lower at the end, just like for a single disk.

Interesting, I hadn't realized that, even though I am keenly aware
of the non-uniform speeds of disks across cylinders.


Re: Raid over 48 disks

2007-12-25 Thread pg_mh
 On Wed, 19 Dec 2007 07:28:20 +1100, Neil Brown
 [EMAIL PROTECTED] said:

[ ... what to do with 48 drive Sun Thumpers ... ]

neilb I wouldn't create a raid5 or raid6 on all 48 devices.
neilb RAID5 only survives a single device failure and with that
neilb many devices, the chance of a second failure before you
neilb recover becomes appreciable.

That's just one of the many problems; others are:

* If a drive fails, rebuild traffic is going to hit hard, with
  47 blocks being read in parallel to compute a new 48th.

* With a parity strip length of 48 it will be that much harder
  to avoid a read-modify-write, as it will be avoidable only for
  full-stripe writes covering all 47 data blocks, aligned on
  stripe boundaries (see the sketch just below). And reading 47
  blocks to write one is going to be quite painful.
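
A small illustrative check of that alignment condition (the 64 KiB
chunk size is an assumed example value, not from the posting):

/* rmw_check.c - illustrative full-stripe alignment check for an N-drive
 * RAID5.  48 drives gives 47 data chunks per stripe. */
#include <stdio.h>

#define NDRIVES     48
#define CHUNK_BYTES (64ULL * 1024)
#define STRIPE_DATA ((NDRIVES - 1) * CHUNK_BYTES)  /* data in one full stripe */

/* A write can skip the read-modify-write only if it starts on a stripe
 * boundary and covers whole stripes. */
static int avoids_rmw(unsigned long long offset, unsigned long long len)
{
	return len > 0 && offset % STRIPE_DATA == 0 && len % STRIPE_DATA == 0;
}

int main(void)
{
	printf("one full stripe holds %llu KiB of data\n", STRIPE_DATA / 1024);
	printf("4 KiB write at offset 0:  %s\n",
	       avoids_rmw(0, 4096) ? "no RMW" : "RMW");
	printf("one aligned full stripe:  %s\n",
	       avoids_rmw(0, STRIPE_DATA) ? "no RMW" : "RMW");
	return 0;
}

With those numbers a write has to be an aligned multiple of roughly 3 MB
before the read-modify-write can be skipped at all.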

[ ... ]

neilb RAID10 would be a good option if you are happy with 24
neilb drives worth of space. [ ... ]

That sounds like the only feasible option (except, in most cases,
for the 3-drive one). Parity RAID does not scale much beyond 3-4
drives.

neilb Alternately, 8 6-drive RAID5s or 6 8-drive RAID6s, and use
neilb RAID0 to combine them together. This would give you
neilb adequate reliability and performance and still a large
neilb amount of storage space.

That sounds optimistic to me: the reason to do a RAID50 of
8x(5+1) can only be to have a single filesystem, else one could
have 8 distinct filesystems each with a subtree of the whole.
With a single filesystem the failure of any one of the 8 RAID5
components of the RAID0 will cause the loss of the whole lot.

So in the 47+1 case a loss of any two drives would lead to
complete loss; in the 8x(5+1) case only a loss of two drives in
the same RAID5 will.
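
To put rough numbers on that (counting two-drive combinations, and
assuming failures hit drives independently):

   pairs among 48 drives:                48 * 47 / 2 = 1128
   47+1:    every pair is fatal                      -> 1128 fatal pairs
   8x(5+1): only pairs within one RAID5 are fatal    -> 8 * (6 * 5 / 2) = 120

so roughly one random double failure in nine would still take out the
8x(5+1) arrangement, against every single one of them for 47+1.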

It does not sound like a great improvement to me (especially
considering the thoroughly inane practice of building arrays out
of disks of the same make and model taken out of the same box).

There are also modest improvements in the RMW strip size and in
the cost of a rebuild after a single drive loss. Probably the
reduction in the RMW strip size is the best improvement.

Anyhow, let's assume 0.5TB drives; with a 47+1 we get a single
23.5TB filesystem, and with 8*(5+1) we get a 20TB filesystem.
With current filesystem technology either size is worrying, for
example as to time needed for an 'fsck'.

In practice RAID5 beyond 3-4 drives seems only useful for almost
read-only filesystems where restoring from backups is quick and
easy, never mind the 47+1 case or the 8x(5+1) one, and I think
that giving some credit even to the latter arrangement is not
quite right...


Re: raid10: unfair disk load?

2007-12-25 Thread Bill Davidsen

Richard Scobie wrote:

Jon Nelson wrote:


My own tests on identical hardware (same mobo, disks, partitions,
everything) and same software, with the only difference being how
mdadm is invoked (the only changes here being level and possibly
layout) show that raid0 is about 15% faster on reads than the very
fast raid10, f2 layout. raid10,f2 is approx. 50% of the write speed of
raid0.


This more or less matches my testing.


Have you tested a stacked RAID 10 made up of 2-drive RAID1 arrays, 
striped together into a RAID0?


That is not raid10, that's raid1+0. See man md.


I have found this configuration to offer very good performance, at the 
cost of slightly more complexity.


It does; raid0 can be striped over many configurations, raid[156] being 
the most common.


--
Bill Davidsen [EMAIL PROTECTED]
 Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over... Otto von Bismarck 





Re: Raid over 48 disks

2007-12-25 Thread Bill Davidsen

Peter Grandi wrote:

On Wed, 19 Dec 2007 07:28:20 +1100, Neil Brown
[EMAIL PROTECTED] said:



[ ... what to do with 48 drive Sun Thumpers ... ]

neilb I wouldn't create a raid5 or raid6 on all 48 devices.
neilb RAID5 only survives a single device failure and with that
neilb many devices, the chance of a second failure before you
neilb recover becomes appreciable.

That's just one of the many problems; others are:

* If a drive fails, rebuild traffic is going to hit hard, with
  47 blocks being read in parallel to compute a new 48th.

* With a parity strip length of 48 it will be that much harder
  to avoid a read-modify-write, as it will be avoidable only for
  full-stripe writes covering all 47 data blocks, aligned on
  stripe boundaries. And reading 47 blocks to write one is going
  to be quite painful.

[ ... ]

neilb RAID10 would be a good option if you are happy with 24
neilb drives worth of space. [ ... ]

That sounds like the only feasible option (except, in most cases,
for the 3-drive one). Parity RAID does not scale much beyond 3-4
drives.

neilb Alternately, 8 6-drive RAID5s or 6 8-drive RAID6s, and use
neilb RAID0 to combine them together. This would give you
neilb adequate reliability and performance and still a large
neilb amount of storage space.

That sounds optimistic to me: the reason to do a RAID50 of
8x(5+1) can only be to have a single filesystem, else one could
have 8 distinct filesystems each with a subtree of the whole.
With a single filesystem the failure of any one of the 8 RAID5
components of the RAID0 will cause the loss of the whole lot.

So in the 47+1 case a loss of any two drives would lead to
complete loss; in the 8x(5+1) case only a loss of two drives in
the same RAID5 will.

It does not sound like a great improvement to me (especially
considering the thoroughly inane practice of building arrays out
of disks of the same make and model taken out of the same box).
  


Quality control just isn't so bad that "same box" makes a big 
difference, assuming that you have an appropriate number of hot spares 
online. Note that I said *big* difference; is there some clustering of 
failures? Some, but damn little. A few years ago I was working with 
multiple 6TB machines and 20+ 1TB machines, all using small, fast 
drives in RAID5E. I can't remember a case where a second drive failed before 
a rebuild was complete, and only one or two where there was a failure to 
degraded mode before the hot spare was replaced.


That said, RAID5E can typically rebuild a lot faster than an array with a 
dedicated hot-spare drive, at least for any given impact on performance. 
This undoubtedly reduced our exposure time.

There are also modest improvements in the RMW strip size and in
the cost of a rebuild after a single drive loss. Probably the
reduction in the RMW strip size is the best improvement.

Anyhow, let's assume 0.5TB drives; with a 47+1 we get a single
23.5TB filesystem, and with 8*(5+1) we get a 20TB filesystem.
With current filesystem technology either size is worrying, for
example as to time needed for an 'fsck'.
  


Given someone putting a typical filesystem full of small files 
on a big raid, I agree. But fsck with large files is pretty fast on a 
given filesystem (200GB files on a 6TB ext3, for instance), due to the 
small number of inodes in play. While the bitmap resolution is a factor, 
it scales pretty linearly; it's fsck with lots of files that gets really 
slow. And let's face it, the objective of raid is to avoid doing that fsck 
in the first place ;-)


--
Bill Davidsen [EMAIL PROTECTED]
 Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over... Otto von Bismarck 





Re: raid10 performance question

2007-12-25 Thread Peter Grandi
 On Tue, 25 Dec 2007 19:08:15 +,
 [EMAIL PROTECTED] (Peter Grandi) said:

[ ... ]

 It's the raid10,f2 *read* performance in degraded mode that is
 strange - I get almost exactly 50% of the non-degraded mode
 read performance. Why is that?

 [ ... ] the mirror blocks have to be read from the inner
 cylinders of the next disk, which are usually a lot slower
 than the outer ones. [ ... ]

Just to be complete, there is of course another issue that
affects sustained writes too, which is extra seeks. If disk B
fails the situation becomes:

       DISK
    A  X  C  D

    1  X  3  4      (outer half)
    .  .  .  .
    .  .  .  .
    .  .  .  .
    ------------
    4  X  2  3      (inner half)
    .  .  .  .
    .  .  .  .
    .  .  .  .

Not only must block 2 be read from an inner cylinder, but to
read block 3 there must be a seek back to an outer cylinder on
the same disk, which is the same well-known issue seen when doing
sustained writes with RAID10 'f2'.


Re: [PATCH 001 of 7] md: Support 'external' metadata for md arrays.

2007-12-25 Thread Andrew Morton
On Fri, 14 Dec 2007 17:26:08 +1100 NeilBrown [EMAIL PROTECTED] wrote:

 +	if (strncmp(buf, "external:", 9) == 0) {
 +		int namelen = len-9;
 +		if (namelen >= sizeof(mddev->metadata_type))
 +			namelen = sizeof(mddev->metadata_type)-1;
 +		strncpy(mddev->metadata_type, buf+9, namelen);
 +		mddev->metadata_type[namelen] = 0;
 +		if (namelen && mddev->metadata_type[namelen-1] == '\n')
 +			mddev->metadata_type[--namelen] = 0;
 +		mddev->persistent = 0;
 +		mddev->external = 1;

size_t would be a more appropriate type for `namelen'.
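
I.e., something along these lines (only sketching the suggested change,
and assuming `len' is itself an unsigned count):

 +	if (strncmp(buf, "external:", 9) == 0) {
 +		size_t namelen = len-9;	/* was: int namelen */
 +		if (namelen >= sizeof(mddev->metadata_type))
 +			namelen = sizeof(mddev->metadata_type)-1;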


Re: [PATCH 004 of 7] md: Allow devices to be shared between md arrays.

2007-12-25 Thread Andrew Morton
On Fri, 14 Dec 2007 17:26:28 +1100 NeilBrown [EMAIL PROTECTED] wrote:

 +	mddev_unlock(rdev->mddev);
 +	ITERATE_MDDEV(mddev, tmp) {
 +		mdk_rdev_t *rdev2;
 +
 +		mddev_lock(mddev);
 +		ITERATE_RDEV(mddev, rdev2, tmp2)
 +			if (test_bit(AllReserved, &rdev2->flags) ||
 +			    (rdev->bdev == rdev2->bdev &&
 +			     rdev != rdev2 &&
 +			     overlaps(rdev->data_offset, rdev->size,
 +				      rdev2->data_offset, rdev2->size))) {
 +				overlap = 1;
 +				break;
 +			}
 +		mddev_unlock(mddev);
 +		if (overlap) {
 +			mddev_put(mddev);
 +			break;
 +		}
 +	}

eww, ITERATE_MDDEV() and ITERATE_RDEV() are an eyesore.

for_each_mddev() and for_each_rdev() would at least mean the reader doesn't
need to check the implementation when wondering what that `break' is
breaking from.
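
Something along these lines would do, say (purely illustrative, just
wrapping the existing macros rather than reimplementing them):

 /* hypothetical aliases, only to illustrate the naming suggestion */
 #define for_each_mddev(mddev, tmp)		ITERATE_MDDEV(mddev, tmp)
 #define for_each_rdev(mddev, rdev, tmp)	ITERATE_RDEV(mddev, rdev, tmp)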

  #define	In_sync		2	/* device is in_sync with rest of array */
  #define	WriteMostly	4	/* Avoid reading if at all possible */
  #define	BarriersNotsupp	5	/* BIO_RW_BARRIER is not supported */
 +#define	AllReserved	6	/* If whole device is reserved for

The naming style here is inconsistent.

A task for the keen would be to convert these to an enum and add some
namespacing prefix to them.  
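
For instance (illustrative only; the exact names and prefix would be up
to the author):

 enum mdk_rdev_flags {
 	MdRdevInSync		= 2,	/* device is in_sync with rest of array */
 	MdRdevWriteMostly	= 4,	/* avoid reading if at all possible */
 	MdRdevBarriersNotsupp	= 5,	/* BIO_RW_BARRIER is not supported */
 	MdRdevAllReserved	= 6,	/* whole device is reserved */
 };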