Re: RAID5 to RAID6 reshape?

2008-02-24 Thread Peter Grandi
 On Sat, 23 Feb 2008 21:40:08 +0100, Nagilum
 [EMAIL PROTECTED] said:

[ ... ]

 * Doing unaligned writes on a 13+1 or 12+2 is catastrophically
 slow because of the RMW cycle. This is of course independent
 of how one got to the something like 13+1 or a 12+2.

nagilum Changing a single byte in a 2+1 raid5 or a 13+1 raid5
nagilum requires exactly two 512byte blocks to be read and
nagilum written from two different disks. Changing two bytes
nagilum which are unaligned (the last and first byte of two
nagilum consecutive stripes) doubles those figures, but more
nagilum disks are involved.

Here you are using the astute misdirection of talking about
unaligned *byte* *updates* when the issue is unaligned
*stripe* *writes*.

If one used your scheme to write a 13+1 stripe one block at a
time, it would take 26R+26W operations (about half of which
could be cached) instead of the 14W required when doing aligned
stripe writes, which is what good file systems try to achieve.
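
As an aside, the way a file system gets a chance to do that is by
being told the array geometry; a minimal sketch, where the device
name, the 64KiB chunk size and the 13+1 shape are assumptions for
illustration only:

  # Tell XFS the RAID5 geometry: su = chunk size, sw = number of
  # data disks, so a full stripe here is 13 x 64KiB.
  mkfs.xfs -d su=64k,sw=13 /dev/md0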

Well, 26R+26W may be a caricature, but the problem is that even
if one bunches updates of N blocks into a single read
N blocks+parity, write N blocks+parity operation, it is still
RMW, just a smaller RMW than a full stripe RMW.
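
A back-of-the-envelope sketch of that accounting, using the same
simplified model as the tallies below (read the blocks to be
overwritten plus parity, write them back, versus writing a full
stripe outright; no caching or alternative parity paths assumed):

  # Block counts for writing W data blocks into one N+1 RAID5 stripe.
  stripe_write_cost() {
      N=$1; W=$2
      if [ "$W" -eq "$N" ]; then
          echo "N=$N W=$W  aligned full stripe: read 0, write $((N + 1))"
      else
          echo "N=$N W=$W  partial stripe RMW:  read $((W + 1)), write $((W + 1))"
      fi
  }
  stripe_write_cost 13 13   # what a good file system aims for: 0R + 14W
  stripe_write_cost 13 4    # a 4-block bunch on a 13+1: 5R + 5W
  stripe_write_cost 2 2     # a full 2+1 stripe: 0R + 3W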

And reading before writing can kill write performance, because
it is a two-pass algorithm and a two-pass algorithm is pretty
bad news for disk work, and even more so, given most OS and disk
elevator algorithms, for one pass of reads and one of writes
dependent on the reads.

But enough of talking about absurd cases, let's do a good clear
example of why a 13+1 is bad bad bad when doing unaligned writes.

Consider writing just 15 blocks, in 4+4+4+3 bunches, to a 2+1
and to a 13+1, starting with block 0 (so aligned start,
unaligned bunch length, unaligned total length), a random case
but quite illustrative:

  2+1:
00 01 P1 03 04 P2 06 07 P3 09 10 P4
00 01    02 03    04 05    06 07
-- -- ** -- -- ** -- -- ** -- -- **
12 13 P5 15 16 P6 18 19 P7 21 22 P8
08 09    10 11    12 13    14
-- -- ** -- -- ** -- -- ** --    **

write D00 D01 DP1
write D03 D04 DP2

write D06 D07 DP3
write D09 D10 DP4

write D12 D13 DP5
write D15 D16 DP6

write D18 D19 DP7
read  D21 DP8
write D21 DP8

Total:
  IOP: 01 reads, 08 writes
  BLK: 02 reads, 23 writes
  XOR: 28 reads, 15 writes

 13+1:
00 01 02 03 04 05 06 07 08 09 10 11 12 P1
00 01 02 03 04 05 06 07 08 09 10 11 12
-- -- -- -- -- -- -- -- -- -- -- -- -- **

14 15 16 17 18 19 20 21 22 23 24 25 26 P2
13 14
-- --                                  **

read  D00 D01 D02 D03 DP1
write D00 D01 D02 D03 DP1

read  D04 D05 D06 D07 DP1
write D04 D05 D06 D07 DP1

read  D08 D09 D10 D11 DP1
write D08 D09 D10 D11 DP1

read  D12 DP1 D14 D15 DP2
write D12 DP1 D14 D15 DP2

Total:
  IOP: 04 reads, 04 writes
  BLK: 20 reads, 20 writes
  XOR: 34 reads, 10 writes

The short stripe size means that one does not need to RMW in
many cases, just W, and this despite the much higher redundancy
of the 2+1. It also means that there are lots of parity blocks
to compute and write. With a 4-block operation length a 3+1, or
even more so a 4+1, would be flattered here, but I wanted to
exemplify two extremes.

The narrow parallelism and thus short stripe length of the 2+1
means that far fewer blocks get transferred because there is
almost no RMW, but it takes 9 IOPs while the 13+1 does one fewer
at 8 (wider parallelism); but then the 2+1 IOPs are mostly
back-to-back write pairs, while the 13+1 IOPs are read-rewrite
pairs, which is a significant disadvantage (often greatly
underestimated).

Never mind that the number of IOPs is almost the same despite
the large difference in width, and that with the same disks as a
13+1 one can build something like four 2+1/3+1 arrays, thus
gaining a lot of parallelism across threads, if there is any to
be obtained. And if one really wants to write long stripes, one
should of course use RAID10, not long stripes with a single (or
two) parity blocks.

In the above example the length of the transfer is not aligned
with either the 2+1 or 13+1 stripe length; if the starting block
is unaligned too, then things look worse for 2+1, but that is a
pathologically bad case (and at the same time a pathologically
good case for 13+1):

  2+1:
00 01 P1|03 04 P2|06 07 P3|09 10 P4|12
   00   |01 02   |03 04   |05 06   |07
   -- **|-- -- **|-- -- **|-- -- **|--
13 P5|15 16 P6|18 19 P7|21 22 P8
08   |09 10   |11 12   |13 14
-- **|-- -- **|-- -- **|-- -- **

read  D01 DP1
read  D06 DP3
write D01 DP1
write D03 D04 DP2
write D06 DP3

read  D07 DP3
read  D12 DP5
write D07 DP3
write D09 D10 DP4
write D12 DP5

read  D13 DP5

Re: RAID5 to RAID6 reshape?

2008-02-22 Thread Peter Grandi
[ ... ]

 * Suppose you have a 2+1 array which is full. Now you add a
 disk and that means that almost all free space is on a single
 disk. The MD subsystem has two options as to where to add
 that lump of space, consider why neither is very pleasant.

 No, only one, at the end of the md device and the free space
 will be evenly distributed among the drives.

Not necessarily, however let's assume that happens.

Since the free space will have a different distribution, the
used space will too, so the physical layout will evolve like
this when creating a 3+1 from a 2+1+1:

   2+1+1         3+1
  a b c d       a b c d
  -------       -------
  0 1 P F       0 1 2 Q     P: old parity
  P 2 3 F       Q 3 4 5     F: free block
  4 P 5 F       6 Q 7 8     Q: new parity
  . . . .       . . . .
                F F F F

How will the free space become evenly distributed among the
drives? Well, it sounds like 3 drives will be read (2 if not
checking parity) and 4 drives written; while on a 3+1 a mere
parity rebuild only writes to 1 drive at a time, even if it
reads from 3, and a recovery reads from 3 and writes to 2
drives.

Is that a pleasant option? To me it looks like begging for
trouble. For one thing, the likelihood of failure is highest
when a lot of disks start running together doing much the same
things. RAID is based on the idea of uncorrelated failures...

  An aside: in my innocence I realized only recently that online
  redundancy and uncorrelated failures are somewhat contradictory.

Never mind that since one is changing the layout, an
interruption in the process may leave the array unusable, even
if with no loss of data, even if recent MD versions mostly cope;
from a recent 'man' page for 'mdadm':

 «Increasing the number of active devices in a RAID5 is much
  more effort.  Every block in the array will need to be read
  and written back to a new location.»

  From 2.6.17, the Linux Kernel is able to do this safely,
  including restarting an interrupted reshape.

  When relocating the first few stripes on a raid5, it is not
  possible to keep the data on disk completely consistent and
  crash-proof. To provide the required safety, mdadm disables
  writes to the array while this critical section is reshaped,
  and takes a backup of the data that is in that section.

  This backup is normally stored in any spare devices that the
  array has, however it can also be stored in a separate file
  specified with the --backup-file option.»

Since the reshape reads from N drives *and then writes* to N+1
drives at almost the same time, things are going to be a bit
slower than a mere rebuild or recover: each stripe will be read
from the N existing drives and then written back to N+1 *while
the next stripe is being read from N* (or not...).
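
For reference, the sort of command sequence involved is roughly
this (a sketch only; the device names, the added disk and the
backup path are made up):

  # Add a disk, then grow the RAID5 from 3 to 4 active devices; the
  # critical-section backup goes to a file on a different filesystem.
  mdadm /dev/md0 --add /dev/sde1
  mdadm --grow /dev/md0 --raid-devices=4 --backup-file=/root/md0-grow.bak
  # Progress (and restart after an interruption) shows up in /proc/mdstat.
  cat /proc/mdstat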

 * How fast is doing unaligned writes with a 13+1 or a 12+2
 stripe? How often is that going to happen, especially on an
 array that started as a 2+1?

 They are all the same speed with raid5 no matter what you
 started with.

But I asked two questions that are not how does the speed
differ. The answers to the two questions I asked are very
different from the same speed (they are very slow and
rather often):

* Doing unaligned writes on a 13+1 or 12+2 is catastrophically
  slow because of the RMW cycle. This is of course independent
  of how one got to the something like 13+1 or a 12+2.

* Unfortunately the frequency of unaligned writes *does* usually
  depend on how dementedly one got to the 13+1 or 12+2 case:
  because a filesystem that lays out files so that misalignment
  is minimised with a 2+1 stripe just about guarantees that when
  one switches to a 3+1 stripe all previously written data is
  misaligned, and so on -- and never mind that every time one
  adds a disk a reshape is done that shuffles stuff around.

There is a saving grace as to the latter point: many programs
don't overwrite files in place but truncate and recreate them
(which is not so good in general, but helps in this case).
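
At least with XFS the declared geometry can be brought back in
line with the new stripe after a reshape (a hedged sketch; the
numbers assume 64KiB chunks and a 2+1 grown to a 3+1, and only
newly written data benefits, since old extents stay where they
are):

  # sunit/swidth are in 512-byte sectors: 64KiB chunk = 128 sectors,
  # 3 data disks => swidth = 384.
  umount /mnt/array
  mount -o sunit=128,swidth=384 /dev/md0 /mnt/array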

 You read two blocks and you write two blocks. (not even chunks
 mind you)

But we are talking about a *reshape* here and to a RAID5. If you
add a drive to a RAID5 and redistribute in the obvious way then
existing stripes have to be rewritten as the periodicity of the
parity changes from every N to every N+1.

 * How long does it take to rebuild parity with a 13+1 array
 or a 12+2 array in case of single disk failure? What happens
 if a disk fails during rebuild?

 Depends on how much data the controllers can push. But at
 least with my hpt2320 the limiting factor is the disk speed

But here we are on the Linux RAID mailing list and we are
talking about software RAID. With software RAID a reshape with
14 disks needs to shuffle around the *host bus* (not merely the
host adapter as with hw RAID) almost 5 times as much data as
with 3 (say 14x80MB/s ~= 1GB/s sustained in both directions at
the outer tracks). The host adapter also has to be able to run
14 operations in parallel.

It can be done -- it is just somewhat expensive, 

Re: How many drives are bad?

2008-02-21 Thread Peter Grandi
 On Tue, 19 Feb 2008 14:25:28 -0500, Norman Elton
 [EMAIL PROTECTED] said:

[ ... ]

normelton The box presents 48 drives, split across 6 SATA
normelton controllers. So disks sda-sdh are on one controller,
normelton etc. In our configuration, I run a RAID5 MD array for
normelton each controller, then run LVM on top of these to form
normelton one large VolGroup.

Pure genius! I wonder how many Thumpers have been configured in
this well thought out way :-).

BTW, just to be sure -- you are running LVM in default linear
mode over those 6 RAID5s aren't you?
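
One way to check, as a sketch (the volume group name is made up):

  # 'segtype' shows linear vs. striped for each segment and 'devices'
  # shows which of the RAID5 PVs each extent range actually lives on.
  lvs -o +segtype,devices VolGroup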

normelton I found that it was easiest to setup ext3 with a max
normelton of 2TB partitions. So running on top of the massive
normelton LVM VolGroup are a handful of ext3 partitions, each
normelton mounted in the filesystem.

Uhm, assuming 500GB drives each RAID set has a capacity of
3.5TB, and odds are that a bit over half of those 2TB volumes
will straddle array boundaries. Such attention to detail is
quite remarkable :-).

normelton This less than ideal (ZFS would allow us one large
normelton partition),

That would be another stroke of genius! (especially if you were
still using a set of underlying RAID5s instead of letting ZFS do
its RAIDZ thing). :-)

normelton but we're rewriting some software to utilize the
normelton multi-partition scheme.

Good luck!


Re: LVM performance (was: Re: RAID5 to RAID6 reshape?)

2008-02-21 Thread Peter Grandi
 This might be related to raid chunk positioning with respect
 to LVM chunk positioning. If they interfere there indeed may
 be some performance drop. Best to make sure that those chunks
 are aligned together.

 Interesting. I'm seeing a 20% performance drop too, with default
 RAID and LVM chunk sizes of 64K and 4M, respectively. Since 64K
 divides 4M evenly, I'd think there shouldn't be such a big
 performance penalty. [ ... ]

Those are as such not very meaningful. What matters most is
whether the starting physical address of each logical volume
extent is stripe aligned (and whether the filesystem makes use
of that) and then the stripe size of the parity RAID set, not
the chunk sizes in themselves.
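
A quick way to check the former, as a sketch (the device name and
the 4 x 64KiB stripe are assumptions):

  # The PV data area starts at pe_start; for stripe-aligned extents it
  # should be a multiple of the full stripe (here 4 x 64KiB = 256KiB),
  # and so should the extent size reported by vgdisplay.
  pvs -o +pe_start --units k /dev/md0
  vgdisplay -v | grep 'PE Size'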

I am often surprised by how many people who use parity RAID
don't seem to realize the crucial importance of physical stripe
alignment, but I am getting used to it.

Because of stripe alignment it is usually better to build parity
arrays on top of partitions or volumes than vice versa, as it is
often more difficult to align the start of a partition or volume
to the underlying stripes than the reverse.

But then those who understand the vital importance of stripe
aligned writes for parity RAID often avoid using parity RAID
:-).


Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again

2008-02-19 Thread Peter Grandi
 What sort of tools are you using to get these benchmarks, and can I
 used them for ext3?

The only simple tools that I have found that give
semi-reasonable numbers, avoiding most of the many pitfalls of
storage speed testing (almost all storage benchmarks I see are
largely meaningless), are recent versions of GNU 'dd' when used
with the 'fdatasync' and 'direct' flags, and Bonnie 1.4 with the
options '-u -y -o_direct', both used with a moderately large
volume of data (dependent on the size of the host adapter cache
if any).

In particular one must be very careful when using older versions
of 'dd' or Bonnie, or using bonnie++, iozone (unless with -U or
-I), ...
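
For instance, with the tools mentioned above (a sketch; the sizes
and target paths are placeholders):

  # GNU dd writing through O_DIRECT, bypassing the page cache:
  dd if=/dev/zero of=/r1/testfile bs=1M count=4096 oflag=direct
  # or through the cache, but not finishing until data reaches the disks:
  dd if=/dev/zero of=/r1/testfile bs=1M count=4096 conv=fdatasync
  # Bonnie 1.4 with the flags mentioned above:
  bonnie -u -y -o_direct -d /r1 -s 4096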

[ ... ]

  for i in 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 65536
  do
  cd /
  umount /r1
  mdadm -S /dev/md3
  mdadm --create --assume-clean --verbose /dev/md3 --level=5 \
    --raid-devices=10 --chunk=$i --run /dev/sd[c-l]1
  /etc/init.d/oraid.sh # to optimize my raid stuff
  mkfs.xfs -f /dev/md3
  mount /dev/md3 /r1 -o logbufs=8,logbsize=262144
  /usr/bin/time -f %E -o ~/$i=chunk.txt \
    bash -c 'dd if=/dev/zero of=/r1/bigfile bs=1M count=10240; sync'
  done

I would not consider the results from this as particularly
meaningful (that 'sync' only helps a little bit), even for
large sequential write testing. One would also have to document
the elevator used and the flusher daemon parameters.

Let's say that storage benchmarking is a lot more difficult and
subtle than it looks to the untrained eye.

It is just so much easier to use Bonnie 1.4 (with the flags
mentioned above) as a first (and often last) approximation (but
always remember to mention which elevator was in use).
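
Recording the elevator is a one-liner, for example (substitute the
member disks of the array):

  # The scheduler in use is shown in brackets, e.g. [cfq] or [deadline].
  for d in /sys/block/sd[c-l]/queue/scheduler; do echo "$d: $(cat $d)"; done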


Re: RAID5 to RAID6 reshape?

2008-02-18 Thread Peter Grandi
 On Sun, 17 Feb 2008 07:45:26 -0700, Conway S. Smith
 [EMAIL PROTECTED] said:

[ ... ]

beolach Which part isn't wise? Starting w/ a few drives w/ the
beolach intention of growing; or ending w/ a large array (IOW,
beolach are 14 drives more than I should put in 1 array  expect
beolach to be safe from data loss)?

Well, that rather depends on what is your intended data setup and
access patterns, but the above are all things that may be unwise
in many cases. The intended use mentioned below does not require
a single array for example.

However, while doing the above may make sense in *some*
situations, I reckon that the number of those situations is
rather small.

Consider for example the answers to these questions:

* Suppose you have a 2+1 array which is full. Now you add a disk
  and that means that almost all free space is on a single disk.
  The MD subsystem has two options as to where to add that lump
  of space, consider why neither is very pleasant.

* How fast is doing unaligned writes with a 13+1 or a 12+2
  stripe? How often is that going to happen, especially on an
  array that started as a 2+1?

* How long does it take to rebuild parity with a 13+1 array or a
  12+2 array in case of a single disk failure? What happens if a
  disk fails during rebuild?

* When you have 13 drives and you add the 14th, how long does
  that take? What happens if a disk fails during rebuild??

The points made by http://WWW.BAARF.com/ apply too.

beolach [ ... ] media files that would typically be accessed
beolach over the network by MythTV boxes.  I'll also be using
beolach it as a sandbox database/web/mail server. [ ... ] most
beolach important stuff backed up, [ ... ] some gaming, which
beolach is where I expect performance to be most noticeable.

To me that sounds like something that could well be split across
multiple arrays, rather than risking repeatedly extending a
single array, and then risking a single large array.

beolach Well, I was reading that LVM2 had a 20%-50% performance
beolach penalty, which in my mind is a really big penalty. But I
beolach think those numbers where from some time ago, has the
beolach situation improved?

LVM2 relies on DM, which is not much slower than say 'loop', so
its overhead is almost insignificant for most people.

But even if the overhead may be very very low, DM/LVM2/EVMS seem
to me to have very limited usefulness (e.g. Oracle tablespaces,
and there are contrary opinions as to that too). In your stated
applications it is hard to see why you'd want to split your
arrays into very many block devices or why you'd want to resize
them.

beolach And is a 14 drive RAID6 going to already have enough
beolach overhead that the additional overhead isn't very
beolach significant? I'm not sure why you say it's amusing.

Consider the questions above. Parity RAID has issues, extending
an array has issues, and the idea of extending a parity RAID
both massively and in several steps looks very amusing to me.

beolach [ ... ] The other reason I wasn't planning on using LVM
beolach was because I was planning on keeping all the drives in
beolach the one RAID. [... ]

Good luck :-).


Re: RAID5 to RAID6 reshape?

2008-02-17 Thread Peter Grandi
 On Sat, 16 Feb 2008 20:58:07 -0700, Beolach
 [EMAIL PROTECTED] said:

beolach [ ... ] start w/ 3 drives in RAID5, and add drives as I
beolach run low on free space, eventually to a total of 14
beolach drives (the max the case can fit).

Like for so many other posts to this list, all that is
syntactically valid is not necessarily the same thing as that
which is wise.

beolach But when I add the 5th or 6th drive, I'd like to switch
beolach from RAID5 to RAID6 for the extra redundancy.

Again, what may be possible is not necessarily what may be wise.

In particular it seems difficult to discern which usage such
arrays would be put to. There might be a bit of difference
between a giant FAT32 volume containing song lyrics files or an
XFS filesystem with a collection of 500GB tomography scans in
them cached from a large tape backup system.

beolach I'm also interested in hearing people's opinions about
beolach LVM / EVMS.

They are yellow, and taste of vanilla :-). To say something more
specific is difficult without knowing what kind of requirement
they may be expected to satisfy.

beolach I'm currently planning on just using RAID w/out the
beolach higher level volume management, as from my reading I
beolach don't think they're worth the performance penalty, [
beolach ... ]

Very amusing that someone who is planning to grow a 3 drive
RAID5 into a 14 drive RAID6 worries about the DM performance
penalty.


Re: striping of a 4 drive raid10

2008-01-27 Thread Peter Grandi
 On Sun, 27 Jan 2008 20:33:45 +0100, Keld Jørn Simonsen
 [EMAIL PROTECTED] said:

keld Hi I have tried to make a striping raid out of my new 4 x
keld 1 TB SATA-2 disks. I tried raid10,f2 in several ways:

keld 1: md0 = raid10,f2 of sda1+sdb1, md1 = raid10,f2 of sdc1+sdd1,
keld    md2 = raid0 of md0+md1
keld 2: md0 = raid0 of sda1+sdb1, md1 = raid0 of sdc1+sdd1,
keld    md2 = raid01,f2 of md0+md1
keld 3: md0 = raid10,f2 of sda1+sdb1, md1 = raid10,f2 of sdc1+sdd1,
keld    chunksize of md0 = md1 = 128 KB, md2 = raid0 of md0+md1, chunksize = 256 KB
keld 4: md0 = raid0 of sda1+sdb1, md1 = raid0 of sdc1+sdd1,
keld    chunksize of md0 = md1 = 128 KB, md2 = raid01,f2 of md0+md1, chunksize = 256 KB

These stacked RAID levels don't make a lot of sense.

keld 5: md0= raid10,f4 of sda1+sdb1+sdc1+sdd1

This also does not make a lot of sense. Why have four mirrors
instead of two?

Instead, try 'md0 = raid10,f2' for example. The first mirror
will be striped across the outer half of all four drives, and
the second mirrors will be rotated into the inner half of each
drive.

Which of course means that reads will be quite quick, but writes
and degraded operation will be slower.
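
A minimal sketch of that suggestion (device names and chunk size
are placeholders, not a recommendation):

  # One single-layer RAID10 with the 'far 2' layout over all four drives.
  mdadm --create /dev/md0 --level=10 --layout=f2 --chunk=256 \
        --raid-devices=4 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1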

Consider this post for more details:

  http://www.spinics.net/lists/raid/msg18130.html

[ ... ]


Re: New XFS benchmarks using David Chinner's recommendations for XFS-based optimizations.

2007-12-31 Thread Peter Grandi
 Why does mdadm still use 64k for the default chunk size?

 Probably because this is the best balance for average file
 sizes, which are smaller than you seem to be testing with?

Well, average file sizes relate less to chunk sizes than access
patterns do. Single threaded sequential reads with smaller block
sizes tend to perform better with smaller chunk sizes, for
example. File sizes do influence the access patterns seen by the
disks a bit.

The goals of a chunk size choice are to spread a single
''logical'' (application or filesystem level) operation over as
many arms as possible while keeping rotational latency low, to
minimize arm movement, and to minimize arm contention among
different threads.

Thus the tradeoffs influencing chunk size are about the
sequential vs. random nature of reads or writes, how many blocks
are involved in a single ''logical'' operation, and how many
threads are accessing the array.

The goal here is not to optimize the speed of the array, but the
throughput and/or latency of the applications using it.

A good idea is to consider the two extremes: a chunk size of 1
sector and a chunk size of the whole disk (or perhaps more
interestingly 1/2 disk).

For example, consider a RAID0 of 4 disks ('a', 'b', 'c', 'd')
with a chunk size of 8 sectors.

To read the first 16 chunks or 128 sectors of the array these
sector read operations ['get(device,first,last)'] have to be
issued:

  00-31:   get(a,0,7)    get(b,0,7)    get(c,0,7)    get(d,0,7)
  32-63:      get(a,8,15)   get(b,8,15)   get(c,8,15)   get(d,8,15)
  64-95:         get(a,16,23)  get(b,16,23)  get(c,16,23)  get(d,16,23)
  96-127:           get(a,24,31)  get(b,24,31)  get(c,24,31)  get(d,24,31)

I have indented the lists to show the increasing offset into
each block device.
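
The arithmetic behind those lists, as a small sketch (plain RAID0
chunk mapping with the chunk size and disk count above; nothing
MD-specific is implied):

  # Map an array sector to (disk, sector-within-disk) for a RAID0 of
  # NDISKS disks with CHUNK sectors per chunk.
  CHUNK=8; NDISKS=4
  map() {
      c=$(( $1 / CHUNK ))
      d=$(( c % NDISKS ))
      s=$(( (c / NDISKS) * CHUNK + $1 % CHUNK ))
      echo "array sector $1 -> disk $d, sector $s"
  }
  map 0; map 31; map 32; map 127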

Now, the big questions here are all about the interval between
these operations, that is how large the logical operations are
and how much they cluster in time and space.

For example in the above sequence it matters whether clusters of
operations involve fewer than 32 sectors or not, and what the
likely interval is between clusters generated by different
concurrent applications (consider rotational latency and the
likelihood of the arm being moved between successive clusters).

So that space/time clustering depends more on how applications
process their data and how many applications concurrently access
the array, and whether they are reading or writing.

The latter point involves an exceedingly important asymmetry
that is often forgotten: an application read can only complete
when the last block is read, while a write can complete as soon
as it is issued. So the time clustering of sector reads depends
on how long ''logical'' reads are as well as how long the
interval between them is.

So an application that issues frequent small reads rather than
infrequent large ones may work best with a small chunk size.

Not much related to distribution of file sizes, unless this
influences the space/time clustering of application issued
operations...

In general I like smaller chunk sizes than larger chunk sizes,
because the latter work well only in somewhat artificial cases
like simple-minded benchmarks.

In particular if one uses parity-based arrays (not a good idea
in general...), small chunk sizes (as well as small stripe
sizes) give a better chance of reducing the frequency of RMW.

Counter to this is that the Linux IO queueing subsystem
(elevators etc.) perhaps does not always take advantage of
parallelizable operations across disks as much as it could, and
that there can be bandwidth bottlenecks (e.g. the PCI bus).



Re: raid10 performance question

2007-12-25 Thread Peter Grandi
 On Sun, 23 Dec 2007 08:26:55 -0600, Jon Nelson
 [EMAIL PROTECTED] said:

 I've found in some tests that raid10,f2 gives me the best I/O
 of any raid5 or raid10 format.

Mostly, depending on the type of workload. Anyhow, in general
most forms of RAID10 are cool, and handle disk losses better,
and so on.

 However, the performance of raid10,o2 and raid10,n2 in
 degraded mode is nearly identical to the non-degraded mode
 performance (for me, this hovers around 100MB/s).

You don't say how many drives you have got, but this may suggest
that your array transfers are limited by the PCI host bus speed.

 raid10,f2 has degraded mode performance, writing, that is
 indistinguishable from it's non-degraded mode performance

 It's the raid10,f2 *read* performance in degraded mode that is
 strange - I get almost exactly 50% of the non-degraded mode
 read performance. Why is that?

Well, the best description I found of the odd Linux RAID10 modes
is here:

  http://en.Wikipedia.org/wiki/Non-standard_RAID_levels#Linux_MD_RAID_10

The key here is:

   The driver also supports a far layout where all the drives
are divided into f sections.

Now when there are two sections as in 'f2', each block will be
written to a block in the first half of one disk and to the
second half of the next disk.

Consider this layout for the first 4 blocks on 2x2 layout
compared to the standard layout:

     DISK             DISK
   A B C D          A B C D

   1 2 3 4          1 1 2 2
   . . . .          3 3 4 4
   . . . .
   . . . .
   -------
   4 1 2 3
   . . . .
   . . . .
   . . . .

This means that with the far layout one can read blocks 1,2,3,4
at the same speed as a RAID0 on the outer cylinders of each
disk; but if one of the disks fails, the mirror blocks have to
be read from the inner cylinders of the next disk, which are
usually a lot slower than the outer ones.

Now, there is a very interesting detail here: one idea for
getting a fast array is to make it out of large high-density
drives and just use the outer cylinders of each drive, thus at
the same time having a much smaller range of arm travel and
higher transfer rates.

The 'f2' layout means that (until a drive fails) for all reads
and for short writes MD is effectively using just the outer
half of each drive, *as well as* what is effectively a RAID0
layout.

  Note that the sustained writing speed of 'f2' is going to be
  the same *across the whole capacity* of the RAID, while the
  sustained write speed of an 'n2' layout will be higher at the
  beginning and lower at the end, just like for a single disk.

Interesting, I hadn't realized that, even if I am keenly aware
of the non-uniform speeds of disks across cylinders.


Re: raid10 performance question

2007-12-25 Thread Peter Grandi
 On Tue, 25 Dec 2007 19:08:15 +,
 [EMAIL PROTECTED] (Peter Grandi) said:

[ ... ]

 It's the raid10,f2 *read* performance in degraded mode that is
 strange - I get almost exactly 50% of the non-degraded mode
 read performance. Why is that?

 [ ... ] the mirror blocks have to be read from the inner
 cylinders of the next disk, which are usually a lot slower
 than the outer ones. [ ... ]

Just to be complete, there is of course the other issue that
affects sustained writes too, which is extra seeks. If disk B
fails the situation becomes:

DISK
   A X C D

   1 X 3 4
   . . . .
   . . . .
   . . . .
   ---
   4 X 2 3   
   . . . .
   . . . .
   . . . .

Not only must block 2 be read from an inner cylinder, but to
read block 3 there must be a seek to an outer cylinder on the
same disk. Which is the same well known issue when doing
sustained writes with RAID10 'f2'.


Re: raid6 check/repair

2007-12-04 Thread Peter Grandi
[ ... on RAID1, ... RAID6 error recovery ... ]

tn The use case for the proposed 'repair' would be occasional,
tn low-frequency corruption, for which many sources can be
tn imagined:

tn Any piece of hardware has a certain failure rate, which may
tn depend on things like age, temperature, stability of
tn operating voltage, cosmic rays, etc. but also on variations
tn in the production process.  Therefore, hardware may suffer
tn from infrequent glitches, which are seldom enough, to be
tn impossible to trace back to a particular piece of equipment.
tn It would be nice to recover gracefully from that.

What has this got to do with RAID6 or RAID in general? I have
been following this discussion with a sense of bewilderment as I
have started to suspect that parts of it are based on a very
large misunderstanding.

tn Kernel bugs or just plain administrator mistakes are another
tn thing.

The biggest administrator mistakes are lack of end-to-end checking
and backups. Those that don't have them wish their storage systems
could detect and recover from arbitrary and otherwise undetected
errors (but see below for bad news on silent corruptions).

tn But also the case of power-loss during writing that you have
tn mentioned could profit from that 'repair': With heterogeneous
tn hardware, blocks may be written in unpredictable order, so
tn that in more cases graceful recovery would be possible with
tn 'repair' compared to just recalculating parity.

Redundant RAID levels are designed to recover only from _reported_
errors that identify precisely where the error is. Recovering from
random block writing is something that seems to me to be quite
outside the scope of a low level virtual storage device layer.

ms I just want to give another suggestion. It may or may not be
ms possible to repair inconsistent arrays but in either way some
ms code there MUST at least warn the administrator that
ms something (may) went wrong.

tn Agreed.

That sounds instead quite extraordinary to me, because it is not
clear how to define ''inconsistency'' in the general case, never
mind detect it reliably, and never mind, once it is found,
knowing how to determine which are the good data bits and which
are the bad.

Now I am starting to think that this discussion is based on the
curious assumption that storage subsystems should solve the so
called ''byzantine generals'' problem, that is to operate reliably
in the presence of unreliable communications and storage.

ms I had an issue once where the chipset / mainboard was broken
ms so on one raid1 array I have diferent data was written to the
ms disks occasionally [ ... ]

Indeed. Some links from a web search:

  http://en.Wikipedia.org/wiki/Byzantine_Fault_Tolerance
  http://pages.CS.Wisc.edu/~sschang/OS-Qual/reliability/byzantine.htm
  http://research.Microsoft.com/users/lamport/pubs/byz.pdf

ms and linux-raid / mdadm did not complain or do anything.

The mystic version of Linux-RAID is in psi-test right now :-).


To me RAID does not seem the right abstraction level to deal with
this problem; and perhaps the file system level is not either,
even if ZFS tries to address some of the problem.

However there are ominous signs that the storage version of the
Byzantine generals problem is happening in particularly nasty
forms. For example as reported in this very very scary paper:

  
https://InDiCo.DESY.DE/contributionDisplay.py?contribId=65sessionId=42confId=257

where some of the causes have been apparently identified recently,
see slides 11, 12 and 13:

  
http://InDiCo.FNAL.gov/contributionDisplay.py?contribId=44amp;sessionId=15amp;confId=805

So I guess that end-to-end verification will have to become more
common, but which form it will take is not clear (I always use a
checksummed container format for important long term data).
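
As a minimal example of the kind of end-to-end check I mean (the
tools and paths are just an illustration, not an endorsement of a
particular format):

  # Build a checksum manifest for an archive tree, then verify it later.
  cd /archive && find . -type f ! -name MANIFEST.sha256 -print0 \
      | xargs -0 sha256sum > MANIFEST.sha256
  # Verification pass; with --quiet only failures are printed.
  cd /archive && sha256sum -c --quiet MANIFEST.sha256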


Re: md RAID 10 on Linux 2.6.20?

2007-11-24 Thread Peter Grandi
 On Thu, 22 Nov 2007 22:09:27 -0500, [EMAIL PROTECTED]
 said:

 [ ... ] a RAID 10 personality defined in md that can be
 implemented using mdadm. If so, is it available in 2.6.20.11,
 [ ... ]

Very good choice about 'raid10' in general. For a single layer
just use '-l raid10'. Run 'man mdadm' and see the '-l' option,
and also the '-p' option for the more exotic variants; also see
the RAID10 section of 'man 4 md'.

The pairs are formed naturally out of the block device list
(first with second listed, and so on).

 8 drive RAID 10 would actually consist of 5 md devices (four
 RAID 1's and one RAID 0). [ ... ] one RAID 10, that would of
 course be better both in terms of management and probably
 performance I would guess. [ ... ]

Indeed easier in terms of management and there are some
interesting options for layout. Not sure about performance, as
sometimes I have seen strange interactions with the page cache
either way, but usually '-l raid10' is the way to go as you say.


Re: slow raid5 performance

2007-10-22 Thread Peter Grandi
 On Mon, 22 Oct 2007 15:33:09 -0400 (EDT), Justin Piszcz
 [EMAIL PROTECTED] said:

[ ... speed difference between PCI and PCIe RAID HAs ... ]

 I recently built a 3 drive RAID5 using the onboard SATA
 controllers on an MCP55 based board and get around 115MB/s
 write and 141MB/s read.  A fourth drive was added some time
 later and after growing the array and filesystem (XFS), saw
 160MB/s write and 178MB/s read, with the array 60% full.

jpiszcz Yes, your chipset must be PCI-e based and not PCI.

Broadly speaking yes (the MCP55 is a PCIe chipset), but it is
more complicated than that. The south bridge chipset host
adapters often have a rather faster link to memory and the CPU
interconnect than the PCI or PCIe buses can provide, even when
they are externally ''PCI''.

Also, when the RAID HA is not in-chipset, it matters a fair bit
how many lanes the PCIe slot it is plugged into has (or whether
it is PCI-X at 64 bits and 66MHz) -- most PCIe RAID HAs can use
4 or 8 lanes (or the equivalent for PCI-X).
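
A quick way to see what a slot has actually negotiated, as a
sketch (the PCI address is a placeholder; take it from the first
command's output):

  lspci | grep -i raid
  # Look at LnkCap/LnkSta for the negotiated width, e.g. "Width x4".
  lspci -vv -s 03:00.0 | grep -E 'LnkCap|LnkSta'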



Re: slow raid5 performance

2007-10-20 Thread Peter Grandi
 On Thu, 18 Oct 2007 16:45:20 -0700 (PDT), nefilim
 [EMAIL PROTECTED] said:

[ ... ]

 3 x 500GB WD RE2 hard drives
 AMD Athlon XP 2400 (2.0Ghz), 1GB RAM
[ ... ]
 avg-cpu:  %user   %nice %system %iowait  %steal   %idle
    1.01    0.00   55.56   40.40    0.00    3.03
[ ... ]
 which is pretty much what I see with hdparm etc. 32MB/s seems
 pretty slow for drives that can easily do 50MB/s each. Read
 performance is better around 85MB/s (although I expected
 somewhat higher).

 So it doesn't seem that PCI bus is limiting factor here

Most 500GB drives can do 60-80MB/s on the outer tracks
(30-40MB/s on the inner ones), and 3 together can easily swamp
the PCI bus. While you see the write rates of two disks, the OS
is really writing to all three disks at the same time, and it
will do read-modify-write unless the writes are exactly stripe
aligned. When RMW happens write speed is lower than writing to a
single disk.

 I see a lot of time being spent in the kernel.. and a
 significant iowait time.

The system time is because the Linux page cache etc. is CPU
bound (never mind RAID5 XOR computation, which is not that
big). The IO wait is because IO is taking place.

  http://www.sabi.co.uk/blog/anno05-4th.html#051114

Almost all kernel developers of note have been hired by wealthy
corporations who sell to people buying large servers. The
typical systems that these developers have, and also target, are
high-end 2-4 CPU workstations and servers, with CPUs many times
faster than your PC, and on those systems the CPU overhead of
the page cache at speeds like yours is less than 5%.

My impression is that something that takes less than 5% on a
developer's system does not get looked at, even if it takes 50%
on your system. The Linux kernel was very efficient when most
developers were using old cheap PCs themselves; scratch your
itch rules.

Anyhow, try to bypass the page cache with 'O_DIRECT', or test
with 'dd oflag=direct' and similar for an alternative code path.
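
For example (a sketch; the target path is a placeholder and the
file must be large enough to defeat caching):

  # Write and read tests that bypass the page cache:
  dd if=/dev/zero of=/data/testfile bs=1M count=2048 oflag=direct
  dd if=/data/testfile of=/dev/null bs=1M iflag=direct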

 The CPU is pretty old but where exactly is the bottleneck?

Misaligned writes and page cache CPU time most likely.
