Re: RAID5 to RAID6 reshape?

2008-02-25 Thread Nagilum

- Message from [EMAIL PROTECTED] -
Date: Mon, 25 Feb 2008 00:10:07 +
From: Peter Grandi [EMAIL PROTECTED]
Reply-To: Peter Grandi [EMAIL PROTECTED]
 Subject: Re: RAID5 to RAID6 reshape?
  To: Linux RAID linux-raid@vger.kernel.org



On Sat, 23 Feb 2008 21:40:08 +0100, Nagilum
[EMAIL PROTECTED] said:


[ ... ]


* Doing unaligned writes on a 13+1 or 12+2 is catastrophically
slow because of the RMW cycle. This is of course independent
of how one got to something like a 13+1 or a 12+2.


nagilum Changing a single byte in a 2+1 raid5 or a 13+1 raid5
nagilum requires exactly two 512byte blocks to be read and
nagilum written from two different disks. Changing two bytes
nagilum which are unaligned (the last and first byte of two
nagilum consecutive stripes) doubles those figures, but more
nagilum disks are involved.

Here you are using the astute misdirection of talking about
unaligned *byte* *updates* when the issue is unaligned
*stripe* *writes*.


Which are (imho) much less likely to occur than minor changes
within a block (think touch, mv, chown, chmod, etc.).



If one used your scheme to write a 13+1 stripe one block at a
time, it would take 26R+26W operations (about half of which
could be cached) instead of the 14W required when doing
aligned stripe writes, which is what good file systems try to
achieve.

But enough of talking about absurd cases, let's do a good clear
example of why a 13+1 is bad bad bad when doing unaligned writes.

Consider writing to a 2+1 and a 13+1 just 15 blocks in 4+4+4+3
bunches, starting with block 0 (so aligned start, unaligned
bunch length, unaligned total length), a random case but quite
illustrative:

  2+1:
00 01 P1 03 04 P2 06 07 P3 09 10 P4
00 01    02 03    04 05    06 07
--**---** --**---**
12 13 P5 15 16 P6 18 19 P7 21 22 P8
08 09    10 11    12 13    14
--**---** --**---**

write D00 D01 DP1
write D03 D04 DP2

write D06 D07 DP3
write D09 D10 DP4

write D12 D13 DP5
write D15 D16 DP6

write D18 D19 DP7
read  D21 DP8
write D21 DP8

Total:
  IOP: 01 reads, 08 writes
  BLK: 02 reads, 23 writes
  XOR: 28 reads, 15 writes

 13+1:
00 01 02 03 04 05 06 07 08 09 10 11 12 P1
00 01 02 03 04 05 06 07 08 09 10 11 12
--- --- --- -- **

14 15 16 17 18 19 20 21 22 23 24 25 26 P2
13 14
-  **

read  D00 D01 D02 D03 DP1
write D00 D01 D02 D03 DP1

read  D04 D05 D06 D07 DP1
write D04 D05 D06 D07 DP1

read  D08 D09 D10 D11 DP1
write D08 D09 D10 D11 DP1

read  D12 DP1 D14 D15 DP2
write D12 DP1 D14 D15 DP2

Total:
  IOP: 04 reads, 04 writes
  BLK: 20 reads, 20 writes
  XOR: 34 reads, 10 writes


and now the same with cache:

write D01 D02 D03 D04 D05 D06 D07 D08 D09 D10 D11 D12 D13 DP1
read  D14 D15 DP2
write D14 D15 DP2
Total:
  IOP: 01 reads, 02 writes
  BLK: 03 reads, 18 writes
  XOR: not sure what you're calculating here, but it's mostly
irrelevant anyway; even my old Athlon 500MHz can XOR 2.6GB/s iirc.
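
(If anyone wants to check their own box: the md driver measures its
XOR routines at boot, so something like the following should show the
rate it picked - exact output format varies by kernel version:)

$ dmesg | grep -i xor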



The short stripe size means that one does not need to RMW in
many cases, just W; and this despite the much higher redundancy
of 2+1. It also means that there are lots of parity blocks to
compute and write. With a 4 block operation length a 3+1 or even
more a 4+1 would be flattered here, but I wanted to exemplify
two extremes.


With a write cache the picture looks a bit better. If the writes
happen close enough together (temporally) they will be joined. If
they are further apart, chances are the write speed is not that
critical anyway.



The narrow parallelism, and thus short stripe length, of 2+1 means
that far fewer blocks get transferred because there is almost no
RMW, but it does 9 IOPs while the 13+1 does one fewer at 8 (wider
parallelism); but then the 2+1 IOPs are mostly in back-to-back
write pairs, while the 13+1 ones are in read-rewrite pairs, which
is a significant disadvantage (often greatly underestimated).

Never mind that the number of IOPs is almost the same despite
the large difference in width, and that with the same disks as a
13+1 one can build something like four 2+1/3+1 arrays, thus
gaining a lot of parallelism across threads, if there is any to
be obtained. And if one really wants to write long stripes, one
should use RAID10 of course, not long stripes with a single (or
two) parity blocks.




Never mind that the chances of finding in the IO request stream
a set of back-to-back logical writes to 13 contiguous blocks
starting aligned on a 13-block multiple are bound to be lower
than those of getting a set of 2 or 3 blocks, and even worse
with a filesystem mostly built for the wrong stripe alignment.


I have yet

Re: RAID5 to RAID6 reshape?

2008-02-24 Thread Peter Grandi
 On Sat, 23 Feb 2008 21:40:08 +0100, Nagilum
 [EMAIL PROTECTED] said:

[ ... ]

 * Doing unaligned writes on a 13+1 or 12+2 is catastrophically
 slow because of the RMW cycle. This is of course independent
 of how one got to something like a 13+1 or a 12+2.

nagilum Changing a single byte in a 2+1 raid5 or a 13+1 raid5
nagilum requires exactly two 512byte blocks to be read and
nagilum written from two different disks. Changing two bytes
nagilum which are unaligned (the last and first byte of two
nagilum consecutive stripes) doubles those figures, but more
nagilum disks are involved.

Here you are using the astute misdirection of talking about
unaligned *byte* *updates* when the issue is unaligned
*stripe* *writes*.

If one used your scheme to write a 13+1 stripe one block at a
time, it would take 26R+26W operations (about half of which
could be cached) instead of the 14W required when doing
aligned stripe writes, which is what good file systems try to
achieve.
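
(For what it's worth, the usual way to tell the filesystem about the
geometry at mkfs time is along these lines - the option values below
merely illustrate a 64KiB-chunk 13+1 set with a 4KiB block size:)

# ext3: stride = chunk size / block size = 64KiB / 4KiB = 16 blocks
mke2fs -j -b 4096 -E stride=16 /dev/md0
# XFS: stripe unit = chunk size, stripe width = number of data disks
mkfs.xfs -d su=64k,sw=13 /dev/md0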

Well, 26R+26W may be a caricature, but the problem is that even
if one bunches updates of N blocks into a 'read N blocks plus
parity, write N blocks plus parity' operation, it is still RMW,
just a smaller RMW than a full-stripe RMW.

And reading before writing can kill write performance, because
it is a two-pass algorithm and a two-pass algorithm is pretty
bad news for disk work, and even more so, given most OS and disk
elevator algorithms, for one pass of reads and one of writes
dependent on the reads.

But enough of talking about absurd cases, let's do a good clear
example of why a 13+1 is bad bad bad when doing unaligned writes.

Consider writing to a 2+1 and a 13+1 just 15 blocks in 4+4+4+3
bunches, starting with block 0 (so aligned start, unaligned
bunch length, unaligned total length), a random case but quite
illustrative:

  2+1:
00 01 P1 03 04 P2 06 07 P3 09 10 P4
00 01    02 03    04 05    06 07
--**---** --**---**
12 13 P5 15 16 P6 18 19 P7 21 22 P8
08 09    10 11    12 13    14
--**---** --**---**

write D00 D01 DP1
write D03 D04 DP2

write D06 D07 DP3
write D09 D10 DP4

write D12 D13 DP5
write D15 D16 DP6

write D18 D19 DP7
read  D21 DP8
write D21 DP8

Total:
  IOP: 01 reads, 08 writes
  BLK: 02 reads, 23 writes
  XOR: 28 reads, 15 writes

 13+1:
00 01 02 03 04 05 06 07 08 09 10 11 12 P1
00 01 02 03 04 05 06 07 08 09 10 11 12
--- --- --- -- **

14 15 16 17 18 19 20 21 22 23 24 25 26 P2
13 14
-  **

read  D00 D01 D02 D03 DP1
write D00 D01 D02 D03 DP1

read  D04 D05 D06 D07 DP1
write D04 D05 D06 D07 DP1

read  D08 D09 D10 D11 DP1
write D08 D09 D10 D11 DP1

read  D12 DP1 D14 D15 DP2
write D12 DP1 D14 D15 DP2

Total:
  IOP: 04 reads, 04 writes
  BLK: 20 reads, 20 writes
  XOR: 34 reads, 10 writes

The short stripe size means that one does not need to RMW in
many cases, just W; and this despite the much higher redundancy
of 2+1. It also means that there are lots of parity blocks to
compute and write. With a 4 block operation length a 3+1 or even
more a 4+1 would be flattered here, but I wanted to exemplify
two extremes.

The narrow parallelism, and thus short stripe length, of 2+1 means
that far fewer blocks get transferred because there is almost no
RMW, but it does 9 IOPs while the 13+1 does one fewer at 8 (wider
parallelism); but then the 2+1 IOPs are mostly in back-to-back
write pairs, while the 13+1 ones are in read-rewrite pairs, which
is a significant disadvantage (often greatly underestimated).

Never mind that the number of IOPs is almost the same despite
the large difference in width, and that with the same disks as a
13+1 one can build something like four 2+1/3+1 arrays, thus
gaining a lot of parallelism across threads, if there is any to
be obtained. And if one really wants to write long stripes, one
should use RAID10 of course, not long stripes with a single (or
two) parity blocks.

In the above example the length of the transfer is not aligned
with either the 2+1 or 13+1 stripe length; if the starting block
is unaligned too, then things look worse for 2+1, but that is a
pathologically bad case (and at the same time a pathologically
good case for 13+1):

  2+1:
00 01 P1|03 04 P2|06 07 P3|09 10 P4|12
   00   |01 02   |03 04   |05 06   |07
   ---**|--**|-- ---**|--**|--
13 P5|15 16 P6|18 19 P7|21 22 P8
08   |09 10   |11 12   |13 14
---**|--**|-- ---**|--**

read  D01 DP1
read  D06 DP3
write D01 DP1
write D03 D04 DP2
write D06 DP3

read  D07 DP3
read  D12 DP5
write D07 DP3
write D09 D10 DP4
write D12 DP5

read  D13 DP5

Re: RAID5 to RAID6 reshape?

2008-02-22 Thread Peter Grandi
[ ... ]

 * Suppose you have a 2+1 array which is full. Now you add a
 disk and that means that almost all free space is on a single
 disk. The MD subsystem has two options as to where to add
 that lump of space, consider why neither is very pleasant.

 No, only one, at the end of the md device and the free space
 will be evenly distributed among the drives.

Not necessarily, however let's assume that happens.

Since the free space will have a different distribution, the
used space will too, so the physical layout will evolve like
this when growing a 2+1+1 into a 3+1:

   2+1+1         3+1
  a b c d       a b c d
  -------       -------
  0 1 P F       0 1 2 Q      P: old parity
  P 2 3 F       Q 3 4 5      F: free block
  4 P 5 F       6 Q 7 8      Q: new parity
  . . . .       . . . .
                F F F F

How will the free space become evenly distributed among the
drives? Well, sounds like 3 drives will be read (2 if not
checking parity) and 4 drives written; while on a 3+1 a mere
parity rebuild only writes to 1 drive at a time, even if it
reads from 3, and a recovery reads from 3 and writes to 2 drives.

Is that a pleasant option? To me it looks like begging for
trouble. For one thing the highest likelihood of failure is
when a lot of disks start running together doing much the same
things. RAID is based on the idea of uncorrelated failures...

  An aside: in my innocence I realized only recently that online
  redundancy and uncorrelated failures are somewhat contradictory.

Never mind that since one is changing the layout an interruption
in the process may leave the array unusable, even if with no
loss of data, though recent MD versions mostly cope; from a
recent 'man' page for 'mdadm':

 «Increasing the number of active devices in a RAID5 is much
  more effort.  Every block in the array will need to be read
  and written back to a new location.

  From 2.6.17, the Linux Kernel is able to do this safely,
  including restarting an interrupted reshape.

  When relocating the first few stripes on a raid5, it is not
  possible to keep the data on disk completely consistent and
  crash-proof. To provide the required safety, mdadm disables
  writes to the array while this critical section is reshaped,
  and takes a backup of the data that is in that section.

  This backup is normally stored in any spare devices that the
  array has, however it can also be stored in a separate file
  specified with the --backup-file option.»
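
(In concrete terms the operation being discussed is something like
the following, with device names purely illustrative:)

# add the new disk as a spare, then reshape the array onto it
mdadm /dev/md0 --add /dev/sde1
mdadm --grow /dev/md0 --raid-devices=4 --backup-file=/root/md0-grow.backup
# the reshape progress can be watched in /proc/mdstat
cat /proc/mdstat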

Since the reshape reads from N drives *and then writes* to N+1 at
almost the same time, things are going to be a bit slower than a
mere rebuild or recovery: each stripe will be read from the N
existing drives and then written back to N+1 *while the next
stripe is being read from N* (or not...).

 * How fast is doing unaligned writes with a 13+1 or a 12+2
 stripe? How often is that going to happen, especially on an
 array that started as a 2+1?

 They are all the same speed with raid5 no matter what you
 started with.

But I asked two questions that are not 'how does the speed
differ'. The two answers to the questions I asked are very
different from 'the same speed' (they are 'very slow' and
'rather often'):

* Doing unaligned writes on a 13+1 or 12+2 is catastrophically
  slow because of the RMW cycle. This is of course independent
  of how one got to something like a 13+1 or a 12+2.

* Unfortunately the frequency of unaligned writes *does* usually
  depend on how dementedly one got to the 13+1 or 12+2 case:
  because a filesystem that lays out files so that misalignment
  is minimised with a 2+1 stripe just about guarantees that when
  one switches to a 3+1 stripe all previously written data is
  misaligned, and so on -- and never mind that every time one
  adds a disk a reshape is done that shuffles stuff around.

There is a saving grace as to the latter point: many programs
don't overwrite files in place but truncate and recreate them
(which is not so good in general, but helps in this case).

 You read two blocks and you write two blocks. (not even chunks
 mind you)

But we are talking about a *reshape* of a RAID5 here. If you
add a drive to a RAID5 and redistribute in the obvious way then
existing stripes have to be rewritten as the periodicity of the
parity changes from every N to every N+1.

 * How long does it take to rebuild parity with a 13+1 array
 or a 12+2 array in case of a single disk failure? What happens
 if a disk fails during rebuild?

 Depends on how much data the controllers can push. But at
 least with my hpt2320 the limiting factor is the disk speed

But here we are on the Linux RAID mailing list and we are
talking about software RAID. With software RAID a reshape with
14 disks needs to shuffle around the *host bus* (not merely the
host adapter as with hw RAID) almost 5 times as much data as
with 3 (say 14x80MB/s ~= 1GB/s sustained in both directions at
the outer tracks). The host adapter also has to be able to run
14 operations in parallel.

It can be done -- it is just somewhat expensive, 

Re: LVM performance (was: Re: RAID5 to RAID6 reshape?)

2008-02-21 Thread Peter Grandi
 This might be related to raid chunk positioning with respect
 to LVM chunk positioning. If they interfere there indeed may
 be some performance drop. Best to make sure that those chunks
 are aligned together.

 Interesting. I'm seeing a 20% performance drop too, with default
 RAID and LVM chunk sizes of 64K and 4M, respectively. Since 64K
 divides 4M evenly, I'd think there shouldn't be such a big
 performance penalty. [ ... ]

Those are as such not very meaningful. What matters most is
whether the starting physical address of each logical volume
extent is stripe aligned (and whether the filesystem makes use
of that) and then the stripe size of the parity RAID set, not
the chunk sizes in themselves.

I am often surprised by how many people who use parity RAID
don't seem to realize the crucial importance of physical stripe
alignment, but I am getting used to it.

Because of stripe alignment it is usually better to build parity
arrays on top of partitions or volumes than vice versa, as it is
often more difficult to align the start of a partition or volume
to the underlying stripes than the reverse.
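
(A quick sanity check, for those who want one, is to look at where
the partitions and the LVM data area actually start - field and
option names as in reasonably recent fdisk/LVM2 tools:)

# partition start sectors, to verify they sit on a stripe boundary
fdisk -lu /dev/sda
# offset of the first physical extent within each PV
pvs -o pv_name,pe_start --units k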

But then those who understand the vital importance of stripe
aligned writes for parity RAID often avoid using parity RAID
:-).


LVM performance (was: Re: RAID5 to RAID6 reshape?)

2008-02-19 Thread Oliver Martin

Janek Kozicki schrieb:

hold on. This might be related to raid chunk positioning with respect
to LVM chunk positioning. If they interfere there indeed may be some
performance drop. Best to make sure that those chunks are aligned together.


Interesting. I'm seeing a 20% performance drop too, with default RAID 
and LVM chunk sizes of 64K and 4M, respectively. Since 64K divides 4M 
evenly, I'd think there shouldn't be such a big performance penalty.
It's not like I care that much, I only have 100 Mbps ethernet anyway. 
I'm just wondering...


$ hdparm -t /dev/md0

/dev/md0:
 Timing buffered disk reads:  148 MB in  3.01 seconds =  49.13 MB/sec

$ hdparm -t /dev/dm-0

/dev/dm-0:
 Timing buffered disk reads:  116 MB in  3.04 seconds =  38.20 MB/sec

dm doesn't do anything fancy to justify the drop (encryption etc). In 
fact, it doesn't do much at all yet - I intend to use it to join 
multiple arrays in the future when I have drives of different sizes, but 
right now, I only have 500GB drives. So it's just one PV in one VG in 
one LV.


Here's some more info:

$ mdadm -D /dev/md0
/dev/md0:
Version : 00.90.03
  Creation Time : Sat Nov 24 12:15:48 2007
 Raid Level : raid5
 Array Size : 976767872 (931.52 GiB 1000.21 GB)
  Used Dev Size : 488383936 (465.76 GiB 500.11 GB)
   Raid Devices : 3
  Total Devices : 3
Preferred Minor : 0
Persistence : Superblock is persistent

Update Time : Tue Feb 19 01:18:26 2008
  State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

 Layout : left-symmetric
 Chunk Size : 64K

   UUID : d41fe8a6:84b0f97a:8ac8b21a:819833c6 (local to host 
quassel)

 Events : 0.330016

Number   Major   Minor   RaidDevice State
   0   8   170  active sync   /dev/sdb1
   1   8   331  active sync   /dev/sdc1
   2   8   812  active sync   /dev/sdf1

$ pvdisplay
  --- Physical volume ---
  PV Name   /dev/md0
  VG Name   raid
  PV Size   931,52 GB / not usable 2,69 MB
  Allocatable   yes (but full)
  PE Size (KByte)   4096
  Total PE  238468
  Free PE   0
  Allocated PE  238468
  PV UUID   KadH5k-9Cie-dn5Y-eNow-g4It-lfuI-XqNIet

$ vgdisplay
  --- Volume group ---
  VG Name   raid
  System ID
  Formatlvm2
  Metadata Areas1
  Metadata Sequence No  4
  VG Access read/write
  VG Status resizable
  MAX LV0
  Cur LV1
  Open LV   1
  Max PV0
  Cur PV1
  Act PV1
  VG Size   931,52 GB
  PE Size   4,00 MB
  Total PE  238468
  Alloc PE / Size   238468 / 931,52 GB
  Free  PE / Size   0 / 0
  VG UUID   AW9yaV-B3EM-pRLN-RTIK-LEOd-bfae-3Vx3BC

$ lvdisplay
  --- Logical volume ---
  LV Name/dev/raid/raid
  VG Nameraid
  LV UUIDeWIRs8-SFyv-lnix-Gk72-Lu9E-Ku7j-iMIv92
  LV Write Accessread/write
  LV Status  available
  # open 1
  LV Size931,52 GB
  Current LE 238468
  Segments   1
  Allocation inherit
  Read ahead sectors auto
  - currently set to 256
  Block device   253:0

--
Oliver


Re: LVM performance (was: Re: RAID5 to RAID6 reshape?)

2008-02-19 Thread Jon Nelson
On Feb 19, 2008 1:41 PM, Oliver Martin
[EMAIL PROTECTED] wrote:
 Janek Kozicki schrieb:
  hold on. This might be related to raid chunk positioning with respect
  to LVM chunk positioning. If they interfere there indeed may be some
  performance drop. Best to make sure that those chunks are aligned together.

 Interesting. I'm seeing a 20% performance drop too, with default RAID
 and LVM chunk sizes of 64K and 4M, respectively. Since 64K divides 4M
 evenly, I'd think there shouldn't be such a big performance penalty.
 It's not like I care that much, I only have 100 Mbps ethernet anyway.
 I'm just wondering...

 $ hdparm -t /dev/md0

 /dev/md0:
   Timing buffered disk reads:  148 MB in  3.01 seconds =  49.13 MB/sec

 $ hdparm -t /dev/dm-0

 /dev/dm-0:
   Timing buffered disk reads:  116 MB in  3.04 seconds =  38.20 MB/sec

I'm getting better performance on a LV than on the underlying MD:

# hdparm -t /dev/md0

/dev/md0:
 Timing buffered disk reads:  408 MB in  3.01 seconds = 135.63 MB/sec
# hdparm -t /dev/raid/multimedia

/dev/raid/multimedia:
 Timing buffered disk reads:  434 MB in  3.01 seconds = 144.04 MB/sec
#

md0 is a 3-disk raid5 (64k chunk, alg. 2, with a bitmap), made up of
7200rpm SATA drives from several manufacturers.



-- 
Jon


Re: LVM performance (was: Re: RAID5 to RAID6 reshape?)

2008-02-19 Thread Iustin Pop
On Tue, Feb 19, 2008 at 01:52:21PM -0600, Jon Nelson wrote:
 On Feb 19, 2008 1:41 PM, Oliver Martin
 [EMAIL PROTECTED] wrote:
  Janek Kozicki schrieb:
 
  $ hdparm -t /dev/md0
 
  /dev/md0:
Timing buffered disk reads:  148 MB in  3.01 seconds =  49.13 MB/sec
 
  $ hdparm -t /dev/dm-0
 
  /dev/dm-0:
Timing buffered disk reads:  116 MB in  3.04 seconds =  38.20 MB/sec
 
 I'm getting better performance on a LV than on the underlying MD:
 
 # hdparm -t /dev/md0
 
 /dev/md0:
  Timing buffered disk reads:  408 MB in  3.01 seconds = 135.63 MB/sec
 # hdparm -t /dev/raid/multimedia
 
 /dev/raid/multimedia:
  Timing buffered disk reads:  434 MB in  3.01 seconds = 144.04 MB/sec
 #

As people are trying to point out in many lists and docs: hdparm is
*not* a benchmark tool. So its numbers, while interesting, should not be
regarded as a valid comparison.
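
(If one just wants a quick sequential-read number that at least
bypasses the page cache, something along these lines - device paths
as in the mails above - is already a bit more meaningful, though a
proper benchmark tool is still better:)

dd if=/dev/md0 of=/dev/null bs=1M count=1024 iflag=direct
dd if=/dev/dm-0 of=/dev/null bs=1M count=1024 iflag=direct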

Just my opinion.

regards,
iustin


Re: RAID5 to RAID6 reshape?

2008-02-19 Thread Alexander Kühn

- Message from [EMAIL PROTECTED] -
Date: Mon, 18 Feb 2008 19:05:02 +
From: Peter Grandi [EMAIL PROTECTED]
Reply-To: Peter Grandi [EMAIL PROTECTED]
 Subject: Re: RAID5 to RAID6 reshape?
  To: Linux RAID linux-raid@vger.kernel.org



On Sun, 17 Feb 2008 07:45:26 -0700, Conway S. Smith
[EMAIL PROTECTED] said:



Consider for example the answers to these questions:

* Suppose you have a 2+1 array which is full. Now you add a disk
  and that means that almost all free space is on a single disk.
  The MD subsystem has two options as to where to add that lump
  of space, consider why neither is very pleasant.


No, only one, at the end of the md device and the free space will be  
evenly distributed among the drives.



* How fast is doing unaligned writes with a 13+1 or a 12+2
  stripe? How often is that going to happen, especially on an
  array that started as a 2+1?


They are all the same speed with raid5 no matter what you started  
with. You read two blocks and you write two blocks. (not even chunks  
mind you)



* How long does it take to rebuild parity with a 13+1 array or a
  12+2 array in case of a single disk failure? What happens if a
  disk fails during rebuild?


Depends on how much data the controllers can push. But at least with  
my hpt2320 the limiting factor is the disk speed and that doesn't  
change whether I have 2 disks or 12.



* When you have 13 drives and you add the 14th, how long does
  that take? What happens if a disk fails during rebuild??


..again, pretty much the same as adding a fourth drive to a three-drive raid5.
It will continue to be degraded.. nothing special.


beolach Well, I was reading that LVM2 had a 20%-50% performance
beolach penalty, which in my mind is a really big penalty. But I
beolach think those numbers were from some time ago, has the
beolach situation improved?

LVM2 relies on DM, which is not much slower than say 'loop', so
it is almost insignificant for most people.


I agree.


But even if the overhead may be very very low, DM/LVM2/EVMS seem
to me to have very limited usefulness (e.g. Oracle tablespaces,
and there are contrary opinions as to that too). In your stated
applications it is hard to see why you'd want to split your
arrays into very many block devices or why you'd want to resize
them.


I think the idea is to be able to have more than just one device to
put a filesystem on. For example a / filesystem, swap and maybe
something like /storage come to mind. Yes, one could do that with
partitioning, but LVM was made for this, so why not use it.
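
Roughly like this, with the volume group name and sizes being just
an example:

pvcreate /dev/md0
vgcreate vg0 /dev/md0
lvcreate -L 20G  -n root    vg0
lvcreate -L 4G   -n swap    vg0
lvcreate -L 400G -n storage vg0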


The situation looks different with raid6: there the write penalty
becomes higher with more disks, but not with raid5.

Regards,
Alex.

- End message from [EMAIL PROTECTED] -




- --
Alexander Kuehn

Cell phone: +49 (0)177 6461165
Cell fax:   +49 (0)177 6468001
Tel @Home:  +49 (0)711 6336140
Mail mailto:[EMAIL PROTECTED]



cakebox.homeunix.net - all the machine one needs..





Re: RAID5 to RAID6 reshape?

2008-02-18 Thread Beolach
On Feb 17, 2008 10:26 PM, Janek Kozicki [EMAIL PROTECTED] wrote:
 Conway S. Smith said: (by the date of Sun, 17 Feb 2008 07:45:26 -0700)

  Well, I was reading that LVM2 had a 20%-50% performance penalty,

 huh? Make a benchmark. Do you really think that anyone would be using
 it if there was any penalty bigger than 1-2% ? (random access, r/w).

 I have no idea what is the penalty, but I'm totally sure I didn't
 notice it.


(Oops, replied straight to Janek, rather than the list.  Sorry.)

I saw those numbers in a few places, the only one I can remember off
the top of my head was the Gentoo-Wiki:
http://gentoo-wiki.com/HOWTO_Gentoo_Install_on_Software_RAID_mirror_and_LVM2_on_top_of_RAID.
Looking at its history, that warning was added back on 23 Dec. 2006,
so it could very well be out-of-date.  Good to hear you don't notice
any performance drop.  I think I will try to run some benchmarks.
What do you guys recommend using for benchmarking?  Plain dd,
bonnie++?
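
I was thinking of something along these lines (mount point and sizes
are just placeholders; the test file should be a good bit bigger than
RAM so caching doesn't skew the numbers):

# sequential write and read with dd
dd if=/dev/zero of=/mnt/raid/testfile bs=1M count=8192 conv=fdatasync
dd if=/mnt/raid/testfile of=/dev/null bs=1M

# bonnie++ against the same filesystem (size in MB)
bonnie++ -d /mnt/raid -s 8192 -u nobody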


Conway S. Smith


Re: RAID5 to RAID6 reshape?

2008-02-18 Thread Andre Noll
On 17:40, Mark Hahn wrote:
 Question to other people here - what is the maximum partition size
 that ext3 can handle, am I correct it 4 TB ?
 
 8 TB.  people who want to push this are probably using ext4 already.

ext3 has supported up to 16T for quite some time. It works fine for me:

[EMAIL PROTECTED]:~ # mount |grep sda; df /dev/sda; uname -a; uptime
/dev/sda on /media/bia type ext3 (rw)
FilesystemSize  Used Avail Use% Mounted on
/dev/sda   15T  7.8T  7.0T  53% /media/bia
Linux ume 2.6.20.12 #3 SMP Tue Jun 5 14:33:44 CEST 2007 x86_64 GNU/Linux
 13:44:29 up 236 days, 15:12,  9 users,  load average: 10.47, 10.28, 10.17

Andre
-- 
The only person who always got his work done by Friday was Robinson Crusoe




Re: RAID5 to RAID6 reshape?

2008-02-18 Thread Janek Kozicki
Beolach said: (by the date of Mon, 18 Feb 2008 05:38:15 -0700)

 On Feb 17, 2008 10:26 PM, Janek Kozicki [EMAIL PROTECTED] wrote:
  Conway S. Smith said: (by the date of Sun, 17 Feb 2008 07:45:26 -0700)
 
   Well, I was reading that LVM2 had a 20%-50% performance penalty,
 http://gentoo-wiki.com/HOWTO_Gentoo_Install_on_Software_RAID_mirror_and_LVM2_on_top_of_RAID.

hold on. This might be related to raid chunk positioning with respect
to LVM chunk positioning. If they interfere there indeed may be some
performance drop. Best to make sure that those chunks are aligned together.

-- 
Janek Kozicki |


Re: RAID5 to RAID6 reshape?

2008-02-18 Thread Mark Hahn

8 TB.  people who want to push this are probably using ext4 already.


ext3 has supported up to 16T for quite some time. It works fine for me:


thanks.  16 makes sense (2^32 * 4k blocks).


Re: RAID5 to RAID6 reshape?

2008-02-18 Thread Peter Grandi
 On Sun, 17 Feb 2008 07:45:26 -0700, Conway S. Smith
 [EMAIL PROTECTED] said:

[ ... ]

beolach Which part isn't wise? Starting w/ a few drives w/ the
beolach intention of growing; or ending w/ a large array (IOW,
beolach are 14 drives more than I should put in 1 array  expect
beolach to be safe from data loss)?

Well, that rather depends on what is your intended data setup and
access patterns, but the above are all things that may be unwise
in many cases. The intended use mentioned below does not require
a single array for example.

However while doing the above may make sense in *some* situation,
I reckon that the number of those situations is rather small.

Consider for example the answers to these questions:

* Suppose you have a 2+1 array which is full. Now you add a disk
  and that means that almost all free space is on a single disk.
  The MD subsystem has two options as to where to add that lump
  of space, consider why neither is very pleasant.

* How fast is doing unaligned writes with a 13+1 or a 12+2
  stripe? How often is that going to happen, especially on an
  array that started as a 2+1?

* How long does it take to rebuild parity with a 13+1 array or a
  12+2 array in case of a single disk failure? What happens if a
  disk fails during rebuild?

* When you have 13 drives and you add the 14th, how long does
  that take? What happens if a disk fails during rebuild??

The points made by http://WWW.BAARF.com/ apply too.

beolach [ ... ] media files that would typically be accessed
beolach over the network by MythTV boxes.  I'll also be using
beolach it as a sandbox database/web/mail server. [ ... ] most
beolach important stuff backed up, [ ... ] some gaming, which
beolach is where I expect performance to be most noticeable.

To me that sounds like something that could well be split across
multiple arrays, rather than risking repeatedly extending a
single array, and then risking a single large array.

beolach Well, I was reading that LVM2 had a 20%-50% performance
beolach penalty, which in my mind is a really big penalty. But I
beolach think those numbers were from some time ago, has the
beolach situation improved?

LVM2 relies on DM, which is not much slower than say 'loop', so
it is almost insignificant for most people.

But even if the overhead may be very very low, DM/LVM2/EVMS seem
to me to have very limited usefulness (e.g. Oracle tablespaces,
and there are contrary opinions as to that too). In your stated
applications it is hard to see why you'd want to split your
arrays into very many block devices or why you'd want to resize
them.

beolach And is a 14 drive RAID6 going to already have enough
beolach overhead that the additional overhead isn't very
beolach significant? I'm not sure why you say it's amusing.

Consider the questions above. Parity RAID has issues, extending
an array has issues, and the idea of extending a parity RAID both
massively and in several steps looks very amusing to me.

beolach [ ... ] The other reason I wasn't planning on using LVM
beolach was because I was planning on keeping all the drives in
beolach the one RAID. [... ]

Good luck :-).


Re: RAID5 to RAID6 reshape?

2008-02-17 Thread Peter Grandi
 On Sat, 16 Feb 2008 20:58:07 -0700, Beolach
 [EMAIL PROTECTED] said:

beolach [ ... ] start w/ 3 drives in RAID5, and add drives as I
beolach run low on free space, eventually to a total of 14
beolach drives (the max the case can fit).

Like for so many other posts to this list, all that is
syntactically valid is not necessarily the same thing as that
which is wise.

beolach But when I add the 5th or 6th drive, I'd like to switch
beolach from RAID5 to RAID6 for the extra redundancy.

Again, what may be possible is not necessarily what may be wise.

In particular it seems difficult to discern what usage such
arrays would be put to. There might be a bit of difference
between a giant FAT32 volume containing song lyrics files and an
XFS filesystem with a collection of 500GB tomography scans
cached in it from a large tape backup system.

beolach I'm also interested in hearing people's opinions about
beolach LVM / EVMS.

They are yellow, and taste of vanilla :-). To say something more
specific is difficult without knowing what kind of requirement
they may be expected to satisfy.

beolach I'm currently planning on just using RAID w/out the
beolach higher level volume management, as from my reading I
beolach don't think they're worth the performance penalty, [
beolach ... ]

Very amusing that someone who is planning to grow a 3 drive
RAID5 into a 14 drive RAID6 worries about the DM performance
penalty.


Re: RAID5 to RAID6 reshape?

2008-02-17 Thread Janek Kozicki
Beolach said: (by the date of Sat, 16 Feb 2008 20:58:07 -0700)

 I'm also interested in hearing people's opinions about LVM / EVMS.

With LVM it will be possible for you to have several raid5 and raid6
arrays: e.g. 5 HDDs (raid6), 5 HDDs (raid6) and 4 HDDs (raid5). Here
you would have 14 HDDs with five of them being extra - for
safety/redundancy purposes.

LVM allows you to join several block devices and create one huge
partition on top of them. Without LVM you will end up with raid6 on
14 HDDs, thus having only 2 drives used for redundancy. Quite risky
IMHO.
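
For example, something like this joins three arrays into one big
volume (names and sizes are only placeholders):

pvcreate /dev/md0 /dev/md1 /dev/md2
vgcreate bigvg /dev/md0 /dev/md1 /dev/md2
lvcreate -L 2T -n data bigvg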

It is quite often that a *whole* IO controller dies and takes all 4
drives with it. So when you connect your drives, always make sure
that you are totally safe if any of your IO controllers dies (taking
down 4 HDDs with it). With 5 redundant discs this may be possible to
solve. Of course when you replace the controller the discs are up
again, and only need to resync (which is done automatically).

LVM can be grown on-line (without rebooting the computer) to join
new block devices. After that you only `resize2fs /dev/...` and
your partition is bigger. Also, in such a configuration I suggest
you use the ext3 fs, because no other fs (XFS, JFS, whatever) has
had as much testing as the ext* filesystems have.
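
Roughly like this (names are just an example):

pvcreate /dev/md1                  # the new array becomes a PV
vgextend vg0 /dev/md1              # add it to the existing volume group
lvextend -L +500G /dev/vg0/data    # grow the logical volume
resize2fs /dev/vg0/data            # grow ext3; recent kernels can do this while mounted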


Question to other people here - what is the maximum partition size
that ext3 can handle? Am I correct that it is 4 TB?

And to go above 4 TB we need to use ext4dev, right?

best regards
-- 
Janek Kozicki |


Re: RAID5 to RAID6 reshape?

2008-02-17 Thread Janek Kozicki
Beolach said: (by the date of Sat, 16 Feb 2008 20:58:07 -0700)


 Or would I be better off starting w/ 4 drives in RAID6?

oh, right - Sevrin Robstad has a good idea to solve your problem -
create a raid6 with one missing member, and add this member when you
have it, next year or so.
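
Something like this, with the device names being placeholders:

mdadm --create /dev/md0 --level=6 --raid-devices=4 \
    /dev/sdb1 /dev/sdc1 /dev/sdd1 missing
# later, when the fourth disk shows up:
mdadm /dev/md0 --add /dev/sde1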

-- 
Janek Kozicki |


Re: RAID5 to RAID6 reshape?

2008-02-17 Thread Conway S. Smith
On Sun, 17 Feb 2008 14:31:22 +0100
Janek Kozicki [EMAIL PROTECTED] wrote:
 Beolach said: (by the date of Sat, 16 Feb 2008 20:58:07 -0700)
 
  I'm also interested in hearing people's opinions about LVM / EVMS.
 
 With LVM it will be possible for you to have several raid5 and
 raid6: eg: 5 HHDs (raid6), 5HDDs (raid6) and 4 HDDs (raid5). Here
 you would have 14 HDDs and five of them being extra - for
 safety/redundancy purposes.
 
 LVM allows you to join several blockdevices and create one huge
 partition on top of them. Without LVM you will end up with raid6 on
 14 HDDs thus having only 2 drives used for redundancy. Quite risky
 IMHO.
 

I guess I'm just too reckless a guy.  I don't like having wasted
space, even though I know redundancy is by no means a waste.  And
part of me keeps thinking that the vast majority of my drives have
never failed (although a few have, including one just recently, which
is a large part of my motivation for this fileserver).  So I was
thinking RAID6, possibly w/ a hot spare or 2, would be safe enough.

Speaking of hot spares, how well would cheap external USB drives work
as hot spares?  Is that a pretty silly idea?

 It is quite often that a *whole* IO controller dies and takes all 4
 drives with it. So when you connect your drives, always make sure
 that you are totally safe if any of your IO conrollers dies (taking
 down 4 HDDs with it). With 5 redundant discs this may be possible to
 solve. Of course when you replace the controller the discs are up
 again, and only need to resync (which is done automatically).
 

That sounds scary.  Does a controller failure often cause data loss
on the disks?  My understanding was that one of the advantages of
Linux's SW RAID was that if a controller failed you could swap in
another controller, not even the same model or brand, and Linux would
reassemble the RAID.  But if a controller failure typically takes all
the data w/ it, then the portability isn't as awesome an advantage.
Is your last sentence about replacing the controller applicable to
most controller failures, or just w/ more redundant discs?  In my
situation downtime is only mildly annoying, data loss would be much
worse.

 LVM can be grown on-line (without rebooting the computer) to join
 new block devices. And after that you only `resize2fs /dev/...` and
 your partition is bigger. Also in such configuration I suggest you
 to use ext3 fs, because no other fs (XFS, JFS, whatever) had that
 much testing than ext* filesystems had.
 
 

Plain RAID5 & RAID6 are also capable of growing on-line, although I
expect it's a much more complex & time-consuming process than LVM.  I
had been planning on using XFS, but I could rethink that.  Have there
been many horror stories about XFS?

 Question to other people here - what is the maximum partition size
 that ext3 can handle, am I correct it 4 TB ?
 
 And to go above 4 TB we need to use ext4dev, right?
 

I thought it depended on CPU architecture & kernel version, w/ recent
kernels on 64-bit archs being capable of 32 TiB.  If it is only 4
TiB, I would go w/ XFS.

 oh, right - Sevrin Robstad has a good idea to solve your problem -
 create raid6 with one missing member. And add this member, when you
 have it, next year or such.
 

I thought I read that would involve a huge performance hit, since
then everything would require parity calculations.  Or would that
just be w/ 2 missing drives?


Thanks,
Conway S. Smith


Re: RAID5 to RAID6 reshape?

2008-02-17 Thread Mark Hahn

I'm also interested in hearing people's opinions about LVM / EVMS.


With LVM it will be possible for you to have several raid5 and raid6:
eg: 5 HHDs (raid6), 5HDDs (raid6) and 4 HDDs (raid5). Here you would
have 14 HDDs and five of them being extra - for safety/redundancy
purposes.


that's a very high price to pay.


partition on top of them. Without LVM you will end up with raid6 on
14 HDDs thus having only 2 drives used for redundancy. Quite risky
IMHO.


your risk model is quite strange - 5/14 redundancy means that either 
you expect a LOT of failures, or you put a huge premium on availability.
the latter is odd because normally, HA people go for replication of 
more components, not just controllers (ie, whole servers).



It is quite often that a *whole* IO controller dies and takes all 4


you appear to be using very flakey IO controllers.  are you specifically
talking about very cheap ones, or in hostile environments?


drives with it. So when you connect your drives, always make sure
that you are totally safe if any of your IO conrollers dies (taking


IO controllers are not a common failure mode, in my experience.
when it happens, it usually indicates an environmental problem
(heat, bad power, bad hotplug, etc).


Question to other people here - what is the maximum partition size
that ext3 can handle, am I correct it 4 TB ?


8 TB.  people who want to push this are probably using ext4 already.


And to go above 4 TB we need to use ext4dev, right?


or patches (which have been around and even in some production use 
for a long while.)



Re: RAID5 to RAID6 reshape?

2008-02-17 Thread Janek Kozicki
Mark Hahn said: (by the date of Sun, 17 Feb 2008 17:40:12 -0500 (EST))

  I'm also interested in hearing people's opinions about LVM / EVMS.
 
  With LVM it will be possible for you to have several raid5 and raid6:
  eg: 5 HHDs (raid6), 5HDDs (raid6) and 4 HDDs (raid5). Here you would
  have 14 HDDs and five of them being extra - for safety/redundancy
  purposes.
 
 that's a very high price to pay.
 
  partition on top of them. Without LVM you will end up with raid6 on
  14 HDDs thus having only 2 drives used for redundancy. Quite risky
  IMHO.
 
 your risk model is quite strange - 5/14 redundancy means that either 

yeah, sorry. I went too far.

I haven't had an IO controller failure so far. But I've read about
one on this list, where all data was lost.

You're right, better to duplicate a server with backup copy, so it is
independent of the original one.

-- 
Janek Kozicki |


Re: RAID5 to RAID6 reshape?

2008-02-17 Thread Neil Brown
On Sunday February 17, [EMAIL PROTECTED] wrote:
 On Sun, 17 Feb 2008 14:31:22 +0100
 Janek Kozicki [EMAIL PROTECTED] wrote:
 
  oh, right - Sevrin Robstad has a good idea to solve your problem -
  create raid6 with one missing member. And add this member, when you
  have it, next year or such.
  
 
 I thought I read that would involve a huge performance hit, since
 then everything would require parity calculations.  Or would that
 just be w/ 2 missing drives?

A raid6 with one missing drive would have a little bit of a
performance hit over raid5.

Partly there is a CPU hit to calculate the Q block which is slower
than calculating normal parity.

Partly there is the fact that raid6 never does read-modify-write
cycles, so to update one block in a stripe, it has to read all the
other data blocks.

But the worst aspect of doing this is that if you have a system crash,
you could get hidden data corruption.
After a system crash you cannot trust parity data (as it may have been
in the process of being updated) so you have to regenerate it from
known good data.  But if your array is degraded, you don't have all
the known good data, so you lose.

It is really best to avoid degraded raid4/5/6 arrays when at all
possible.

NeilBrown


Re: RAID5 to RAID6 reshape?

2008-02-17 Thread Neil Brown
On Saturday February 16, [EMAIL PROTECTED] wrote:
 found was a few months old.  Is it likely that RAID5 to RAID6
 reshaping will be implemented in the next 12 to 18 months (my rough

Certainly possible.

I won't say it is likely until it is actually done.  And by then it
will be definite :-)

i.e. no concrete plans.
It is always best to base your decisions on what is available today.


NeilBrown


Re: RAID5 to RAID6 reshape?

2008-02-17 Thread Conway S. Smith
On Sun, 17 Feb 2008 11:50:25 +
[EMAIL PROTECTED] (Peter Grandi) wrote:
  On Sat, 16 Feb 2008 20:58:07 -0700, Beolach
  [EMAIL PROTECTED] said:
 
 beolach [ ... ] start w/ 3 drives in RAID5, and add drives as I
 beolach run low on free space, eventually to a total of 14
 beolach drives (the max the case can fit).
 
 Like for so many other posts to this list, all that is
 syntactically valid is not necessarily the same thing as that
 which is wise. 
 

Which part isn't wise?  Starting w/ a few drives w/ the intention of
growing; or ending w/ a large array (IOW, are 14 drives more than I
should put in 1 array & expect to be safe from data loss)?

 beolach But when I add the 5th or 6th drive, I'd like to switch
 beolach from RAID5 to RAID6 for the extra redundancy.
 
 Again, what may be possible is not necessarily what may be wise.
 
 In particular it seems difficult to discern which usage such
 arrays would be put to. There might be a bit of difference
 between a giant FAT32 volume containing song lyrics files or an
 XFS filesystem with a collection of 500GB tomography scans in
 them cached from a large tape backup system.
 

Sorry for not mentioning, I am planning on using XFS.  Its intended
usage is general home use; probably most of the space will end up
being used by media files that would typically be accessed over the
network by MythTV boxes.  I'll also be using it as a sandbox
database/web/mail server.  Everything will just be personal stuff, so
if I did lose it all I would be very depressed, but I hopefully
will have all the most important stuff backed up, and I won't lose my
job or anything too horrible.  The main reason I'm concerned about
performance is that for some time after I buy it, it will be the
highest speced of my boxes, and so I will also be using it for some
gaming, which is where I expect performance to be most noticeable.

 beolach I'm also interested in hearing people's opinions about
 beolach LVM / EVMS.
 
 They are yellow, and taste of vanilla :-). To say something more
 specific is difficult without knowing what kind of requirement
 they may be expected to satisfy.
 
 beolach I'm currently planning on just using RAID w/out the
 beolach higher level volume management, as from my reading I
 beolach don't think they're worth the performance penalty, [
 beolach ... ]
 
 Very amusing that someone who is planning to grow a 3 drive
 RAID5 into a 14 drive RAID6 worries about the DM performance
 penalty.
 

Well, I was reading that LVM2 had a 20%-50% performance penalty,
which in my mind is a really big penalty.  But I think those numbers
were from some time ago; has the situation improved?  And is a 14
drive RAID6 going to already have enough overhead that the additional
overhead isn't very significant?  I'm not sure why you say it's
amusing.

The other reason I wasn't planning on using LVM was because I was
planning on keeping all the drives in the one RAID.  If I decide a 14
drive array is too risky, and I go w/ 2 or 3 arrays then LVM would
appear much more useful to me.


Thanks for the response,
Conway S. Smith


Re: RAID5 to RAID6 reshape?

2008-02-17 Thread Janek Kozicki
Conway S. Smith said: (by the date of Sun, 17 Feb 2008 07:45:26 -0700)

 Well, I was reading that LVM2 had a 20%-50% performance penalty,

huh? Make a benchmark. Do you really think that anyone would be using
it if there was any penalty bigger than 1-2% ? (random access, r/w).

I have no idea what is the penalty, but I'm totally sure I didn't
notice it.

-- 
Janek Kozicki |


RAID5 to RAID6 reshape?

2008-02-16 Thread Beolach
Hi list,

I'm a newbie to RAID, planning a home fileserver that will be pretty
much my first real time using RAID.  What I think I'd like to do is
start w/ 3 drives in RAID5, and add drives as I run low on free space,
eventually to a total of 14 drives (the max the case can fit).  But
when I add the 5th or 6th drive, I'd like to switch from RAID5 to
RAID6 for the extra redundancy.  As I've been researching RAID
options, I've seen that RAID5 to RAID6 migration is a planned feature,
but AFAIK it isn't implemented yet, and the most recent mention I
found was a few months old.  Is it likely that RAID5 to RAID6
reshaping will be implemented in the next 12 to 18 months (my rough
guesstimate as to when I might want to migrate from RAID5 to RAID6)?
Or would I be better off starting w/ 4 drives in RAID6?

I'm also interested in hearing people's opinions about LVM / EVMS.
I'm currently planning on just using RAID w/out the higher level
volume management, as from my reading I don't think they're worth the
performance penalty, but if anyone thinks that's a horrible mistake
I'd like to know sooner rather than later.

And if anyone has comments on good hardware to consider or bad
hardware to avoid, here's what I'm currently planning on getting:
http://secure.newegg.com/NewVersion/wishlist/PublicWishDetail.asp?WishListNumber=6134331


TIA,
Conway S. Smith