[PATCH] md: Fix bug where new drives added to an md array sometimes don't sync properly.

2006-10-05 Thread NeilBrown
There is a nasty bug in md in 2.6.18 affecting at least raid1.
This fixes it (and has already been sent to [EMAIL PROTECTED]).

### Comments for Changeset

This fixes a bug introduced in 2.6.18. 

If a drive is added to a raid1 using older tools (mdadm-1.x or
raidtools) then it will be included in the array without any resync
happening.

It has been submitted for 2.6.18.1.


Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/md.c |1 +
 1 file changed, 1 insertion(+)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c   2006-09-29 11:51:39.0 +1000
+++ ./drivers/md/md.c   2006-10-05 16:40:51.0 +1000
@@ -3849,6 +3849,7 @@ static int hot_add_disk(mddev_t * mddev,
}
clear_bit(In_sync, &rdev->flags);
rdev->desc_nr = -1;
+   rdev->saved_raid_disk = -1;
err = bind_rdev_to_array(rdev, mddev);
if (err)
goto abort_export;


converting RAID5 to RAID10

2006-10-05 Thread martin f krafft
I have a 1.5Tb RAID5 machine (3*750Gb disks + 1 spare) and need to
move some write-intensive services there. Unfortunately, the
performance is unacceptable. Thus, I wanted to convert the machine
to RAID10.

My theory was: backup, remove the spare, set one disk faulty, remove
it, create a degraded RAID10 on the two freed disks, copy data, kill
RAID5, add disks to new RAID10.

Unfortunately, mdadm (2.5.3) doesn't seem to agree; it complains
that it cannot assemble a RAID10 with 4 devices when I ask it to:

  mdadm --create -l 10 -n4 -pn2 /dev/md1 /dev/sd[cd] missing missing

I can kind of understand, but on the other hand I don't. After all,
if you'll allow me to think in terms of 1+0 instead of 10 for
a second, why doesn't mdadm just assemble /dev/sd[cd] as RAID0 and
make the couple one of the two components of the RAID1? What I mean
is: I could set up RAID1+0 that way; why doesn't it work for RAID10?
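
For concreteness, the nested 1+0 construction meant here could be built
by hand roughly like this (a sketch only; /dev/md2 is a hypothetical
intermediate device):

  mdadm --create /dev/md2 -l 0 -n 2 /dev/sdc /dev/sdd
  mdadm --create /dev/md1 -l 1 -n 2 /dev/md2 missing

That gives a RAID0 of the two freed disks acting as one half of a (for
now degraded) RAID1.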

Do you know of a way in which I could migrate the data to RAID10?
Unfortunately, I do not have more 750Gb disks available nor
a budget, and the 1.5Tb are 96% full.

Cheers,

-- 
martin;  (greetings from the heart of the sun.)
if a man treats life artistically, his brain is his heart.
-- oscar wilde




Re: question about raid5 parity calculations

2006-10-05 Thread Neil Brown
On Tuesday October 3, [EMAIL PROTECTED] wrote:
 
 Hello Neil, Ingo and [insert your name here],
 
 I try to understand the raid5 and md code and I have a question
 concerning the cache.
 
 There are two ways of calculating the parity: read-modify-write and
 reconstruct-write. In my understanding, the code only checks how many
 buffers it has to read for each method (rmw or rcw) without considering
 the cache. But what if there was relevant data in the cache? How would
 the raid code know it so it can build a decision on top of this
 knowledge?

Close.
It checks how many buffers it has to read for each method *with*
consideration of the cache.
Note the !test_bit(R5_UPTODATE, &dev->flags) tests in handle_stripe5
in the section where 'rcw' and 'rmw' are being calculated.
It only counts buffers that are not uptodate, i.e. those that do not
already exist in the stripe cache.
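
To make the trade-off concrete, here is a small stand-alone toy program
(not the kernel code, just an illustration with made-up per-block state)
that counts the reads each strategy would need, skipping blocks that are
already up to date in the cache:

#include <stdbool.h>
#include <stdio.h>

#define NDISKS     5   /* 4 data blocks + 1 parity block in one stripe */
#define PARITY_IDX 4   /* which slot holds parity (hypothetical)       */

int main(void)
{
    /* hypothetical state of one stripe: which blocks are being
     * written, and which are already up to date in the cache */
    bool towrite[NDISKS]  = { true,  false, false, false, false };
    bool uptodate[NDISKS] = { false, true,  false, false, true  };
    int rmw = 0, rcw = 0, i;

    for (i = 0; i < NDISKS; i++) {
        /* read-modify-write needs the old contents of each block
         * being written, plus the old parity */
        if ((towrite[i] || i == PARITY_IDX) && !uptodate[i])
            rmw++;
        /* reconstruct-write needs every data block that is not
         * being overwritten */
        if (!towrite[i] && i != PARITY_IDX && !uptodate[i])
            rcw++;
    }
    printf("reads needed: rmw=%d rcw=%d, so choose %s\n", rmw, rcw,
           rmw <= rcw ? "read-modify-write" : "reconstruct-write");
    return 0;
}

In the real handle_stripe5 the uptodate[] check corresponds to the
R5_UPTODATE bit on each stripe buffer, which is exactly how the cache
is taken into account.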

NeilBrown


Re: converting RAID5 to RAID10

2006-10-05 Thread Neil Brown
On Thursday October 5, [EMAIL PROTECTED] wrote:
 
 Unfortunately, mdadm (2.5.3) doesn't seem to agree; it complains
 that it cannot assemble a RAID10 with 4 devices when I ask it to:
 
   mdadm --create -l 10 -n4 -pn2 /dev/md1 /dev/sd[cd] missing missing
 

mdadm --create -l 10 -n 4 -pn2 /dev/md1 /dev/sdc missing /dev/sdd missing

Raid10 lays out data like
  A A B B
  C C D D
not
  A B A B
  C D C D
as you seem to expect.

So you could even do

mdadm --create -l 10 -n 4 -pn2 /dev/md1 missing /dev/sd[cd] missing

for slightly less typing.
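
Putting the whole migration together, a rough sketch of the sequence
(device names are purely illustrative: sdc is assumed to be the spare,
sdd the RAID5 member being failed out, and sda/sdb the two drives that
keep the degraded RAID5 alive until the copy is done; check everything
against a backup first):

  mdadm /dev/md0 --remove /dev/sdc                  # drop the spare
  mdadm /dev/md0 --fail /dev/sdd --remove /dev/sdd  # free one member
  mdadm --create /dev/md1 -l 10 -n 4 -pn2 /dev/sdc missing /dev/sdd missing
  # ... create a filesystem on /dev/md1 and copy the data across ...
  mdadm --stop /dev/md0                             # retire the degraded RAID5
  mdadm --zero-superblock /dev/sda /dev/sdb
  mdadm /dev/md1 --add /dev/sda /dev/sdb            # they rebuild into the missing slots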

There seems to be a bug in raid10 where it reports the wrong number of
working drives.  This is probably only in 2.6.18.  Patch is below.

NeilBrown

Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/raid10.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff .prev/drivers/md/raid10.c ./drivers/md/raid10.c
--- .prev/drivers/md/raid10.c   2006-09-29 11:44:36.0 +1000
+++ ./drivers/md/raid10.c   2006-10-05 20:10:07.0 +1000
@@ -2079,7 +2079,7 @@ static int run(mddev_t *mddev)
disk = conf->mirrors + i;

if (!disk->rdev ||
-   !test_bit(In_sync, &rdev->flags)) {
+   !test_bit(In_sync, &disk->rdev->flags)) {
disk->head_position = 0;
mddev->degraded++;
}


Re: converting RAID5 to RAID10

2006-10-05 Thread martin f krafft
also sprach Neil Brown [EMAIL PROTECTED] [2006.10.05.1214 +0200]:
 mdadm --create -l 10 -n 4 -pn2 /dev/md1 /dev/sdc missing /dev/sdd missing

Peter Samuelson of the Debian project already suggested this and it
seems to work.

Thanks a lot, Neil, for the quick and informative response.

-- 
martin;  (greetings from the heart of the sun.)
the ships hung in the sky in much the same way that bricks don't.
 -- hitchhiker's guide to the galaxy




Re: converting RAID5 to RAID10

2006-10-05 Thread Ask Bjørn Hansen


On Oct 5, 2006, at 3:15 AM, Jurriaan Kalkman wrote:


AFAIK, linux raid-10 is not exactly raid 1+0, it allows you to, for
example, use 3 disks.


I made a raid-10 device earlier today with 7 drives and I was
surprised to see that it reported using all of them.  I thought it'd
make one of them a spare (or complain about the odd number of drives).


How does that work?  (Or is it the "number of drives reported" bug
Neil referred to a moment ago?  I use FC6's version of 2.6.18.)



 - ask

--
http://www.askbjoernhansen.com/




Re: converting RAID5 to RAID10

2006-10-05 Thread Neil Brown
On Thursday October 5, [EMAIL PROTECTED] wrote:
 
 On Oct 5, 2006, at 3:15 AM, Jurriaan Kalkman wrote:
 
  AFAIK, linux raid-10 is not exactly raid 1+0, it allows you to, for
  example, use 3 disks.
 
 I made a raid-10 device earlier today with 7 drives and I was
 surprised to see that it reported using all of them.  I thought it'd
 make one of them a spare (or complain about the odd number of drives).
 
 How does that work?  (Or is it the "number of drives reported" bug
 Neil referred to a moment ago?  I use FC6's version of 2.6.18.)

If you wanted 6 drives and a spare you need to ask for that: -n6 -x1.
If you ask for 7 drives in a raid10, you get them.
The data is laid out thus:

   A A B B C C D
   D E E F F G G
   H H I I J J K
   K L L M M N N
(each column is a drive, each letter is a chunk of data).
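
If the goal had been 6 active drives plus a spare, the create command
would look something like this (illustrative device names):

  mdadm --create /dev/md1 -l 10 -n 6 -x 1 -pn2 /dev/sd[b-h]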

NeilBrown


Re: mdadm RAID5 Grow

2006-10-05 Thread mickg

Neil Brown wrote:

On Wednesday October 4, [EMAIL PROTECTED] wrote:
I have been trying to run: 
mdadm --grow /dev/md0 --raid-devices=6 --backup-file /backup_raid_grow

I get:
mdadm: Need to backup 1280K of critical section..
mdadm: /dev/md0: Cannot get array details from sysfs


It shouldn't do that 
Can you 
  strace -o /tmp/trace -s 300 mdadm --grow .


and send a copy of /tmp/trace.  I'd like to see how far it gets at
reading information from sysfs.


Would it need to be unmounted to work properly (It is currently mounted
under lvm)?


No.  unmounting isn't needed and won't make any difference.

NeilBrown

strace mdadm --grow /dev/md0 --raid-devices=6 --backup-file /backup_raid_grow


execve("/sbin/mdadm", ["mdadm", "--grow", "/dev/md0", "--raid-devices=6", "--backup-file", "/backup_raid_grow"], [/* 68 vars */]) = 0
brk(0)                                  = 0x8076000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY)      = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=107351, ...}) = 0
mmap2(NULL, 107351, PROT_READ, MAP_PRIVATE, 3, 0) = 0xa7fa8000
close(3)                                = 0
open("/lib/libc.so.6", O_RDONLY)        = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\300Y\1"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=1404242, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xa7fa7000
mmap2(NULL, 1176988, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xa7e87000
madvise(0xa7e87000, 1176988, MADV_SEQUENTIAL|0x1) = 0
mmap2(0xa7fa0000, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x118) = 0xa7fa0000
mmap2(0xa7fa4000, 9628, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xa7fa4000
close(3)                                = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xa7e86000
set_thread_area({entry_number:-1 -> 6, base_addr:0xa7e866b0, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}) = 0
mprotect(0xa7fa0000, 8192, PROT_READ)   = 0
munmap(0xa7fa8000, 107351)              = 0
time(NULL)                              = 1160052126
getpid()                                = 8461
brk(0)                                  = 0x8076000
brk(0x8097000)                          = 0x8097000
open("/etc/mdadm.conf", O_RDONLY)       = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=154, ...}) = 0
mmap2(NULL, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xa7e66000
read(3, "DEVICE partitions\nARRAY /dev/md0"..., 131072) = 154
read(3, "", 131072)                     = 0
read(3, "", 131072)                     = 0
close(3)                                = 0
munmap(0xa7e66000, 131072)              = 0
open("/dev/md0", O_RDWR)                = 3
fstat64(3, {st_mode=S_IFBLK|0640, st_rdev=makedev(9, 0), ...}) = 0
ioctl(3, 0x800c0910, 0xafc4a024)        = 0
ioctl(3, 0x80480911, 0xafc4a0a8)        = 0
fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 3), ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xa7fc2000
write(1, "mdadm: Need to backup 1280K of c"..., 50mdadm: Need to backup 1280K of critical section..
) = 50
fstat64(3, {st_mode=S_IFBLK|0640, st_rdev=makedev(9, 0), ...}) = 0
open("/sys/block/md0/md/component_size", O_RDONLY) = -1 ENOENT (No such file or directory)
write(2, "mdadm: /dev/md0: Cannot get arra"..., 53mdadm: /dev/md0: Cannot get array details from sysfs
) = 53
exit_group(1)                           = ?
Process 8461 detached




Re: mdadm RAID5 Grow

2006-10-05 Thread Neil Brown
On Thursday October 5, [EMAIL PROTECTED] wrote:
 Neil Brown wrote:
  On Wednesday October 4, [EMAIL PROTECTED] wrote:
  I have been trying to run: 
  mdadm --grow /dev/md0 --raid-devices=6 --backup-file /backup_raid_grow
  I get:
  mdadm: Need to backup 1280K of critical section..
  mdadm: /dev/md0: Cannot get array details from sysfs
  
  It shouldn't do that 
  Can you 
strace -o /tmp/trace -s 300 mdadm --grow .
...
 open("/sys/block/md0/md/component_size", O_RDONLY) = -1 ENOENT (No such file or directory)

So it couldn't open .../component_size.  That was added prior to the
release of 2.6.16, and you are running 2.6.17.13 so the kernel
certainly supports it.  
Most likely explanation is that /sys isn't mounted.
Do you have a /sys?
Is it mounted?
Can you ls -l /sys/block/md0/md ??

Maybe you need to
  mkdir /sys
  mount -t sysfs sysfs /sys

and try again.

NeilBrown


RAID10: near, far, offset -- which one?

2006-10-05 Thread martin f krafft
I am trying to compare the three RAID10 layouts with each other.
Assuming a simple 4 drive setup with 2 copies of each block,
I understand that a near layout makes RAID10 resemble RAID1+0
(although it's not 1+0).

I also understand that the far layout trades some write performance
for some read performance, so it's best for read-intensive
operations, like read-only file servers.

I don't really understand the offset layout. Am I right in
asserting that like near it keeps stripes together and thus
requires less seeking, but stores the blocks at different offsets
wrt the disks?

If A,B,C are data blocks, a,b their parts, and 1,2 denote their
copies, the following would be a classic RAID1+0 where disks 1,2 and
3,4 form RAID0 pairs combined into a RAID1:

  hdd1  Aa1 Ba1 Ca1
  hdd2  Ab1 Bb1 Cb1
  hdd3  Aa2 Ba2 Ca2
  hdd4  Ab2 Bb2 Cb2

How would this look with the three different layouts? I think near
is pretty much the same as above, but I can't figure out far and
offset from the md(4) manpage.
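
For experimenting, the three layouts are selected at create time with
the -p/--layout option; a sketch with illustrative devices (pick one):

  mdadm --create /dev/md1 -l 10 -n 4 -p n2 /dev/sd[abcd]   # near, 2 copies
  mdadm --create /dev/md1 -l 10 -n 4 -p f2 /dev/sd[abcd]   # far, 2 copies
  mdadm --create /dev/md1 -l 10 -n 4 -p o2 /dev/sd[abcd]   # offset, 2 copies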

Also, what are their respective advantages and disadvantages?

Thanks,

-- 
martin;  (greetings from the heart of the sun.)
a woman begins by resisting a man's advances and ends by blocking
 his retreat.
-- oscar wilde




Re: RAID10: near, far, offset -- which one?

2006-10-05 Thread Eli Stair


Taken for what it is, here's some recent experience I'm seeing (not the
precise explanation you're asking for, which I'd like to know too).


Layout : near=2, far=1
Chunk Size : 512K
gtmp01,16G,,,125798,22,86157,17,,,337603,34,765.3,2,16,240,1,+,+++,237,1,241,1,+,+++,239,1
gtmp01,16G,,,129137,21,87074,17,,,336256,34,751.7,1,16,239,1,+,+++,238,1,240,1,+,+++,238,1
gtmp01,16G,,,125458,22,86293,17,,,338146,34,755.8,1,16,240,1,+,+++,237,1,240,1,+,+++,237,1

Layout : near=1, offset=2
Chunk Size : 512K
gtmp02,16G,,,141278,25,98789,20,,,297263,29,767.5,2,16,240,1,+,+++,238,1,240,1,+,+++,238,1
gtmp02,16G,,,143068,25,98469,20,,,316138,31,793.6,1,16,239,1,+,+++,237,1,239,1,+,+++,238,0
gtmp02,16G,,,143236,24,99234,20,,,313824,32,782.1,1,16,240,1,+,+++,237,1,240,1,+,+++,238,1


Here I am testing with bonnie++ on a 14-drive RAID10 over dual-multipath
FC, with 10K RPM 146GB drives.  RAID5 nets approximately the same read
performance (sometimes higher), with single-thread writes limited to
100MB/sec and concurrent-thread R/W access in the pits (as expected for
RAID5).


mdadm 2.5.3
linux 2.6.18
xfs (mkfs.xfs -d su=512k,sw=3 -l logdev=/dev/sda1 -f /dev/md0)


Cheers,

/eli




martin f krafft wrote:

I am trying to compare the three RAID10 layouts with each other.
Assuming a simple 4 drive setup with 2 copies of each block,
I understand that a near layout makes RAID10 resemble RAID1+0
(although it's not 1+0).

I also understand that the far layout trades some write performance
for some read performance, so it's best for read-intensive
operations, like read-only file servers.

I don't really understand the offset layout. Am I right in
asserting that like near it keeps stripes together and thus
requires less seeking, but stores the blocks at different offsets
wrt the disks?

If A,B,C are data blocks, a,b their parts, and 1,2 denote their
copies, the following would be a classic RAID1+0 where disks 1,2 and
3,4 form RAID0 pairs combined into a RAID1:

  hdd1  Aa1 Ba1 Ca1
  hdd2  Ab1 Bb1 Cb1
  hdd3  Aa2 Ba2 Ca2
  hdd4  Ab2 Bb2 Cb2

How would this look with the three different layouts? I think near
is pretty much the same as above, but I can't figure out far and
offset from the md(4) manpage.

Also, what are their respective advantages and disadvantages?

Thanks,





Re: [PATCH] md: Fix bug where new drives added to an md array sometimes don't sync properly.

2006-10-05 Thread Eli Stair



I'm actually seeing similar behaviour on RAID10 (2.6.18), where, after
removing a drive from an array, re-adding it sometimes results in it
still being listed as a faulty spare and not being taken for resync.
In the same scenario, after swapping drives, doing a fail then a remove
then an 'add' doesn't work; only a re-add will even get the drive
listed by mdadm.



What failure mode/symptoms is this patch resolving?

Is it possible this affects the RAID10 module/mode as well?  If not, 
I'll start a new thread for that.  I'm testing this patch to see if it 
does remedy the situation on RAID10, and will update after some 
significant testing.
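
For reference, the sequence being described is roughly the following
(placeholder device names; after a healthy add, /proc/mdstat should show
a recovery running and --detail should list the disk as rebuilding
rather than as a faulty spare):

  mdadm /dev/md0 --fail /dev/sdg
  mdadm /dev/md0 --remove /dev/sdg
  mdadm /dev/md0 --add /dev/sdg      # or --re-add after putting the original drive back
  cat /proc/mdstat
  mdadm --detail /dev/md0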



/eli








NeilBrown wrote:

There is a nasty bug in md in 2.6.18 affecting at least raid1.
This fixes it (and has already been sent to [EMAIL PROTECTED]).

### Comments for Changeset

This fixes a bug introduced in 2.6.18.

If a drive is added to a raid1 using older tools (mdadm-1.x or
raidtools) then it will be included in the array without any resync
happening.

It has been submitted for 2.6.18.1.


Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/md.c |1 +
 1 file changed, 1 insertion(+)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c   2006-09-29 11:51:39.0 +1000
+++ ./drivers/md/md.c   2006-10-05 16:40:51.0 +1000
@@ -3849,6 +3849,7 @@ static int hot_add_disk(mddev_t * mddev,
}
clear_bit(In_sync, &rdev->flags);
rdev->desc_nr = -1;
+   rdev->saved_raid_disk = -1;
err = bind_rdev_to_array(rdev, mddev);
if (err)
goto abort_export;





Re: mdadm RAID5 Grow

2006-10-05 Thread mickg

Neil Brown wrote:

On Thursday October 5, [EMAIL PROTECTED] wrote:

Neil Brown wrote:

On Wednesday October 4, [EMAIL PROTECTED] wrote:
I have been trying to run: 
mdadm --grow /dev/md0 --raid-devices=6 --backup-file /backup_raid_grow

I get:
mdadm: Need to backup 1280K of critical section..
mdadm: /dev/md0: Cannot get array details from sysfs
It shouldn't do that 
Can you 
  strace -o /tmp/trace -s 300 mdadm --grow .

...

open("/sys/block/md0/md/component_size", O_RDONLY) = -1 ENOENT (No such file or directory)


So it couldn't open .../component_size.  That was added prior to the
release of 2.6.16, and you are running 2.6.17.13 so the kernel
certainly supports it.  
Most likely explanation is that /sys isn't mounted.

Do you have a /sys?
Is it mounted?
Can you ls -l /sys/block/md0/md ??

Maybe you need to
  mkdir /sys
  mount -t sysfs sysfs /sys

and try again.


Worked like a charm!

Thank you!

There is a
  sysfs   /sys    sysfs   noauto 0 0
line in /etc/fstab.
I am assuming noauto is the culprit?

Should it be made to automount?

mickg



