[PATCH] md: Fix bug where new drives added to an md array sometimes don't sync properly.
There is a nasty bug in md in 2.6.18 affecting at least raid1. This patch fixes it (and has already been sent to [EMAIL PROTECTED]).

### Comments for Changeset

This fixes a bug introduced in 2.6.18. If a drive is added to a raid1 using older tools (mdadm-1.x or raidtools) then it will be included in the array without any resync happening. It has been submitted for 2.6.18.1.

Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/md.c |    1 +
 1 file changed, 1 insertion(+)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c	2006-09-29 11:51:39.0 +1000
+++ ./drivers/md/md.c	2006-10-05 16:40:51.0 +1000
@@ -3849,6 +3849,7 @@ static int hot_add_disk(mddev_t * mddev,
 	}
 	clear_bit(In_sync, &rdev->flags);
 	rdev->desc_nr = -1;
+	rdev->saved_raid_disk = -1;
 	err = bind_rdev_to_array(rdev, mddev);
 	if (err)
 		goto abort_export;
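For context, a rough sketch of the semantics involved. This is an illustration under my reading of the changelog, not the kernel's actual control flow: md remembers in rdev->saved_raid_disk the slot a device previously occupied so that a re-add can avoid a full resync; -1 means "no usable history". The legacy hot-add path left a stale value here, so the drive was taken back as if its data were still valid.

#include <stdio.h>

/* Toy illustration (assumption: simplified, not the kernel's code).
 * A device re-added to the slot recorded in saved_raid_disk may be
 * recovered via the write-intent bitmap, or not at all; -1 forces a
 * full recovery, which is what the one-line fix above restores for
 * drives added through the old ioctl path. */
static const char *recovery_mode(int saved_raid_disk, int target_slot)
{
    if (saved_raid_disk == target_slot)
        return "partial (bitmap-based) resync or none";
    return "full recovery";
}

int main(void)
{
    printf("stale value 2 -> slot 2: %s\n", recovery_mode(2, 2));
    printf("after fix, -1 -> slot 2: %s\n", recovery_mode(-1, 2));
    return 0;
}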
converting RAID5 to RAID10
I have a 1.5TB RAID5 machine (3*750GB disks + 1 spare) and need to move some write-intensive services there. Unfortunately, the performance is unacceptable, so I want to convert the machine to RAID10. My plan was: back up, remove the spare, set one disk faulty, remove it, create a degraded RAID10 on the two freed disks, copy the data over, kill the RAID5, and add its disks to the new RAID10.

Unfortunately, mdadm (2.5.3) doesn't seem to agree; it complains that it cannot assemble a RAID10 with 4 devices when I ask it to:

  mdadm --create -l 10 -n4 -pn2 /dev/md1 /dev/sd[cd] missing missing

I can kind of understand, but on the other hand I don't. After all, if you'll allow me to think in terms of 1+0 instead of 10 for a second: why doesn't mdadm just assemble /dev/sd[cd] as a RAID0 and make that pair one of the two components of the RAID1? I could set up RAID1+0 that way; why doesn't it work for RAID10?

Do you know of a way in which I could migrate the data to RAID10? Unfortunately, I have neither more 750GB disks available nor a budget, and the 1.5TB is 96% full.

Cheers,

-- martin;  (greetings from the heart of the sun.)
Re: question about raid5 parity calculations
On Tuesday October 3, [EMAIL PROTECTED] wrote:
> Hello Neil, Ingo and [insert your name here],
>
> I am trying to understand the raid5 and md code and I have a question concerning the cache. There are two ways of calculating the parity: read-modify-write and reconstruct-write. In my understanding, the code only checks how many buffers it has to read for each method (rmw or rcw) without considering the cache. But what if there was relevant data in the cache? How would the raid code know, so that it can base its decision on that knowledge?

Close. It checks how many buffers it has to read for each method *with* consideration of the cache.

Note the !test_bit(R5_UPTODATE, &dev->flags) tests in handle_stripe5, in the section where 'rcw' and 'rmw' are being calculated. It only counts buffers that are not uptodate, i.e. those that do not already exist in the stripe cache.

NeilBrown
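For readers following along in the source, here is a toy model of that costing step. It is a sketch simplified from the 2.6.18 handle_stripe5 logic; the real code also takes R5_LOCKED and R5_OVERWRITE into account, so treat the details as approximate:

#include <stdbool.h>
#include <stdio.h>

struct dev {
    bool towrite;   /* new data pending for this block */
    bool uptodate;  /* block already valid in the stripe cache */
};

/* Count the reads each parity strategy would need, skipping any
 * block that is already uptodate in the stripe cache. */
static void count_reads(const struct dev *devs, int disks, int pd_idx)
{
    int rmw = 0, rcw = 0;
    for (int i = 0; i < disks; i++) {
        /* rmw: must read old contents of each written block,
         * plus the old parity block, unless already cached */
        if ((devs[i].towrite || i == pd_idx) && !devs[i].uptodate)
            rmw++;
        /* rcw: must read every block NOT being written
         * (parity is recomputed from scratch), unless cached */
        if (!devs[i].towrite && i != pd_idx && !devs[i].uptodate)
            rcw++;
    }
    printf("reads needed: rmw=%d rcw=%d -> choose %s\n",
           rmw, rcw,
           rmw < rcw ? "read-modify-write" : "reconstruct-write");
}

int main(void)
{
    /* 5-disk stripe, parity on disk 4, writing disk 0 only,
     * with disk 1 already cached (uptodate). */
    struct dev devs[5] = {
        { .towrite = true },    /* disk 0: being written */
        { .uptodate = true },   /* disk 1: in the cache */
        { 0 }, { 0 },           /* disks 2, 3 */
        { 0 },                  /* disk 4: parity */
    };
    count_reads(devs, 5, 4);
    return 0;
}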
Re: converting RAID5 to RAID10
On Thursday October 5, [EMAIL PROTECTED] wrote:
> Unfortunately, mdadm (2.5.3) doesn't seem to agree; it complains that it cannot assemble a RAID10 with 4 devices when I ask it to:
>
>   mdadm --create -l 10 -n4 -pn2 /dev/md1 /dev/sd[cd] missing missing

Try:

  mdadm --create -l 10 -n 4 -pn2 /dev/md1 /dev/sdc missing /dev/sdd missing

Raid10 lays out data like

  A A B B
  C C D D

not

  A B A B
  C D C D

as you seem to expect. So you could even do

  mdadm --create -l 10 -n 4 -pn2 /dev/md1 missing /dev/sd[cd] missing

for slightly less typing.

There also seems to be a bug in raid10 where it reports the wrong number of working drives. This is probably only in 2.6.18. Patch is below.

NeilBrown

Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/raid10.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff .prev/drivers/md/raid10.c ./drivers/md/raid10.c
--- .prev/drivers/md/raid10.c	2006-09-29 11:44:36.0 +1000
+++ ./drivers/md/raid10.c	2006-10-05 20:10:07.0 +1000
@@ -2079,7 +2079,7 @@ static int run(mddev_t *mddev)
 		disk = conf->mirrors + i;
 
 		if (!disk->rdev ||
-		    !test_bit(In_sync, &rdev->flags)) {
+		    !test_bit(In_sync, &disk->rdev->flags)) {
 			disk->head_position = 0;
 			mddev->degraded++;
 		}
Re: converting RAID5 to RAID10
also sprach Neil Brown [EMAIL PROTECTED] [2006.10.05.1214 +0200]:
> mdadm --create -l 10 -n 4 -pn2 /dev/md1 /dev/sdc missing /dev/sdd missing

Peter Samuelson of the Debian project already suggested this and it seems to work. Thanks a lot, Neil, for the quick and informative response.

-- martin;  (greetings from the heart of the sun.)
Re: converting RAID5 to RAID10
On Oct 5, 2006, at 3:15 AM, Jurriaan Kalkman wrote:
> AFAIK, linux raid-10 is not exactly raid 1+0; it allows you to, for example, use 3 disks.

I made a raid-10 device earlier today with 7 drives and I was surprised to see that it reported using all of them. I thought it'd make one of them a spare (or complain about the odd number of drives). How does that work? (Or is it the "wrong number of working drives" bug Neil referred to a moment ago? I use FC6's version of 2.6.18.)

- ask

--
http://www.askbjoernhansen.com/
Re: converting RAID5 to RAID10
On Thursday October 5, [EMAIL PROTECTED] wrote:
> I made a raid-10 device earlier today with 7 drives and I was surprised to see that it reported using all of them. I thought it'd make one of them a spare (or complain about the odd number of drives). How does that work?

If you wanted 6 drives and a spare, you need to ask for that: -n6 -x1. If you ask for 7 drives in a raid10, you get them. The data is laid out thus:

  A A B B C C D
  D E E F F G G
  H H I I J J K
  K L L M M N N

(each column is a drive, each letter is a chunk of data).

NeilBrown
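To make the wrap-around concrete, here is a small standalone sketch of the 'near' layout arithmetic. It is an illustration only; the kernel's real mapping lives in raid10_find_phys() in drivers/md/raid10.c and also handles the 'far' and 'offset' layouts:

#include <stdio.h>

/* Illustrative sketch of the RAID10 "near" layout: copy c of data
 * chunk k sits at overall position k*near_copies + c, and positions
 * fill the drives row by row, wrapping at raid_disks. */
static void near_layout(long k, int near_copies, int raid_disks)
{
    for (int c = 0; c < near_copies; c++) {
        long pos = k * near_copies + c;
        printf("chunk %ld copy %d -> disk %ld, stripe %ld\n",
               k, c, pos % raid_disks, pos / raid_disks);
    }
}

int main(void)
{
    for (long k = 0; k < 4; k++)   /* chunks A..D on 7 drives */
        near_layout(k, 2, 7);
    return 0;
}

Running it reproduces the 7-drive diagram above: the second copy of chunk D wraps around to disk 0 on the next stripe, which is why an odd number of drives needs no spare.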
Re: mdadm RAID5 Grow
Neil Brown wrote:
> On Wednesday October 4, [EMAIL PROTECTED] wrote:
>> I have been trying to run:
>>   mdadm --grow /dev/md0 --raid-devices=6 --backup-file /backup_raid_grow
>> I get:
>>   mdadm: Need to backup 1280K of critical section..
>>   mdadm: /dev/md0: Cannot get array details from sysfs
>
> It shouldn't do that. Can you
>   strace -o /tmp/trace -s 300 mdadm --grow .
> and send a copy of /tmp/trace? I'd like to see how far it gets at reading information from sysfs.
>
>> Would it need to be unmounted to work properly (it is currently mounted under lvm)?
>
> No. Unmounting isn't needed and won't make any difference.
>
> NeilBrown

strace mdadm --grow /dev/md0 --raid-devices=6 --backup-file /backup_raid_grow

execve("/sbin/mdadm", ["mdadm", "--grow", "/dev/md0", "--raid-devices=6", "--backup-file", "/backup_raid_grow"], [/* 68 vars */]) = 0
brk(0)                                  = 0x8076000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY)      = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=107351, ...}) = 0
mmap2(NULL, 107351, PROT_READ, MAP_PRIVATE, 3, 0) = 0xa7fa8000
close(3)                                = 0
open("/lib/libc.so.6", O_RDONLY)        = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\300Y\1"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=1404242, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xa7fa7000
mmap2(NULL, 1176988, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xa7e87000
madvise(0xa7e87000, 1176988, MADV_SEQUENTIAL|0x1) = 0
mmap2(0xa7fa0000, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x118) = 0xa7fa0000
mmap2(0xa7fa4000, 9628, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xa7fa4000
close(3)                                = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xa7e86000
set_thread_area({entry_number:-1 -> 6, base_addr:0xa7e866b0, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}) = 0
mprotect(0xa7fa0000, 8192, PROT_READ)   = 0
munmap(0xa7fa8000, 107351)              = 0
time(NULL)                              = 1160052126
getpid()                                = 8461
brk(0)                                  = 0x8076000
brk(0x8097000)                          = 0x8097000
open("/etc/mdadm.conf", O_RDONLY)       = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=154, ...}) = 0
mmap2(NULL, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xa7e66000
read(3, "DEVICE partitions\nARRAY /dev/md0"..., 131072) = 154
read(3, "", 131072)                     = 0
read(3, "", 131072)                     = 0
close(3)                                = 0
munmap(0xa7e66000, 131072)              = 0
open("/dev/md0", O_RDWR)                = 3
fstat64(3, {st_mode=S_IFBLK|0640, st_rdev=makedev(9, 0), ...}) = 0
ioctl(3, 0x800c0910, 0xafc4a024)        = 0
ioctl(3, 0x80480911, 0xafc4a0a8)        = 0
fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 3), ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xa7fc2000
write(1, "mdadm: Need to backup 1280K of c"..., 50mdadm: Need to backup 1280K of critical section..
) = 50
fstat64(3, {st_mode=S_IFBLK|0640, st_rdev=makedev(9, 0), ...}) = 0
open("/sys/block/md0/md/component_size", O_RDONLY) = -1 ENOENT (No such file or directory)
write(2, "mdadm: /dev/md0: Cannot get arra"..., 53mdadm: /dev/md0: Cannot get array details from sysfs
) = 53
exit_group(1)                           = ?
Process 8461 detached
Re: mdadm RAID5 Grow
On Thursday October 5, [EMAIL PROTECTED] wrote:
> Neil Brown wrote:
>> It shouldn't do that. Can you
>>   strace -o /tmp/trace -s 300 mdadm --grow .
> ...
> open("/sys/block/md0/md/component_size", O_RDONLY) = -1 ENOENT (No such file or directory)

So it couldn't open .../component_size. That attribute was added prior to the release of 2.6.16, and you are running 2.6.17.13, so the kernel certainly supports it.

The most likely explanation is that /sys isn't mounted. Do you have a /sys? Is it mounted? Can you ls -l /sys/block/md0/md?

Maybe you need to

  mkdir /sys
  mount -t sysfs sysfs /sys

and try again.

NeilBrown
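As an aside, the failing sysfs probe is easy to reproduce by hand. Here is a minimal sketch of the kind of read mdadm attempts, illustrative only (this is not mdadm's actual code; per Documentation/md.txt the value is in 1K blocks):

#include <stdio.h>

int main(void)
{
    unsigned long long kib;
    FILE *f = fopen("/sys/block/md0/md/component_size", "r");

    /* If sysfs isn't mounted on /sys, this open fails with ENOENT,
     * exactly the failure visible in the strace above. */
    if (!f) {
        perror("open component_size (is sysfs mounted on /sys?)");
        return 1;
    }
    if (fscanf(f, "%llu", &kib) == 1)
        printf("component size: %llu KiB\n", kib);
    fclose(f);
    return 0;
}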
RAID10: near, far, offset -- which one?
I am trying to compare the three RAID10 layouts with each other. Assuming a simple 4-drive setup with 2 copies of each block, I understand that a near layout makes RAID10 resemble RAID1+0 (although it's not 1+0). I also understand that the far layout trades some write performance for some read performance, so it's best for read-intensive operations, like read-only file servers.

I don't really understand the offset layout. Am I right in asserting that, like near, it keeps stripes together and thus requires less seeking, but stores the copies at different offsets with respect to the disks?

If A,B,C are data blocks, a,b their parts, and 1,2 denote their copies, the following would be a classic RAID1+0 where disks 1,2 and 3,4 are RAID0 pairs combined into a RAID1:

  hdd1  Aa1 Ba1 Ca1
  hdd2  Ab1 Bb1 Cb1
  hdd3  Aa2 Ba2 Ca2
  hdd4  Ab2 Bb2 Cb2

How would this look with the three different layouts? I think near is pretty much the same as above, but I can't figure out far and offset from the md(4) manpage. Also, what are their respective advantages and disadvantages?

Thanks,

-- martin;  (greetings from the heart of the sun.)
Re: RAID10: near, far, offset -- which one?
Take this for what it is: some recent experience I'm seeing (not the precise explanation you're asking for, which I'd like to know too).

Layout : near=2, far=1
Chunk Size : 512K

gtmp01,16G,,,125798,22,86157,17,,,337603,34,765.3,2,16,240,1,+++++,+++,237,1,241,1,+++++,+++,239,1
gtmp01,16G,,,129137,21,87074,17,,,336256,34,751.7,1,16,239,1,+++++,+++,238,1,240,1,+++++,+++,238,1
gtmp01,16G,,,125458,22,86293,17,,,338146,34,755.8,1,16,240,1,+++++,+++,237,1,240,1,+++++,+++,237,1

Layout : near=1, offset=2
Chunk Size : 512K

gtmp02,16G,,,141278,25,98789,20,,,297263,29,767.5,2,16,240,1,+++++,+++,238,1,240,1,+++++,+++,238,1
gtmp02,16G,,,143068,25,98469,20,,,316138,31,793.6,1,16,239,1,+++++,+++,237,1,239,1,+++++,+++,238,0
gtmp02,16G,,,143236,24,99234,20,,,313824,32,782.1,1,16,240,1,+++++,+++,237,1,240,1,+++++,+++,238,1

This is bonnie++ on a 14-drive RAID10 over dual-multipath FC, with 10K-rpm 146GB drives. RAID5 nets approximately the same read performance (sometimes higher), with single-thread writes limited to 100MB/sec, and concurrent-thread R/W access in the pits (as expected for RAID5).

mdadm 2.5.3, linux 2.6.18, xfs (mkfs.xfs -d su=512k,sw=3 -l logdev=/dev/sda1 -f /dev/md0)

Cheers,
/eli

martin f krafft wrote:
> I am trying to compare the three RAID10 layouts with each other. [...]
> How would this look with the three different layouts? I think near is pretty much the same as above, but I can't figure out far and offset from the md(4) manpage. Also, what are their respective advantages and disadvantages?
Re: [PATCH] md: Fix bug where new drives added to an md array sometimes don't sync properly.
I'm actually seeing similar behaviour on RAID10 (2.6.18): after removing a drive from an array, re-adding it sometimes results in it still being listed as a faulty-spare and not being taken for resync. In the same scenario, after swapping drives, doing a fail, remove, then an 'add' doesn't work; only a re-add will even get the drive listed by mdadm.

What are the failure mode and symptoms that this patch resolves? Is it possible this affects the RAID10 module/mode as well? If not, I'll start a new thread for that. I'm testing this patch to see if it remedies the situation on RAID10, and will update after some significant testing.

/eli

NeilBrown wrote:
> There is a nasty bug in md in 2.6.18 affecting at least raid1. This patch fixes it (and has already been sent to [EMAIL PROTECTED]).
>
> This fixes a bug introduced in 2.6.18. If a drive is added to a raid1 using older tools (mdadm-1.x or raidtools) then it will be included in the array without any resync happening. It has been submitted for 2.6.18.1.
>
> [patch quoted above in full]
Re: mdadm RAID5 Grow
Neil Brown wrote:
> The most likely explanation is that /sys isn't mounted. Do you have a /sys? Is it mounted? Can you ls -l /sys/block/md0/md?
>
> Maybe you need to
>   mkdir /sys
>   mount -t sysfs sysfs /sys
> and try again.

Worked like a charm! Thank you!

There is a

  sysfs  /sys  sysfs  noauto  0 0

line in /etc/fstab. I am assuming noauto is the culprit? Should it be made to automount?

mickg
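The thread doesn't record an answer to that last question, but for what it's worth: noauto tells mount -a, and hence the usual boot scripts, to skip the entry, so sysfs was never mounted at boot. An fstab line along these lines (the exact options are an assumption; "defaults" is typical) should make it mount automatically:

  sysfs  /sys  sysfs  defaults  0 0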