Re: Bad drive discovered during raid5 reshape
On Tuesday October 30, [EMAIL PROTECTED] wrote:
> Neil Brown wrote:
> > On Monday October 29, [EMAIL PROTECTED] wrote:
> > > Hi, I bought two new hard drives to expand my raid array today and
> > > unfortunately one of them appears to be bad. The problem didn't arise
> >
> > Looks like you are in real trouble. Both the drives seem bad in some
> > way. If it was just sdc that was failing it would have been picked up
> > after the -Af, but when it tried, sdb gave errors.
>
> Humble enquiry :) I'm not sure that's right? He *removed* sdb and sdc
> when the failure occurred, so sdc would indeed be non-fresh.

I'm not sure what point you are making here. In any case, removing two
drives from a raid5 is always a bad thing. Part of the array was striped
over 8 drives by this time. With only six still in the array, some data
will be missing.

> The key question I think is: will md continue to grow an array even if
> it enters degraded mode during the grow? ie grow from a 6 drive array
> to a 7-of-8 degraded array? Technically I guess it should be able to.

Yes, md can grow to a degraded array. If you get a single failure I would
expect it to abort the growth process, then restart where it left off
(after checking that that made sense).

> In which case should he be able to re-add /dev/sdc and allow md to
> retry the grow? (possibly losing some data due to the sdc staleness)

He only needs one of the two drives in there. I got the impression that
both sdc and sdb had reported errors. If not, and sdc really seems OK,
then --assemble --force listing all drives except sdb should make it all
work again.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
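[Editor's note] Neil's closing suggestion ("--assemble --force listing all drives except sdb") would look something like the sketch below. The device names are taken from the --detail listing elsewhere in this thread; since sdc's health is exactly what is in doubt, this is written as a dry run that only prints the command for review rather than executing it.

```shell
# Dry-run sketch: force-assemble md1 from every member except the failed
# sdb1. Device names follow the --detail output in this thread; whether
# sdc1 is actually usable is an assumption, so only echo the command here.
members="/dev/hdb1 /dev/sdd1 /dev/sdg1 /dev/sde1 /dev/sdf1 /dev/sda1 /dev/sdc1"
cmd="mdadm --assemble --force /dev/md1 $members"
echo "$cmd"   # inspect, then run by hand only if sdc1 really is OK
```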
Re: Bad drive discovered during raid5 reshape
David Greaves wrote:
> I read that he aborted it, then removed both drives before giving md a
> chance to restart. He said:
> > After several minutes dmesg indicated that mdadm gave up and the grow
> > process stopped. After googling around I tried the solutions that
> > seemed most likely to work, including removing the new drives with
> >   mdadm --remove --force /dev/md1 /dev/sd[bc]1
> > and rebooting
> and *then* he:
> > ran mdadm -Af /dev/md1.

This is correct. I first removed sdb and sdc, then rebooted and ran
mdadm -Af /dev/md1.

> Kyle - I think you need to clarify this as it may not be too bad.
> Apologies if I misread something and sdc is bad too :)
> It may be an idea to let us (Neil) know what you've done and if you've
> done any writes to any devices before trying this assemble.
> David

When I sent the first email I thought only sdb had failed. After digging
into the log files, it appears sdc also reported several bad blocks during
the grow. This is what I get for not testing cheap refurbed drives before
trusting them with my data, but hindsight is 20/20. Fortunately all of the
important data is backed up, so if I can't recover anything using Neil's
suggestions it's not a total loss. Thank you both for the help.
Bad drive discovered during raid5 reshape
Hi,

I bought two new hard drives to expand my raid array today and
unfortunately one of them appears to be bad. The problem didn't arise
until after I attempted to grow the raid array.

I was trying to expand the array from 6 to 8 drives. I added both drives
using

  mdadm --add /dev/md1 /dev/sdb1

which completed, then

  mdadm --add /dev/md1 /dev/sdc1

which also completed. I then ran

  mdadm --grow /dev/md1 --raid-devices=8

It passed the critical section, then began the grow process. After a few
minutes I started to hear unusual sounds from within the case. Fearing the
worst, I tried to cat /proc/mdstat, which resulted in no output, so I
checked dmesg, which showed that /dev/sdb1 was not working correctly.
After several minutes dmesg indicated that mdadm gave up and the grow
process stopped.

After googling around I tried the solutions that seemed most likely to
work, including removing the new drives with

  mdadm --remove --force /dev/md1 /dev/sd[bc]1

and rebooting, after which I ran

  mdadm -Af /dev/md1

The grow process restarted, then failed almost immediately. Trying to
mount the drive gives me a reiserfs replay failure and suggests running
fsck. I don't dare fsck the array since I've already messed it up so
badly. Is there any way to go back to the original working 6-disc
configuration with minimal data loss?

Here's where I'm at right now; please let me know if I need to include any
additional information.
# uname -a
Linux nas 2.6.22-gentoo-r5 #1 SMP Thu Aug 23 16:59:47 MDT 2007 x86_64 AMD Athlon(tm) 64 Processor 3500+ AuthenticAMD GNU/Linux

# mdadm --version
mdadm - v2.6.2 - 21st May 2007

# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md1 : active raid5 hdb1[0] sdb1[8](F) sda1[5] sdf1[4] sde1[3] sdg1[2] sdd1[1]
      1220979520 blocks super 0.91 level 5, 64k chunk, algorithm 2 [8/6] [UUUUUU__]

unused devices: <none>

# mdadm --detail --verbose /dev/md1
/dev/md1:
        Version : 00.91.03
  Creation Time : Sun Apr  8 19:48:01 2007
     Raid Level : raid5
     Array Size : 1220979520 (1164.42 GiB 1250.28 GB)
  Used Dev Size : 244195904 (232.88 GiB 250.06 GB)
   Raid Devices : 8
  Total Devices : 7
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Mon Oct 29 00:53:21 2007
          State : clean, degraded
 Active Devices : 6
Working Devices : 6
 Failed Devices : 1
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

  Delta Devices : 2, (6->8)

           UUID : 56e7724e:9a5d0949:ff52889f:ac229049
         Events : 0.487460

    Number   Major   Minor   RaidDevice State
       0       3       65        0      active sync   /dev/hdb1
       1       8       49        1      active sync   /dev/sdd1
       2       8       97        2      active sync   /dev/sdg1
       3       8       65        3      active sync   /dev/sde1
       4       8       81        4      active sync   /dev/sdf1
       5       8        1        5      active sync   /dev/sda1
       6       0        0        6      removed
       8       8       17        7      faulty spare rebuilding   /dev/sdb1

# dmesg
<snip>
md: md1 stopped.
md: unbind<hdb1>
md: export_rdev(hdb1)
md: unbind<sdc1>
md: export_rdev(sdc1)
md: unbind<sdb1>
md: export_rdev(sdb1)
md: unbind<sda1>
md: export_rdev(sda1)
md: unbind<sdf1>
md: export_rdev(sdf1)
md: unbind<sde1>
md: export_rdev(sde1)
md: unbind<sdg1>
md: export_rdev(sdg1)
md: unbind<sdd1>
md: export_rdev(sdd1)
md: bind<sdd1>
md: bind<sdg1>
md: bind<sde1>
md: bind<sdf1>
md: bind<sda1>
md: bind<sdb1>
md: bind<sdc1>
md: bind<hdb1>
md: md1 stopped.
md: unbind<hdb1>
md: export_rdev(hdb1)
md: unbind<sdc1>
md: export_rdev(sdc1)
md: unbind<sdb1>
md: export_rdev(sdb1)
md: unbind<sda1>
md: export_rdev(sda1)
md: unbind<sdf1>
md: export_rdev(sdf1)
md: unbind<sde1>
md: export_rdev(sde1)
md: unbind<sdg1>
md: export_rdev(sdg1)
md: unbind<sdd1>
md: export_rdev(sdd1)
md: bind<sdd1>
md: bind<sdg1>
md: bind<sde1>
md: bind<sdf1>
md: bind<sda1>
md: bind<sdb1>
md: bind<sdc1>
md: bind<hdb1>
md: kicking non-fresh sdc1 from array!
md: unbind<sdc1>
md: export_rdev(sdc1)
raid5: reshape will continue
raid5: device hdb1 operational as raid disk 0
raid5: device sdb1 operational as raid disk 7
raid5: device sda1 operational as raid disk 5
raid5: device sdf1 operational as raid disk 4
raid5: device sde1 operational as raid disk 3
raid5: device sdg1 operational as raid disk 2
raid5: device sdd1 operational as raid disk 1
raid5: allocated 8462kB for md1
raid5: raid level 5 set md1 active with 7 out of 8 devices, algorithm 2
RAID5 conf printout:
 --- rd:8 wd:7
 disk 0, o:1, dev:hdb1
 disk 1, o:1, dev:sdd1
 disk 2, o:1, dev:sdg1
 disk 3, o:1, dev:sde1
 disk 4, o:1, dev:sdf1
 disk 5, o:1, dev:sda1
 disk 7, o:1, dev:sdb1
...ok start reshape thread
md: reshape of RAID array md1
md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 20 KB/sec) for reshape.
md: using 128k window, over a total of 244195904 blocks.
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata2.00: cmd
Re: Bad drive discovered during raid5 reshape
On Monday October 29, [EMAIL PROTECTED] wrote:
> Hi, I bought two new hard drives to expand my raid array today and
> unfortunately one of them appears to be bad. The problem didn't arise
> until after I attempted to grow the raid array. I was trying to expand
> the array from 6 to 8 drives. I added both drives using
>   mdadm --add /dev/md1 /dev/sdb1
> which completed, then
>   mdadm --add /dev/md1 /dev/sdc1
> which also completed. I then ran
>   mdadm --grow /dev/md1 --raid-devices=8
> It passed the critical section, then began the grow process. After a
> few minutes I started to hear unusual sounds from within the case.
> Fearing the worst, I tried to cat /proc/mdstat, which resulted in no
> output, so I checked dmesg, which showed that /dev/sdb1 was not working
> correctly. After several minutes dmesg indicated that mdadm gave up and
> the grow process stopped. After googling around I tried the solutions
> that seemed most likely to work, including removing the new drives with
>   mdadm --remove --force /dev/md1 /dev/sd[bc]1
> and rebooting, after which I ran
>   mdadm -Af /dev/md1
> The grow process restarted, then failed almost immediately. Trying to
> mount the drive gives me a reiserfs replay failure and suggests running
> fsck. I don't dare fsck the array since I've already messed it up so
> badly. Is there any way to go back to the original working 6-disc
> configuration with minimal data loss? Here's where I'm at right now;
> please let me know if I need to include any additional information.

Looks like you are in real trouble. Both the drives seem bad in some way.
If it was just sdc that was failing, it would have been picked up after
the -Af, but when it tried, sdb gave errors. Having two failed devices in
a RAID5 is not good!

Your best bet goes like this: the reshape has started and got up to some
point. The data before that point is spread over 8 drives; the data after
it is still over 6. We need to restripe the 8-drive data back to 6 drives.
This can be done with the test_stripe tool that can be built from the
mdadm source.
1/ Find out how far the reshape progressed, by using "mdadm -E" on one of
   the devices.

2/ Use something like

     test_stripe save /some/file 8 $chunksize 5 2 0 $length /dev/..

   If you get all the args right, this should copy the data from the
   array into /some/file. You could possibly do the same thing by
   assembling the array read-only (set
   /sys/modules/md_mod/parameters/start_ro to 1) and using 'dd' from the
   array. It might be worth doing both and checking you get the same
   result.

3/ Use something like

     test_stripe restore /some/file 6 ..

   to restore the data to just 6 devices.

4/ Use "mdadm -C" to create the array anew on the 6 devices. Make sure
   the order, the chunksize etc. are preserved. Once you have done this,
   the start of the array should (again) look like the content of
   /some/file. It wouldn't hurt to check.

Then your data would be as much back together as possible. You will
probably still need to do an fsck, but I think you did the right thing in
holding off. Don't do an fsck until you are sure the array is writable.

You can probably do the above without using test_stripe, by using dd to
take a copy of the array before you recreate it, then using dd to put the
same data back. Using test_stripe as well might give you extra
confidence.

Feel free to ask questions

NeilBrown
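[Editor's note] Steps 1-3 above can be sketched as a dry run. Everything numeric here is an assumption for illustration: the reshape position is a made-up stand-in for whatever "mdadm -E" actually reports, the chunk size is assumed to be given to test_stripe in bytes, the argument order simply mirrors Neil's example (file, device count, chunksize, level, layout, start, length), and the member-device list stays elided as in the original.

```shell
# HYPOTHETICAL numbers -- read the real reshape position from "mdadm -E".
chunksize=65536              # 64k chunk, assumed to be passed in bytes
reshape_pos=1146880000       # made-up reshape position, for illustration
data_disks=7                 # an 8-drive raid5 stripes data over 7 disks
stripe=$((chunksize * data_disks))
# Round down to a whole stripe so the save and the restore agree on where
# the 8-drive-striped region ends.
length=$(( reshape_pos / stripe * stripe ))
echo "test_stripe save /some/file 8 $chunksize 5 2 0 $length /dev/.."
echo "test_stripe restore /some/file 6 $chunksize 5 2 0 $length /dev/.."
```

(The layout value 2 is the numeric code for left-symmetric, matching the --detail output earlier in the thread.)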