Re: mdadm RAID5 array failure
> mdadm -Af /dev/md0 should get it back for you.

It did indeed... Thank you. But you really want to find out why it died.

Well, it looks like I have a bad section on hde, which got tickled as I was copying files onto it... As the rebuild progressed and hit around 6%, it hit the same spot on the disk again and locked the box up solid. I ended up setting speed_limit_min and speed_limit_max to 0 so that the rebuild didn't happen, activated my LVM volume groups, and mounted the first of the logical volumes. I've just copied off all the files on that LV, and tomorrow I'll get the other two done.

I do have a spare drive in the array... any idea why it wasn't being activated when hde went offline?

> What kernel version are you running?

The kernel is 2.6.17-1.2142.FC4, and mdadm is v1.11.0 (11 April 2005). I am assuming that the underlying RAID doesn't do any bad block handling?

Once again, thank you for your help.

Graham

----- Original Message -----
From: Neil Brown [EMAIL PROTECTED]
To: jahammonds prost [EMAIL PROTECTED]
Cc: linux-raid@vger.kernel.org
Sent: Wednesday, 7 February, 2007 10:57:47 PM
Subject: Re: mdadm RAID5 array failure

On Thursday February 8, [EMAIL PROTECTED] wrote:
> I'm running an FC4 system. I was copying some files onto the server this weekend, and the server locked up hard, and I had to power off. I rebooted the server, and the array came up fine, but when I tried to fsck the filesystem, fsck just locked up at about 40%. I left it sitting there for 12 hours, hoping it was going to come back, but I had to power off the server again. When I now reboot the server, it fails to mount my RAID5 array:
>
>   mdadm: /dev/md0 assembled from 3 drives and 1 spare - not enough to start the array.

mdadm -Af /dev/md0 should get it back for you. But you really want to find out why it died. Were there any kernel messages at the time of the first failure? What kernel version are you running?

> I've added the output from the various files/commands at the bottom... I am a little confused by the output. According to /dev/hd[cgh], there is only 1 failed disk in the array, so why does it think that there are 3 failed disks in the array?

You need to look at the 'Events' count. md will look for the device with the highest event count and reject anything with an event count 2 or more less than that.

NeilBrown
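(For reference, the recovery sequence described above amounts to roughly the following. This is only a sketch: the volume group, logical volume, and destination paths are placeholders, not names taken from this thread.)

  # Throttle md resync so the rebuild stops hitting the bad sector on hde
  echo 0 > /proc/sys/dev/raid/speed_limit_min
  echo 0 > /proc/sys/dev/raid/speed_limit_max

  # Force-assemble the array from the members with the freshest superblocks
  mdadm -Af /dev/md0

  # Activate the LVM volume groups and mount a logical volume read-only
  vgchange -ay
  mount -o ro /dev/myvg/mylv /mnt/recovery    # placeholder VG/LV and mount point

  # Copy the data somewhere safe before doing anything else to the array
  rsync -a /mnt/recovery/ /some/other/disk/mylv-copy/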
mdadm RAID5 array failure
I'm running an FC4 system. I was copying some files onto the server this weekend, and the server locked up hard, and I had to power off. I rebooted the server, and the array came up fine, but when I tried to fsck the filesystem, fsck just locked up at about 40%. I left it sitting there for 12 hours, hoping it was going to come back, but I had to power off the server again. When I now reboot the server, it fails to mount my RAID5 array:

mdadm: /dev/md0 assembled from 3 drives and 1 spare - not enough to start the array.

I've added the output from the various files/commands at the bottom. I am a little confused by the output: according to /dev/hd[cgh], there is only 1 failed disk in the array, so why does it think that there are 3 failed disks in the array? It looks like there is only 1 failed disk. I got an error from smartd about it when I got the server back into multiuser mode, so I know there is an issue with the disk (Device: /dev/hde, 8 Offline uncorrectable sectors), but there are still enough disks to bring up the array, and for the spare disk to start rebuilding.

I've spent the last couple of days googling around, and I can't seem to find much on how to recover a failed md array. Is there any way to get the array back and working? Unfortunately I don't have a backup of this array, and I'd really like to try and get the data back (there are 3 LVM logical volumes on it). Thanks very much for any help.

Graham

My /etc/mdadm.conf looks like this:

# cat /etc/mdadm.conf
DEVICE /dev/hd*[a-z]
ARRAY /dev/md0 level=raid5 num-devices=6 UUID=96c7d78a:2113ea58:9dc237f1:79a60ddf
   devices=/dev/hdh,/dev/hdg,/dev/hdf,/dev/hde,/dev/hdd,/dev/hdc,/dev/hdb

Looking at /proc/mdstat, I am getting this output:

# cat /proc/mdstat
Personalities : [raid5] [raid4]
md0 : inactive hdc[0] hdb[6] hdh[5] hdg[4] hdf[3] hde[2] hdd[1]
      137832 blocks super non-persistent
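(If you want to double-check what smartd reported for hde, something along these lines will show the health summary and the relevant counters. It assumes the smartmontools package is installed, and attribute names can differ slightly between drive models.)

  # Overall SMART health verdict for the suspect disk
  smartctl -H /dev/hde

  # Full attribute dump, filtered to the counters that matter here
  smartctl -a /dev/hde | grep -i -E 'offline_uncorrectable|reallocated|pending'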
Here's the output when run against the device that is thought to have failed:

# mdadm -E /dev/hde
/dev/hde:
          Magic : a92b4efc
        Version : 00.90.02
           UUID : 96c7d78a:2113ea58:9dc237f1:79a60ddf
  Creation Time : Wed Feb  1 17:10:39 2006
     Raid Level : raid5
   Raid Devices : 6
  Total Devices : 7
Preferred Minor : 0

    Update Time : Sun Feb  4 17:29:53 2007
          State : active
 Active Devices : 6
Working Devices : 7
 Failed Devices : 0
  Spare Devices : 1
       Checksum : dcab70d - correct
         Events : 0.840944

         Layout : left-symmetric
     Chunk Size : 128K

      Number   Major   Minor   RaidDevice State
this     2      33        0        2      active sync   /dev/hde

   0     0      22        0        0      active sync   /dev/hdc
   1     1      22       64        1      active sync   /dev/hdd
   2     2      33        0        2      active sync   /dev/hde
   3     3      33       64        3      active sync   /dev/hdf
   4     4      34        0        4      active sync   /dev/hdg
   5     5      34       64        5      active sync   /dev/hdh
   6     6       3       64        6      spare   /dev/hdb

Running mdadm -E on /dev/hd[bcgh] gives this:

      Number   Major   Minor   RaidDevice State
this     6       3       64        6      spare   /dev/hdb

   0     0      22        0        0      active sync   /dev/hdc
   1     1      22       64        1      active sync   /dev/hdd
   2     2       0        0        2      faulty removed
   3     3      33       64        3      active sync   /dev/hdf
   4     4      34        0        4      active sync   /dev/hdg
   5     5      34       64        5      active sync   /dev/hdh
   6     6       3       64        6      spare   /dev/hdb

And running mdadm -E on /dev/hd[def]:

      Number   Major   Minor   RaidDevice State
this     3      33       64        3      active sync   /dev/hdf

   0     0      22        0        0      active sync   /dev/hdc
   1     1      22       64        1      active sync   /dev/hdd
   2     2      33        0        2      active sync   /dev/hde
   3     3      33       64        3      active sync   /dev/hdf
   4     4      34        0        4      active sync   /dev/hdg
   5     5      34       64        5      active sync   /dev/hdh
   6     6       3       64        6      spare   /dev/hdb

Looking at /var/log/messages shows the following:

Feb 6 12:36:42 file01bert kernel: md: bind<hdd>
Feb 6 12:36:42 file01bert kernel: md: bind<hde>
Feb 6 12:36:42 file01bert kernel: md: bind<hdf>
Feb 6 12:36:42 file01bert kernel: md: bind<hdg>
Feb 6 12:36:42 file01bert kernel: md: bind<hdh>
Feb 6 12:36:42 file01bert kernel: md: bind<hdb>
Feb 6 12:36:42 file01bert kernel: md: bind<hdc>
Feb 6 12:36:42 file01bert kernel: md: kicking non-fresh hdf from array!
Feb 6 12:36:42 file01bert kernel: md: unbind<hdf>
Feb 6 12:36:42 file01bert kernel: md: export_rdev(hdf)
Feb 6 12:36:42 file01bert kernel: md: kicking non-fresh hde from array!
Feb 6 12:36:42
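(One quick way to see which members md considers stale, given the differing superblocks above, is to pull the update time, state, and event counter out of every member's superblock; the device list below matches the one in mdadm.conf. A member whose event count lags the newest by 2 or more is rejected at assembly time.)

  # Compare the event counters recorded on each member of the array
  for d in /dev/hd[bcdefgh]; do
      echo "== $d =="
      mdadm -E "$d" | grep -E 'Update Time|State|Events'
  done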
Re: mdadm RAID5 array failure
On Thursday February 8, [EMAIL PROTECTED] wrote:
> I'm running an FC4 system. I was copying some files onto the server this weekend, and the server locked up hard, and I had to power off. I rebooted the server, and the array came up fine, but when I tried to fsck the filesystem, fsck just locked up at about 40%. I left it sitting there for 12 hours, hoping it was going to come back, but I had to power off the server again. When I now reboot the server, it fails to mount my RAID5 array:
>
>   mdadm: /dev/md0 assembled from 3 drives and 1 spare - not enough to start the array.

mdadm -Af /dev/md0 should get it back for you. But you really want to find out why it died. Were there any kernel messages at the time of the first failure? What kernel version are you running?

> I've added the output from the various files/commands at the bottom... I am a little confused by the output. According to /dev/hd[cgh], there is only 1 failed disk in the array, so why does it think that there are 3 failed disks in the array?

You need to look at the 'Events' count. md will look for the device with the highest event count and reject anything with an event count 2 or more less than that.

NeilBrown
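(Putting this advice together with the follow-up at the top of the thread, the overall flow looks roughly like the sketch below. It is only an illustration: whether and when to fail hde out of the array is a judgement call, and only makes sense once the data has been copied off.)

  # Force assembly from the members with the freshest superblocks
  mdadm -Af /dev/md0

  # Confirm the array is running and watch any resync onto the spare
  cat /proc/mdstat

  # Once the data is safe, retire the failing member so it can be replaced
  mdadm /dev/md0 --fail /dev/hde
  mdadm /dev/md0 --remove /dev/hde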