Re: mdadm RAID5 array failure

2007-02-08 Thread jahammonds prost
> mdadm -Af /dev/md0 should get it back for you.

It did indeed... Thank you.

> But you really want to find out why it died.

Well, it looks like I have a bad section on hde, which got tickled as I was 
copying files onto it... As the rebuild progressed and hit around 6%, it hit 
the same spot on the disk again and locked the box up solid. I ended up 
setting speed_limit_min and speed_limit_max to 0 so that the rebuild didn't 
happen, activated my LVM volume groups, and mounted the first of the logical 
volumes. I've just copied off all the files on that LV, and tomorrow I'll get 
the other 2 done. I do have a spare drive in the array... any idea why it 
wasn't being activated when hde went offline?
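
(For anyone hitting the same problem later, the recovery steps described above 
look roughly like this. The volume group and logical volume names below are 
placeholders, so substitute your own; mounting read-only is just an extra 
precaution, not something required:)

# echo 0 > /proc/sys/dev/raid/speed_limit_min   # throttle the resync
# echo 0 > /proc/sys/dev/raid/speed_limit_max
# vgchange -ay vg_raid                          # activate the LVM volume group (placeholder name)
# mount -o ro /dev/vg_raid/lv_home /mnt/rescue  # mount one LV read-only (placeholder names)
# cp -a /mnt/rescue/. /some/other/disk/         # copy the data somewhere safe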

> What kernel version are you running?

Kernel is 2.6.17-1.2142.FC4, and mdadm is V1.11.0 11 April 2005

I am assuming that the underlying RAID doesn't do any bad block handling?


Once again, thank you for your help.


Graham

----- Original Message -----
From: Neil Brown [EMAIL PROTECTED]
To: jahammonds prost [EMAIL PROTECTED]
Cc: linux-raid@vger.kernel.org
Sent: Wednesday, 7 February, 2007 10:57:47 PM
Subject: Re: mdadm RAID5 array failure


On Thursday February 8, [EMAIL PROTECTED] wrote:

> I'm running an FC4 system. I was copying some files on to the server
> this weekend, and the server locked up hard, and I had to power
> off. I rebooted the server, and the array came up fine, but when I
> tried to fsck the filesystem, fsck just locked up at about 40%. I
> left it sitting there for 12 hours, hoping it was going to come
> back, but I had to power off the server again. When I now reboot the
> server, it is failing to mount my raid5 array..
>
>   mdadm: /dev/md0 assembled from 3 drives and 1 spare - not enough to
> start the array.

mdadm -Af /dev/md0
should get it back for you.  But you really want to find out why it
died.
Were there any kernel messages at the time of the first failure?
What kernel version are you running?

  
> I've added the output from the various files/commands at the bottom...
> I am a little confused at the output.. According to /dev/hd[cgh],
> there is only 1 failed disk in the array, so why does it think that
> there are 3 failed disks in the array?

You need to look at the 'Event' count.  md will look for the device
with the highest event count and reject anything with an event count 2
or more less than that.

NeilBrown
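
(For the record, the forced assembly suggested here would look something like 
the following. Depending on whether md0 is still sitting there half-assembled, 
it may need to be stopped first; listing the members explicitly is optional if 
mdadm.conf is correct:)

# mdadm --stop /dev/md0
# mdadm --assemble --force /dev/md0 /dev/hd[bcdefgh]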





mdadm RAID5 array failure

2007-02-07 Thread jahammonds prost
I'm running an FC4 system. I was copying some files on to the server this 
weekend, and the server locked up hard, and I had to power off. I rebooted the 
server, and the array came up fine, but when I tried to fsck the filesystem, 
fsck just locked up at about 40%. I left it sitting there for 12 hours, hoping 
it was going to come back, but I had to power off the server again. When I now 
reboot the server, it is failing to mount my raid5 array..
 
  mdadm: /dev/md0 assembled from 3 drives and 1 spare - not enough to start 
the array.
 
I've added the output from the various files/commands at the bottom...
I am a little confused at the output.. According to /dev/hd[cgh], there is only 
1 failed disk in the array, so why does it think that there are 3 failed disks 
in the array? It looks like there is only 1 failed disk – I got an error from 
SMARTD about it when I got the server back into multiuser mode, so I know there 
is an issue with the disk (Device: /dev/hde, 8 Offline uncorrectable sectors), 
but there are still enough disks to bring up the array, and for the spare disk 
to start rebuilding.
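
(As a side note, the health of the suspect drive can be checked with smartctl 
from smartmontools; the extended self-test runs in the background and can take 
a few hours:)

# smartctl -a /dev/hde           # SMART health, attributes and error log
# smartctl -t long /dev/hde      # start an extended offline self-test
# smartctl -l selftest /dev/hde  # check the self-test results later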
 
I've spent the last couple of days googling around, and I can't seem to find 
much on how to recover a failed md array. Is there any way to get the array 
back and working? Unfortunately I don't have a back up of this array, and I'd 
really like to try and get the data back (there are 3 LVM logical volumes on 
it).
 
Thanks very much for any help.
 
 
Graham
 
 
 
My /etc/mdadm.conf looks like this
 
# cat /etc/mdadm.conf
DEVICE /dev/hd*[a-z]
ARRAY /dev/md0 level=raid5 num-devices=6 
UUID=96c7d78a:2113ea58:9dc237f1:79a60ddf
  
devices=/dev/hdh,/dev/hdg,/dev/hdf,/dev/hde,/dev/hdd,/dev/hdc,/dev/hdb
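
(If the conf file ever gets out of step with the array, the ARRAY line can be 
regenerated from the on-disk superblocks with something like:)

# echo 'DEVICE /dev/hd*[a-z]' > /etc/mdadm.conf
# mdadm --examine --scan >> /etc/mdadm.conf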
 
 
Looking at /proc/mdstat, I am getting this output
 
# cat /proc/mdstat
Personalities : [raid5] [raid4]
md0 : inactive hdc[0] hdb[6] hdh[5] hdg[4] hdf[3] hde[2] hdd[1]
  137832 blocks super non-persistent
 
 
 
 
Here's the output when run on the device that the other disks think has failed.
 
# mdadm -E /dev/hde
/dev/hde:
          Magic : a92b4efc
        Version : 00.90.02
           UUID : 96c7d78a:2113ea58:9dc237f1:79a60ddf
  Creation Time : Wed Feb  1 17:10:39 2006
     Raid Level : raid5
   Raid Devices : 6
  Total Devices : 7
Preferred Minor : 0

    Update Time : Sun Feb  4 17:29:53 2007
          State : active
 Active Devices : 6
Working Devices : 7
 Failed Devices : 0
  Spare Devices : 1
       Checksum : dcab70d - correct
         Events : 0.840944

         Layout : left-symmetric
     Chunk Size : 128K
 
      Number   Major   Minor   RaidDevice State
this     2      33        0        2      active sync   /dev/hde

   0     0      22        0        0      active sync   /dev/hdc
   1     1      22       64        1      active sync   /dev/hdd
   2     2      33        0        2      active sync   /dev/hde
   3     3      33       64        3      active sync   /dev/hdf
   4     4      34        0        4      active sync   /dev/hdg
   5     5      34       64        5      active sync   /dev/hdh
   6     6       3       64        6      spare   /dev/hdb
 
 
Running an mdadm -E on /dev/hd[bcgh] gives this,
 
 
      Number   Major   Minor   RaidDevice State
this     6       3       64        6      spare   /dev/hdb

   0     0      22        0        0      active sync   /dev/hdc
   1     1      22       64        1      active sync   /dev/hdd
   2     2       0        0        2      faulty removed
   3     3      33       64        3      active sync   /dev/hdf
   4     4      34        0        4      active sync   /dev/hdg
   5     5      34       64        5      active sync   /dev/hdh
   6     6       3       64        6      spare   /dev/hdb
 
 
 
And running mdadm -E on /dev/hd[def]
 
      Number   Major   Minor   RaidDevice State
this     3      33       64        3      active sync   /dev/hdf

   0     0      22        0        0      active sync   /dev/hdc
   1     1      22       64        1      active sync   /dev/hdd
   2     2      33        0        2      active sync   /dev/hde
   3     3      33       64        3      active sync   /dev/hdf
   4     4      34        0        4      active sync   /dev/hdg
   5     5      34       64        5      active sync   /dev/hdh
   6     6       3       64        6      spare   /dev/hdb
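
(A quick way to compare the event counts on each member's superblock, using the 
same device names as above:)

# for d in /dev/hd[bcdefgh]; do echo -n "$d: "; mdadm -E $d | grep Events; done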
 
 
Looking at /var/log/messages shows the following
 
Feb  6 12:36:42 file01bert kernel: md: bind<hdd>
Feb  6 12:36:42 file01bert kernel: md: bind<hde>
Feb  6 12:36:42 file01bert kernel: md: bind<hdf>
Feb  6 12:36:42 file01bert kernel: md: bind<hdg>
Feb  6 12:36:42 file01bert kernel: md: bind<hdh>
Feb  6 12:36:42 file01bert kernel: md: bind<hdb>
Feb  6 12:36:42 file01bert kernel: md: bind<hdc>
Feb  6 12:36:42 file01bert kernel: md: kicking non-fresh hdf from array!
Feb  6 12:36:42 file01bert kernel: md: unbind<hdf>
Feb  6 12:36:42 file01bert kernel: md: export_rdev(hdf)
Feb  6 12:36:42 file01bert kernel: md: kicking non-fresh hde from array!
Feb  6 12:36:42 

Re: mdadm RAID5 array failure

2007-02-07 Thread Neil Brown
On Thursday February 8, [EMAIL PROTECTED] wrote:

> I'm running an FC4 system. I was copying some files on to the server
> this weekend, and the server locked up hard, and I had to power
> off. I rebooted the server, and the array came up fine, but when I
> tried to fsck the filesystem, fsck just locked up at about 40%. I
> left it sitting there for 12 hours, hoping it was going to come
> back, but I had to power off the server again. When I now reboot the
> server, it is failing to mount my raid5 array..
>
>   mdadm: /dev/md0 assembled from 3 drives and 1 spare - not enough to
> start the array.

mdadm -Af /dev/md0
should get it back for you.  But you really want to find out why it
died.
Were there any kernel messages at the time of the first failure?
What kernel version are you running?

  
> I've added the output from the various files/commands at the bottom...
> I am a little confused at the output.. According to /dev/hd[cgh],
> there is only 1 failed disk in the array, so why does it think that
> there are 3 failed disks in the array?

You need to look at the 'Event' count.  md will look for the device
with the highest event count and reject anything with an event count 2
or more less than that.

NeilBrown