Re: FailSpare event?
On 15 Jan 2007, Bill Davidsen told this:
> Nix wrote:
>>    Number   Major   Minor   RaidDevice State
>>       0       8        6        0      active sync   /dev/sda6
>>       1       8       22        1      active sync   /dev/sdb6
>>       3      22        5        2      active sync   /dev/hdc5
>>
>>    Number   Major   Minor   RaidDevice State
>>       0       8       23        0      active sync   /dev/sdb7
>>       1       8        7        1      active sync   /dev/sda7
>>       3       3        5        2      active sync   /dev/hda5
>>
>> 0, 1, and *3*. Where has number 2 gone? (And how does `Number' differ
>> from `RaidDevice'? Why have both?)
>
> Did you ever move the data to these drives from another? I think this is
> what you see when you migrate by adding a drive as a spare, then mark an
> existing drive as failed, so the data is rebuilt on the new drive. Was
> there ever a device 2?

Nope. These arrays were created in one lump and never had a spare. Plenty
of pvmoves have happened on them, but that's *inside* the arrays, of
course...

-- 
`He accused the FSF of being something of a hypocrit, which shows that he
 neither understands hypocrisy nor can spell.' --- jimmybgood
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: FailSpare event?
On 14 Jan 2007, Neil Brown told this:
> A quick look suggests that the following patch might make a difference,
> but there is more to it than that. I think there are subtle differences
> due to the use of version-1 superblocks. That might be just another
> one-line change, but I want to make sure first.

Well, that certainly made that warning go away. I don't have any
actually-failed disks, so I can't tell if it would *ever* warn anymore ;)

... actually, it just picked up some monthly array check activity:

Jan 15 20:03:17 loki daemon warning: mdadm: Rebuild20 event detected on md device /dev/md2

So it looks like it works perfectly well now. (Looking at the code, yeah,
without that change it'll never remember state changes at all!)

One bit of residue from the state before this patch remains on line 352,
where you initialize disc.state and then never use it for anything...

-- 
`He accused the FSF of being something of a hypocrit, which shows that he
 neither understands hypocrisy nor can spell.' --- jimmybgood
Re: FailSpare event?
On 13 Jan 2007, [EMAIL PROTECTED] uttered the following:
> mdadm-2.6 bug, I fear. I haven't tracked it down yet but will look
> shortly: I can't afford to not run mdadm --monitor... odd, that code
> hasn't changed during 2.6 development.

Whoo! Compile Monitor.c without optimization and the problem goes away.
Hunting: maybe it's a compiler bug (anyone not using GCC 4.1.1 seeing
this?), maybe mdadm is tripping undefined behaviour somewhere...

-- 
`He accused the FSF of being something of a hypocrit, which shows that he
 neither understands hypocrisy nor can spell.' --- jimmybgood
Re: FailSpare event?
On Sunday January 14, [EMAIL PROTECTED] wrote:
> On 13 Jan 2007, [EMAIL PROTECTED] uttered the following:
>> mdadm-2.6 bug, I fear. I haven't tracked it down yet but will look
>> shortly: I can't afford to not run mdadm --monitor... odd, that code
>> hasn't changed during 2.6 development.
>
> Whoo! Compile Monitor.c without optimization and the problem goes away.
> Hunting: maybe it's a compiler bug (anyone not using GCC 4.1.1 seeing
> this?), maybe mdadm is tripping undefined behaviour somewhere...

Probably.

A quick look suggests that the following patch might make a difference,
but there is more to it than that. I think there are subtle differences
due to the use of version-1 superblocks. That might be just another
one-line change, but I want to make sure first.

Thanks,
NeilBrown

### Diffstat output
 ./Monitor.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff .prev/Monitor.c ./Monitor.c
--- .prev/Monitor.c	2006-12-21 17:15:55.000000000 +1100
+++ ./Monitor.c	2007-01-15 08:17:30.000000000 +1100
@@ -383,7 +383,7 @@ int Monitor(mddev_dev_t devlist,
 				)
 				alert("SpareActive", dev, dv, mailaddr, mailfrom, alert_cmd, dosyslog);
 		}
-		st->devstate[i] = disc.state;
+		st->devstate[i] = newstate;
 		st->devid[i] = makedev(disc.major, disc.minor);
 	}
 	st->active = array.active_disks;
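The one-line fix matters because of how the monitor loop works: each pass computes a fresh `newstate` for every component device, compares it against the state recorded on the previous pass, and must then record the value it just compared. A minimal sketch of that logic (a hypothetical Python model for illustration, not mdadm's actual data structures):

```python
# Sketch of Monitor.c-style per-device state tracking. The names here
# (poll, recorded, alerts) are illustrative, not mdadm's.

def poll(recorded, current, alerts, remember_newstate=True):
    """One monitoring pass: report transitions, then remember them."""
    for dev, newstate in current.items():
        oldstate = recorded.get(dev)
        if oldstate is not None and newstate != oldstate:
            alerts.append((dev, oldstate, newstate))  # e.g. a SpareActive event
        # The fix: store the newstate just compared. The buggy code
        # stored a stale value, so the transition was never remembered
        # and the same event fired again on every pass.
        recorded[dev] = newstate if remember_newstate else oldstate

alerts = []
recorded = {"/dev/sda2": "spare"}
for _ in range(3):                 # three passes, one real transition
    poll(recorded, {"/dev/sda2": "active"}, alerts)
print(len(alerts))                 # -> 1: reported once, then remembered
```

With `remember_newstate=False` (modelling the pre-patch behaviour) the same loop reports the transition on all three passes, which matches the spurious events repeating every 60 seconds earlier in this thread.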
Re: FailSpare event?
On 12 Jan 2007, Ernst Herzberg told this:
> Then about every 60 sec, 4 times:
>
> event=SpareActive mddev=/dev/md3

I see exactly this on both my RAID-5 arrays, neither of which has any
spare device --- nor have any active devices transitioned to spare (which
is what that event is actually supposed to mean).

mdadm-2.6 bug, I fear. I haven't tracked it down yet but will look
shortly: I can't afford to not run mdadm --monitor... odd, that code
hasn't changed during 2.6 development.

-- 
`He accused the FSF of being something of a hypocrit, which shows that he
 neither understands hypocrisy nor can spell.' --- jimmybgood
Re: FailSpare event?
On Fri, 12 Jan 2007, Neil Brown might have said:
> On Thursday January 11, [EMAIL PROTECTED] wrote:
>> Can someone tell me what this means please? I just received this in an
>> email from one of my servers:
>>
>> A FailSpare event had been detected on md device /dev/md2.
>> It could be related to component device /dev/sde2.
>
> It means that mdadm has just noticed that /dev/sde2 is a spare and is
> faulty. You would normally expect this if the array is rebuilding a
> spare and a write to the spare fails. However...
>
>> md2 : active raid5 sdf2[4] sde2[5](F) sdd2[3] sdc2[2] sdb2[1] sda2[0]
>>       560732160 blocks level 5, 256k chunk, algorithm 2 [5/5] [UUUUU]
>
> That isn't the case here - your array doesn't need rebuilding. Possibly
> a superblock update failed. Possibly mdadm only just started monitoring
> the array and the spare has been faulty for some time.
>
>> Does the email message mean drive sde2[5] has failed? I know the sde2
>> refers to the second partition of /dev/sde. Here is the partition table
>
> It means that md thinks sde2 cannot be trusted. To find out why, you
> would need to look at the kernel logs for IO errors.
>
>> I have partition 2 of drive sde as one of the raid devices for md2.
>> Does the (S) on sde3[2](S) mean the device is a spare for md1, and the
>> same for md0?
>
> Yes, (S) means the device is a spare. You don't have (S) next to sde2 on
> md2 because (F) (failed) overrides (S). You can tell by the position [5]
> that it isn't part of the array (being a 5-disk array, the active
> positions are 0, 1, 2, 3, 4).
>
> NeilBrown

I have cleared the error by:

# mdadm --manage /dev/md2 -f /dev/sde2   ( make sure it has failed )
# mdadm --manage /dev/md2 -r /dev/sde2   ( remove from the array )
# mdadm --manage /dev/md2 -a /dev/sde2   ( add the device back to the array )
# mdadm --detail /dev/md2                ( verify there are no faults and
                                           the array knows about the spare )
Re: FailSpare event?
On 13 Jan 2007, [EMAIL PROTECTED] spake thusly:
> On 12 Jan 2007, Ernst Herzberg told this:
>> Then about every 60 sec, 4 times:
>>
>> event=SpareActive mddev=/dev/md3
>
> I see exactly this on both my RAID-5 arrays, neither of which has any
> spare device --- nor have any active devices transitioned to spare
> (which is what that event is actually supposed to mean).

Hm, the manual says that it means that a spare has transitioned to active
(which seems more likely). Perhaps the comment at line 82 of Monitor.c is
wrong, or I just don't understand what a `reverse transition' is supposed
to be.

-- 
`He accused the FSF of being something of a hypocrit, which shows that he
 neither understands hypocrisy nor can spell.' --- jimmybgood
Re: FailSpare event?
On 13 Jan 2007, [EMAIL PROTECTED] uttered the following:
> On 12 Jan 2007, Ernst Herzberg told this:
>> Then about every 60 sec, 4 times:
>>
>> event=SpareActive mddev=/dev/md3
>
> I see exactly this on both my RAID-5 arrays, neither of which has any
> spare device --- nor have any active devices transitioned to spare
> (which is what that event is actually supposed to mean).

One oddity has already come to light. My /proc/mdstat says

md2 : active raid5 sdb7[0] hda5[3] sda7[1]
      19631104 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/3] [UUU]
md1 : active raid5 sda6[0] hdc5[3] sdb6[1]
      76807296 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/3] [UUU]

hda5 and hdc5 look odd. Indeed, --examine says

   Number   Major   Minor   RaidDevice State
      0       8        6        0      active sync   /dev/sda6
      1       8       22        1      active sync   /dev/sdb6
      3      22        5        2      active sync   /dev/hdc5

   Number   Major   Minor   RaidDevice State
      0       8       23        0      active sync   /dev/sdb7
      1       8        7        1      active sync   /dev/sda7
      3       3        5        2      active sync   /dev/hda5

0, 1, and *3*. Where has number 2 gone? (And how does `Number' differ from
`RaidDevice'? Why have both?)

-- 
`He accused the FSF of being something of a hypocrit, which shows that he
 neither understands hypocrisy nor can spell.' --- jimmybgood
Re: FailSpare event?
On Thursday 11 January 2007 23:23, Neil Brown wrote:
> On Thursday January 11, [EMAIL PROTECTED] wrote:
>> Can someone tell me what this means please? I just received this in an
>> email from one of my servers:

Same problem here, on different machines. But only with mdadm 2.6; with
mdadm 2.5.5, no problems.

First machine sends, directly after starting mdadm in monitor mode
(kernel 2.6.20-rc3):

---
event=DeviceDisappeared mddev=/dev/md1 device=Wrong-Level

Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4]
md1 : active raid0 sdb2[1] sda2[0]
      3904704 blocks 16k chunks
md2 : active raid0 sdb3[1] sda3[0]
      153930112 blocks 16k chunks
md3 : active raid5 sdf1[3] sde1[2] sdd1[1] sdc1[0]
      732587712 blocks level 5, 16k chunk, algorithm 2 [4/4] [UUUU]
md0 : active raid1 sdb1[1] sda1[0]
      192640 blocks [2/2] [UU]
unused devices: <none>
---

and a second time for md2. Then about every 60 sec, 4 times:

event=SpareActive mddev=/dev/md3

**

Second machine sends, about every 60 sec, 8 messages with
(kernel 2.6.19.2):

--
event=SpareActive mddev=/dev/md0 device=

Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md1 : active raid1 sdb1[1] sda1[0]
      979840 blocks [2/2] [UU]
md3 : active raid5 sdh1[5] sdg1[4] sdf1[3] sde1[2] sdd1[1] sdc1[0]
      4899200 blocks level 5, 8k chunk, algorithm 2 [6/6] [UUUUUU]
md2 : active raid5 sdh2[7] sdg2[6] sdf2[5] sde2[4] sdd2[3] sdc2[2] sdb2[1] sda2[0]
      6858880 blocks level 5, 4k chunk, algorithm 2 [8/8] [UUUUUUUU]
md0 : active raid5 sdh3[7] sdg3[6] sdf3[5] sde3[4] sdd3[3] sdc3[2] sdb3[1] sda3[0]
      235086656 blocks level 5, 16k chunk, algorithm 2 [8/8] [UUUUUUUU]
unused devices: <none>
--

Neither machine has ever seen any spare device, and there are no failing
devices; everything works as expected.

earny
FailSpare event?
Can someone tell me what this means please? I just received this in an
email from one of my servers:

From: mdadm monitoring [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: FailSpare event on /dev/md2:$HOST.$DOMAIN.com

This is an automatically generated mail message from mdadm
running on $HOST.$DOMAIN.com

A FailSpare event had been detected on md device /dev/md2.
It could be related to component device /dev/sde2.

Faithfully yours, etc.

On this machine I execute:

$ cat /proc/mdstat
Personalities : [raid5] [raid4] [raid1]
md0 : active raid1 sdf1[2](S) sde1[3](S) sdd1[4](S) sdc1[5](S) sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]
md1 : active raid1 sdf3[2](S) sde3[3](S) sdd3[4](S) sdc3[5](S) sdb3[1] sda3[0]
      3068288 blocks [2/2] [UU]
md2 : active raid5 sdf2[4] sde2[5](F) sdd2[3] sdc2[2] sdb2[1] sda2[0]
      560732160 blocks level 5, 256k chunk, algorithm 2 [5/5] [UUUUU]
unused devices: <none>

Does the email message mean drive sde2[5] has failed? I know the sde2
refers to the second partition of /dev/sde. Here is the partition table:

[EMAIL PROTECTED] ~]# fdisk -l /dev/sde

Disk /dev/sde: 146.8 GB, 146815733760 bytes
255 heads, 63 sectors/track, 17849 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sde1   *           1          13      104391   fd  Linux raid autodetect
/dev/sde2              14       17465   140183190   fd  Linux raid autodetect
/dev/sde3           17466       17847     3068415   fd  Linux raid autodetect

I have partition 2 of drive sde as one of the raid devices for md2. Does
the (S) on sde3[2](S) mean the device is a spare for md1, and the same
for md0?

Mike
Re: FailSpare event?
On Thursday January 11, [EMAIL PROTECTED] wrote:
> Can someone tell me what this means please? I just received this in an
> email from one of my servers:
>
> A FailSpare event had been detected on md device /dev/md2.
> It could be related to component device /dev/sde2.

It means that mdadm has just noticed that /dev/sde2 is a spare and is
faulty. You would normally expect this if the array is rebuilding a spare
and a write to the spare fails. However...

> md2 : active raid5 sdf2[4] sde2[5](F) sdd2[3] sdc2[2] sdb2[1] sda2[0]
>       560732160 blocks level 5, 256k chunk, algorithm 2 [5/5] [UUUUU]

That isn't the case here - your array doesn't need rebuilding. Possibly a
superblock update failed. Possibly mdadm only just started monitoring the
array and the spare has been faulty for some time.

> Does the email message mean drive sde2[5] has failed? I know the sde2
> refers to the second partition of /dev/sde. Here is the partition table

It means that md thinks sde2 cannot be trusted. To find out why, you
would need to look at the kernel logs for IO errors.

> I have partition 2 of drive sde as one of the raid devices for md2.
> Does the (S) on sde3[2](S) mean the device is a spare for md1, and the
> same for md0?

Yes, (S) means the device is a spare. You don't have (S) next to sde2 on
md2 because (F) (failed) overrides (S). You can tell by the position [5]
that it isn't part of the array (being a 5-disk array, the active
positions are 0, 1, 2, 3, 4).

NeilBrown
Re: FailSpare event?
On Fri, 12 Jan 2007, Neil Brown might have said:
> It means that mdadm has just noticed that /dev/sde2 is a spare and is
> faulty. You would normally expect this if the array is rebuilding a
> spare and a write to the spare fails. However...
>
> That isn't the case here - your array doesn't need rebuilding. Possibly
> a superblock update failed. Possibly mdadm only just started monitoring
> the array and the spare has been faulty for some time.
>
> Yes, (S) means the device is a spare. You don't have (S) next to sde2 on
> md2 because (F) (failed) overrides (S). You can tell by the position [5]
> that it isn't part of the array (being a 5-disk array, the active
> positions are 0, 1, 2, 3, 4).
>
> NeilBrown

Thanks for the quick response. So I'm OK for the moment? Yes, I need to
find the error and fix everything back to the (S) state.

The messages in $HOST:/var/log/messages for the time of the email are:

Jan 11 16:04:25 elo kernel: sd 2:0:4:0: SCSI error: return code = 0x802
Jan 11 16:04:25 elo kernel: sde: Current: sense key: Hardware Error
Jan 11 16:04:25 elo kernel:     Additional sense: Internal target failure
Jan 11 16:04:25 elo kernel: Info fld=0x10b93c4d
Jan 11 16:04:25 elo kernel: end_request: I/O error, dev sde, sector 280575053
Jan 11 16:04:25 elo kernel: raid5: Disk failure on sde2, disabling device.
Jan 11 16:04:25 elo kernel: Operation continuing on 5 devices

This is a Dell box running Fedora Core with recent patches. It is a
production box, so I do not patch each night.

On AIX boxes I can blink the drives to identify a bad/failing device. Is
there a way to blink the drives in Linux?

Mike
Re: FailSpare event?
On Thursday January 11, [EMAIL PROTECTED] wrote:
> So I'm OK for the moment? Yes, I need to find the error and fix
> everything back to the (S) state.

Yes, OK for the moment.

> The messages in $HOST:/var/log/messages for the time of the email are:
>
> Jan 11 16:04:25 elo kernel: sd 2:0:4:0: SCSI error: return code = 0x802
> Jan 11 16:04:25 elo kernel: sde: Current: sense key: Hardware Error
> Jan 11 16:04:25 elo kernel:     Additional sense: Internal target failure
> Jan 11 16:04:25 elo kernel: Info fld=0x10b93c4d
> Jan 11 16:04:25 elo kernel: end_request: I/O error, dev sde, sector 280575053
> Jan 11 16:04:25 elo kernel: raid5: Disk failure on sde2, disabling device.
> Jan 11 16:04:25 elo kernel: Operation continuing on 5 devices

Given the sector number, it looks likely that it was a superblock update.
No idea how bad an 'internal target failure' is. Maybe power-cycling the
drive would 'fix' it, maybe not.

> On AIX boxes I can blink the drives to identify a bad/failing device.
> Is there a way to blink the drives in Linux?

Unfortunately not.

NeilBrown
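Neil's guess can be checked with arithmetic. Using the fdisk geometry quoted earlier in the thread (16065 sectors per cylinder, /dev/sde2 spanning cylinders 14-17465) and the version-0.90 superblock placement rule (the superblock sits 64 KiB below the 64 KiB-aligned end of the device), the superblock's absolute sector works out to exactly the failing sector in the log. A sketch (the 0.90 placement rule is an assumption here; the thread doesn't state this array's metadata version):

```python
# Locate where an md v0.90 superblock would sit on /dev/sde2 and compare
# with the failing sector from the kernel log. Geometry is from the
# fdisk output in this thread; the v0.90 layout rule is an assumption.

SECTORS_PER_CYLINDER = 16065          # 255 heads * 63 sectors/track

def sb_sector_v090(start_cyl, end_cyl):
    start = (start_cyl - 1) * SECTORS_PER_CYLINDER            # partition start (sector)
    size = (end_cyl - start_cyl + 1) * SECTORS_PER_CYLINDER   # partition length (sectors)
    # v0.90 superblock: 128 sectors (64 KiB) below the 64 KiB-aligned end
    return start + (size & ~127) - 128

print(sb_sector_v090(14, 17465))      # -> 280575053, the sector in the I/O error
```

The match (and the fact that Info fld=0x10b93c4d is the same sector in hex) is what makes "it was a superblock update" the likely reading.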
Re: FailSpare event?
google BadBlockHowto

Any "just google it" response sounds glib, but this is actually how to do
it :-)

If you're new to md and mdadm, don't forget to actually remove the drive
from the array before you start working on it with 'dd'.

-Mike

Mike wrote:
> On Fri, 12 Jan 2007, Neil Brown might have said:
>> Given the sector number, it looks likely that it was a superblock
>> update. No idea how bad an 'internal target failure' is. Maybe
>> power-cycling the drive would 'fix' it, maybe not.
>>
>> Unfortunately not.
>>
>> NeilBrown
>
> I found the smartctl command. I have a 'long' test running in the
> background. I checked this drive and the other drives. This drive has
> been used the least (confirms it is a spare?) and is the only one with
> 'Total uncorrected errors' > 0. How to determine the error, correct the
> error, or clear the error?
>
> Mike
>
> [EMAIL PROTECTED] ~]# smartctl -a /dev/sde
> smartctl version 5.36 [i686-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen
> Home page is http://smartmontools.sourceforge.net/
>
> Device: SEAGATE ST3146707LC Version: D703
> Serial number: 3KS30WY8
> Device type: disk
> Transport protocol: Parallel SCSI (SPI-4)
> Local Time is: Thu Jan 11 17:00:26 2007 CST
> Device supports SMART and is Enabled
> Temperature Warning Enabled
> SMART Health Status: OK
>
> Current Drive Temperature:     48 C
> Drive Trip Temperature:        68 C
> Elements in grown defect list: 0
>
> Vendor (Seagate) cache information
>   Blocks sent to initiator = 66108
>   Blocks received from initiator = 147374656
>   Blocks read from cache and sent to initiator = 42215
>   Number of read and write commands whose size <= segment size = 12635583
>   Number of read and write commands whose size > segment size = 0
>
> Vendor (Seagate/Hitachi) factory information
>   number of hours powered up = 3943.42
>   number of minutes until next internal SMART test = 94
>
> Error counter log:
>            Errors Corrected by           Total   Correction     Gigabytes    Total
>                ECC          rereads/    errors   algorithm      processed    uncorrected
>            fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
> read:    354       0         0       354        354          0.546           0
> write:     0       0         0         0          0        185.871           1
>
> Non-medium error count: 0
>
> SMART Self-test log
> Num  Test              Status                   segment  LifeTime  LBA_first_err [SK ASC ASQ]
>      Description                                number   (hours)
> # 1  Background long   Completed, segment failed   -       3943            - [-   -   -]
>
> Long (extended) Self Test duration: 2726 seconds [45.4 minutes]
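The arithmetic at the heart of the BadBlockHowto recommended above is mapping an absolute failing sector (LBA) to the filesystem block that covers it, so that block can then be targeted with dd. A sketch reusing this thread's numbers (the 4096-byte block size is an illustrative assumption, and in this particular case the bad sector falls in md superblock territory rather than inside a filesystem):

```python
# BadBlockHowto-style arithmetic: which filesystem block contains a
# given absolute failing sector? The partition start is computed from
# the fdisk geometry in this thread; the block size is an assumption.

SECTOR_SIZE = 512

def fs_block(failing_lba, partition_start_lba, fs_block_size=4096):
    """Filesystem block number containing the failing sector."""
    offset_bytes = (failing_lba - partition_start_lba) * SECTOR_SIZE
    return offset_bytes // fs_block_size

# /dev/sde2 starts at cylinder 14 -> sector (14 - 1) * 16065 = 208845
print(fs_block(280575053, 208845))    # -> 35045776
```

That block number is what you would then read back (or, as a last resort, rewrite) with dd against the partition, after removing the drive from the array as cautioned above.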
Re: FailSpare event?
2007/1/12, Mike [EMAIL PROTECTED]:
> # 1  Background long   Completed, segment failed   -       3943

This should still be in warranty. Try to get a replacement.

Best
Martin