Re: RAID5 array showing as degraded after motherboard replacement
James Lee wrote:
> Hi there,
>
> I'm running a 5-drive software RAID5 array across two controllers.
> The motherboard in that PC recently died - I sent the board back for
> RMA.  When I refitted the motherboard, connected up all the drives,
> and booted up, I found that the array was being reported as degraded
> (though all the data on it is intact).  I have 4 drives on the onboard
> controller and 1 drive on an XFX Revo 64 SATA controller card.  The
> drive which is being reported as not being in the array is the one
> connected to the XFX controller.
>
> The OS can see that drive fine, and "mdadm --examine" on that drive
> shows that it is part of the array and that there are 5 active devices
> in the array.  Doing "mdadm --examine" on one of the other four drives
> shows that the array has 4 active drives and one failed.  "mdadm
> --detail" for the array also shows 4 active and one failed.
>
> Now I haven't lost any data here, and I know I can just force a resync
> of the array, which is fine.  However I'm concerned about how this has
> happened.  One worry is that the XFX SATA controller is doing
> something funny to the drive.  I've noticed that its BIOS has
> defaulted to RAID0 mode (even though there's only one drive on it) - I
> can't see how this would cause any particular problems here, though.  I
> guess it's possible that some data on the drive got corrupted when the
> motherboard failed...

I notice in your later post that the driver thinks this is a JBOD setup.  Can you either tell the controller to do JBOD, or force the driver to treat this as a single-disk RAID0 setup?  I don't know what RAID0 on one drive means, but I suspect that having the controller in the mode you want is desirable.  That setting may well have been changed when the hardware failed and the board was replaced.
--
bill davidsen <[EMAIL PROTECTED]>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: RAID5 array showing as degraded after motherboard replacement
On Wed, 8 Nov 2006, James Lee wrote:
> However I'm still seeing the error messages in my dmesg (the ones I
> posted earlier), and they suggest that there is some kind of hardware
> fault (based on a quick Google of the error codes).  So I'm a little
> confused.

the fact that the error is in a geometry command really makes me
wonder... did you compare the number of blocks on the device vs. what
seems to be available when it's on the weird raid card?

-dean
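dean's block-count comparison can be scripted.  The sketch below is hedged: the `blockdev` device paths in the comments are placeholders for whatever the drive appears as on each controller, and the executable part only sanity-checks the 625134827-sector figure from James's dmesg arithmetically.

```shell
# On the live system, the kernel's view of the drive could be compared
# on each controller with something like (paths are placeholders):
#   blockdev --getsz /dev/hde   # sectors visible behind the NetCell card
#   blockdev --getsz /dev/sda   # sectors for a known-good array member
# A mismatch would support the "controller steals sectors" theory.
#
# As a sanity check, verify the dmesg geometry line arithmetically:
# 625134827 sectors * 512 bytes should match the reported "320069 MB".
SECTORS=625134827
MB=$(( SECTORS * 512 / 1000000 ))
echo "$MB MB"
```

The two `blockdev` numbers should be identical regardless of which controller the drive is cabled to; any difference localises the fault to the NetCell card's firmware rather than the disk.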
Re: RAID5 array showing as degraded after motherboard replacement
On 08/11/06, James Lee <[EMAIL PROTECTED]> wrote:
> On 06/11/06, dean gaudet <[EMAIL PROTECTED]> wrote:
> >
> > On Mon, 6 Nov 2006, James Lee wrote:
> >
> > > Thanks for the reply Dean.  I looked through dmesg output from the
> > > boot up, to check whether this was just an ordering issue during the
> > > system start up (since both evms and mdadm attempt to activate the
> > > array, which could cause things to go wrong...).
> > >
> > > Looking through the dmesg output though, it looks like the 'missing'
> > > disk is being detected before the array is assembled, but that the
> > > disk is throwing up errors.  I've attached the full output of dmesg;
> > > grepping it for "hde" gives the following:
> > >
> > > [17179574.084000] ide2: BM-DMA at 0xd400-0xd407, BIOS settings: hde:DMA, hdf:DMA
> > > [17179574.38] hde: NetCell SyncRAID(TM) SR5000 JBOD, ATA DISK drive
> > > [17179575.312000] hde: max request size: 512KiB
> > > [17179575.312000] hde: 625134827 sectors (320069 MB), CHS=38912/255/63, (U)DMA
> > > [17179575.312000] hde: set_geometry_intr: status=0x51 { DriveReady SeekComplete Error }
> > > [17179575.312000] hde: set_geometry_intr: error=0x04 { DriveStatusError }
> > > [17179575.312000] hde: cache flushes supported
> >
> > is it possible that the "NetCell SyncRAID" implementation is stealing some
> > of the sectors (even though it's marked JBOD)?  anyhow it could be the
> > disk is bad, but i'd still be tempted to see if the problem stays with the
> > controller if you swap the disk with another in the array.
> >
> > -dean
>
> Looks like you might be right.  I removed one of the other drives from
> the onboard controller, and moved the 'faulty' drive from the NetCell
> controller to the onboard one.  Booted up the machine, and the drive
> is still not added to the array correctly (so the array now fails to
> assemble, as there's only 3 out of 5 drives).  I've run the Seagate
> diagnostics tools over the drive and they report success when it's
> connected to the onboard controller and failure when it's connected
> to the NetCell controller (this may be a test tool issue though).
>
> I guess this indicates one of the following:
> 1) The NetCell controller is faulty and just not reading/writing data properly.
> 2) The NetCell controller's RAID implementation has somehow not been
>    transparent to the OS and has overwritten/modified md's superblocks.
> 3) EVMS somehow messed the config up on that drive when trying to
>    reassemble the array after the first time the controller came up.
>
> I'll test for 1) by attaching another drive (not one of the ones in
> the array!) to the NetCell controller and seeing if it passes
> diagnostics tests.  3) seems pretty unlikely.
>
> I bought the NetCell card mainly for its Linux compatibility - do they
> have known issues with mdadm?
>
> Thanks,
> James

Well, I'm still a little unsure what might have happened here.  I've
reconnected the 'bad' drive to the NetCell controller and run badblocks
over that device.  It isn't reporting any bad blocks at all, which I
guess pretty much indicates that neither the hard drive nor the
controller is faulty, right?

However I'm still seeing the error messages in my dmesg (the ones I
posted earlier), and they suggest that there is some kind of hardware
fault (based on a quick Google of the error codes).  So I'm a little
confused.

If the hard drive and controller are not faulty, then how can I go
about figuring out whether the drive got messed up by the controller
overwriting some data due to its internal RAIDing?  (That seems
unlikely - I'd assume it would have been reported and fixed by now,
since it would not just be a Linux problem.)  I guess the other
possibility is that in the process of the motherboard dying, some data
on the drive got corrupted - does this seem at all plausible?

Basically I'm just not sure how to move forward in a way that leaves me
confident this won't happen again (possibly in a more serious way that
means losing all the data on the array).  Would dumping the sectors at
the start of the drive help at all in figuring out what's going on?

[Sorry for the double mail - forgot to CC the list]
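Dumping the start of the drive, as James suggests, might look like the sketch below.  On the real system the input would be /dev/hde; here a scratch zero-filled image stands in for the device (an assumption, clearly marked) so the commands are safe to copy and run anywhere.

```shell
# Scratch image standing in for the suspect drive.  On the real machine
# the read-only equivalent would be (run as root):
#   dd if=/dev/hde bs=512 count=8 | od -A x -t x1 | less
IMG=$(mktemp)
dd if=/dev/zero of="$IMG" bs=512 count=8 2>/dev/null

# Dump the first 8 sectors for inspection, e.g. to check whether the
# partition table area still looks sane after the motherboard failure.
dd if="$IMG" bs=512 count=8 2>/dev/null | od -A x -t x1
rm -f "$IMG"
```

Note that for the 0.90 metadata in use here, md's superblock lives near the *end* of the partition, so `mdadm --examine` output is more diagnostic than the first sectors; the start of the disk mainly shows whether the partition table survived.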
Re: RAID5 array showing as degraded after motherboard replacement
On 06/11/06, dean gaudet <[EMAIL PROTECTED]> wrote:
> On Mon, 6 Nov 2006, James Lee wrote:
>
> > Thanks for the reply Dean.  I looked through dmesg output from the
> > boot up, to check whether this was just an ordering issue during the
> > system start up (since both evms and mdadm attempt to activate the
> > array, which could cause things to go wrong...).
> >
> > Looking through the dmesg output though, it looks like the 'missing'
> > disk is being detected before the array is assembled, but that the
> > disk is throwing up errors.  I've attached the full output of dmesg;
> > grepping it for "hde" gives the following:
> >
> > [17179574.084000] ide2: BM-DMA at 0xd400-0xd407, BIOS settings: hde:DMA, hdf:DMA
> > [17179574.38] hde: NetCell SyncRAID(TM) SR5000 JBOD, ATA DISK drive
> > [17179575.312000] hde: max request size: 512KiB
> > [17179575.312000] hde: 625134827 sectors (320069 MB), CHS=38912/255/63, (U)DMA
> > [17179575.312000] hde: set_geometry_intr: status=0x51 { DriveReady SeekComplete Error }
> > [17179575.312000] hde: set_geometry_intr: error=0x04 { DriveStatusError }
> > [17179575.312000] hde: cache flushes supported
>
> is it possible that the "NetCell SyncRAID" implementation is stealing some
> of the sectors (even though it's marked JBOD)?  anyhow it could be the
> disk is bad, but i'd still be tempted to see if the problem stays with the
> controller if you swap the disk with another in the array.
>
> -dean

Looks like you might be right.  I removed one of the other drives from
the onboard controller, and moved the 'faulty' drive from the NetCell
controller to the onboard one.  Booted up the machine, and the drive is
still not added to the array correctly (so the array now fails to
assemble, as there's only 3 out of 5 drives).  I've run the Seagate
diagnostics tools over the drive and they report success when it's
connected to the onboard controller and failure when it's connected to
the NetCell controller (this may be a test tool issue though).

I guess this indicates one of the following:
1) The NetCell controller is faulty and just not reading/writing data properly.
2) The NetCell controller's RAID implementation has somehow not been
   transparent to the OS and has overwritten/modified md's superblocks.
3) EVMS somehow messed the config up on that drive when trying to
   reassemble the array after the first time the controller came up.

I'll test for 1) by attaching another drive (not one of the ones in the
array!) to the NetCell controller and seeing if it passes diagnostics
tests.  3) seems pretty unlikely.

I bought the NetCell card mainly for its Linux compatibility - do they
have known issues with mdadm?

Thanks,
James
Re: RAID5 array showing as degraded after motherboard replacement
On Mon, 6 Nov 2006, James Lee wrote:
> Thanks for the reply Dean.  I looked through dmesg output from the
> boot up, to check whether this was just an ordering issue during the
> system start up (since both evms and mdadm attempt to activate the
> array, which could cause things to go wrong...).
>
> Looking through the dmesg output though, it looks like the 'missing'
> disk is being detected before the array is assembled, but that the
> disk is throwing up errors.  I've attached the full output of dmesg;
> grepping it for "hde" gives the following:
>
> [17179574.084000] ide2: BM-DMA at 0xd400-0xd407, BIOS settings: hde:DMA, hdf:DMA
> [17179574.38] hde: NetCell SyncRAID(TM) SR5000 JBOD, ATA DISK drive
> [17179575.312000] hde: max request size: 512KiB
> [17179575.312000] hde: 625134827 sectors (320069 MB), CHS=38912/255/63, (U)DMA
> [17179575.312000] hde: set_geometry_intr: status=0x51 { DriveReady SeekComplete Error }
> [17179575.312000] hde: set_geometry_intr: error=0x04 { DriveStatusError }
> [17179575.312000] hde: cache flushes supported

is it possible that the "NetCell SyncRAID" implementation is stealing some
of the sectors (even though it's marked JBOD)?  anyhow it could be the
disk is bad, but i'd still be tempted to see if the problem stays with the
controller if you swap the disk with another in the array.

-dean
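One concrete way to test the stolen-sectors theory: for v0.90 metadata, md reads the superblock from the last 64 KiB-aligned 64 KiB of the member device, so if the card hides even a few sectors at the end, the superblock read lands in the wrong place and the member looks "failed".  A hedged sketch of the offset arithmetic follows; it uses the 0.90 alignment formula and the whole-disk sector count from the dmesg above, though strictly the superblock sits at the end of the hde1 partition, not the raw disk.

```shell
# v0.90 superblock offset, in 512-byte sectors:
#   (device size rounded down to a 128-sector, i.e. 64 KiB, boundary) - 128
SECTORS=625134827                  # whole-disk size from the dmesg line
SB=$(( (SECTORS & ~127) - 128 ))
echo "0.90 superblock would start at sector $SB"
# If the NetCell card reports a smaller size than the onboard controller,
# this offset differs between controllers and md cannot find the
# superblock where it expects it - matching the observed symptoms.
```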
Re: RAID5 array showing as degraded after motherboard replacement
Thanks for the reply Dean.  I looked through dmesg output from the boot
up, to check whether this was just an ordering issue during the system
start up (since both evms and mdadm attempt to activate the array, which
could cause things to go wrong...).

Looking through the dmesg output though, it looks like the 'missing'
disk is being detected before the array is assembled, but that the disk
is throwing up errors.  I've attached the full output of dmesg; grepping
it for "hde" gives the following:

[17179574.084000] ide2: BM-DMA at 0xd400-0xd407, BIOS settings: hde:DMA, hdf:DMA
[17179574.38] hde: NetCell SyncRAID(TM) SR5000 JBOD, ATA DISK drive
[17179575.312000] hde: max request size: 512KiB
[17179575.312000] hde: 625134827 sectors (320069 MB), CHS=38912/255/63, (U)DMA
[17179575.312000] hde: set_geometry_intr: status=0x51 { DriveReady SeekComplete Error }
[17179575.312000] hde: set_geometry_intr: error=0x04 { DriveStatusError }
[17179575.312000] hde: cache flushes supported
[17179575.312000] hde: hde1
[17179967.224000] md: bind<hde1>
[17179967.224000] md: kicking non-fresh hde1 from array!
[17179967.224000] md: unbind<hde1>
[17179967.224000] md: export_rdev(hde1)

Am I right in thinking this looks like that drive is just bad (the two
set_geometry_intr errors, and the fact it gets kicked out of the array
by mdadm)?  I'll run it through the Seagate diagnostics suite tomorrow
to see whether it's faulty or not...

James

On 05/11/06, dean gaudet <[EMAIL PROTECTED]> wrote:
> On Sun, 5 Nov 2006, James Lee wrote:
> > Hi there,
> >
> > I'm running a 5-drive software RAID5 array across two controllers.
> > The motherboard in that PC recently died - I sent the board back for
> > RMA.  When I refitted the motherboard, connected up all the drives,
> > and booted up, I found that the array was being reported as degraded
> > (though all the data on it is intact).  I have 4 drives on the
> > onboard controller and 1 drive on an XFX Revo 64 SATA controller
> > card.  The drive which is being reported as not being in the array
> > is the one connected to the XFX controller.
> >
> > The OS can see that drive fine, and "mdadm --examine" on that drive
> > shows that it is part of the array and that there are 5 active devices
> > in the array.  Doing "mdadm --examine" on one of the other four drives
> > shows that the array has 4 active drives and one failed.  "mdadm
> > --detail" for the array also shows 4 active and one failed.
>
> that means the array was assembled without the 5th disk and is
> currently degraded.
>
> > Now I haven't lost any data here, and I know I can just force a resync
> > of the array, which is fine.  However I'm concerned about how this has
> > happened.  One worry is that the XFX SATA controller is doing
> > something funny to the drive.  I've noticed that its BIOS has
> > defaulted to RAID0 mode (even though there's only one drive on it) - I
> > can't see how this would cause any particular problems here, though.
> > I guess it's possible that some data on the drive got corrupted when
> > the motherboard failed...
>
> no it's more likely the devices were renamed or the 5th device didn't
> come up before the array was assembled.  it's possible that a different
> bios setting lead to the device using a different driver than is in
> your initrd... but i'm just guessing.
>
> > Any ideas what could cause mdadm to report as I've described above
> > (I've attached the output of these three commands)?  I'm running
> > Ubuntu Edgy, which is a 2.6.17.x kernel, and mdadm 2.4.1.  In case
> > it's relevant here, I created the array using EVMS...
>
> i've never created an array with evms... but my guess is that it may
> have used "mapped" device names instead of the normal device names.
> take a look at /proc/mdstat and see what devices are in the array and
> use those as a template to find the name of the missing device.  below
> i'll use /dev/sde1 as the example missing device and /dev/md0 as the
> example array.
>
> first thing i'd try is something like this:
>
>     mdadm /dev/md0 -a /dev/sde1
>
> which hotadds the device into the array... which will start a resync.
> when the resync is done (cat /proc/mdstat) do this:
>
>     mdadm -Gb internal /dev/md0
>
> which will add write-intent bitmaps to your device... which will avoid
> another long wait for a resync after the next reboot if the fix below
> doesn't help.  then do this:
>
>     dpkg-reconfigure linux-image-`uname -r`
>
> which will rebuild the initrd for your kernel... and if it was a driver
> change this should include the new driver in the initrd.  then reboot
> and see if it comes up fine.  if it doesn't, you can repeat the
> "-a /dev/sde1" command above... the resync will be quick this time due
> to the bitmap... and we'll have to investigate further.
>
> -dean

[17179569.184000] Linux version 2.6.17-10-generic ([EMAIL PROTECTED]) (gcc version 4.1.2 20060928 (prerelease) (Ubuntu 4.1.
Re: RAID5 array showing as degraded after motherboard replacement
On Sun, 5 Nov 2006, James Lee wrote:
> Hi there,
>
> I'm running a 5-drive software RAID5 array across two controllers.
> The motherboard in that PC recently died - I sent the board back for
> RMA.  When I refitted the motherboard, connected up all the drives,
> and booted up, I found that the array was being reported as degraded
> (though all the data on it is intact).  I have 4 drives on the onboard
> controller and 1 drive on an XFX Revo 64 SATA controller card.  The
> drive which is being reported as not being in the array is the one
> connected to the XFX controller.
>
> The OS can see that drive fine, and "mdadm --examine" on that drive
> shows that it is part of the array and that there are 5 active devices
> in the array.  Doing "mdadm --examine" on one of the other four drives
> shows that the array has 4 active drives and one failed.  "mdadm
> --detail" for the array also shows 4 active and one failed.

that means the array was assembled without the 5th disk and is currently
degraded.

> Now I haven't lost any data here, and I know I can just force a resync
> of the array, which is fine.  However I'm concerned about how this has
> happened.  One worry is that the XFX SATA controller is doing
> something funny to the drive.  I've noticed that its BIOS has
> defaulted to RAID0 mode (even though there's only one drive on it) - I
> can't see how this would cause any particular problems here, though.  I
> guess it's possible that some data on the drive got corrupted when the
> motherboard failed...

no it's more likely the devices were renamed or the 5th device didn't
come up before the array was assembled.  it's possible that a different
bios setting lead to the device using a different driver than is in your
initrd... but i'm just guessing.

> Any ideas what could cause mdadm to report as I've described above
> (I've attached the output of these three commands)?  I'm running
> Ubuntu Edgy, which is a 2.6.17.x kernel, and mdadm 2.4.1.  In case
> it's relevant here, I created the array using EVMS...

i've never created an array with evms... but my guess is that it may have
used "mapped" device names instead of the normal device names.  take a
look at /proc/mdstat and see what devices are in the array and use those
as a template to find the name of the missing device.  below i'll use
/dev/sde1 as the example missing device and /dev/md0 as the example
array.

first thing i'd try is something like this:

    mdadm /dev/md0 -a /dev/sde1

which hotadds the device into the array... which will start a resync.
when the resync is done (cat /proc/mdstat) do this:

    mdadm -Gb internal /dev/md0

which will add write-intent bitmaps to your device... which will avoid
another long wait for a resync after the next reboot if the fix below
doesn't help.  then do this:

    dpkg-reconfigure linux-image-`uname -r`

which will rebuild the initrd for your kernel... and if it was a driver
change this should include the new driver in the initrd.  then reboot
and see if it comes up fine.  if it doesn't, you can repeat the
"-a /dev/sde1" command above... the resync will be quick this time due
to the bitmap... and we'll have to investigate further.

-dean
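dean's three steps can be collected into one small script.  This is only a sketch: /dev/md0 and /dev/sde1 are the example names from the post, and RUN=echo keeps it a dry run (each command is printed rather than executed) until deliberately cleared on the real machine.

```shell
# Dry-run by default: every command below is printed, not executed.
RUN=echo
MD=/dev/md0          # example array name from the post
PART=/dev/sde1       # example missing member from the post

# 1) hot-add the missing device; this starts a full resync
$RUN mdadm "$MD" -a "$PART"

# 2) after the resync finishes (watch /proc/mdstat), add a write-intent
#    bitmap so any future resync after a reboot completes quickly
$RUN mdadm -Gb internal "$MD"

# 3) rebuild the initrd so it contains whatever driver the device
#    now binds to (Debian/Ubuntu-specific command)
$RUN dpkg-reconfigure linux-image-"$(uname -r)"
```

The ordering matters: the bitmap is added only after the array is fully in sync, so that if the initrd rebuild doesn't fix the boot-time assembly, the next hot-add costs seconds instead of hours.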
RAID5 array showing as degraded after motherboard replacement
Hi there,

I'm running a 5-drive software RAID5 array across two controllers.  The
motherboard in that PC recently died - I sent the board back for RMA.
When I refitted the motherboard, connected up all the drives, and booted
up, I found that the array was being reported as degraded (though all
the data on it is intact).  I have 4 drives on the onboard controller
and 1 drive on an XFX Revo 64 SATA controller card.  The drive which is
being reported as not being in the array is the one connected to the XFX
controller.

The OS can see that drive fine, and "mdadm --examine" on that drive
shows that it is part of the array and that there are 5 active devices
in the array.  Doing "mdadm --examine" on one of the other four drives
shows that the array has 4 active drives and one failed.  "mdadm
--detail" for the array also shows 4 active and one failed.

Now I haven't lost any data here, and I know I can just force a resync
of the array, which is fine.  However I'm concerned about how this has
happened.  One worry is that the XFX SATA controller is doing something
funny to the drive.  I've noticed that its BIOS has defaulted to RAID0
mode (even though there's only one drive on it) - I can't see how this
would cause any particular problems here, though.  I guess it's possible
that some data on the drive got corrupted when the motherboard failed...

Any ideas what could cause mdadm to report as I've described above (I've
attached the output of these three commands)?  I'm running Ubuntu Edgy,
which is a 2.6.17.x kernel, and mdadm 2.4.1.  In case it's relevant
here, I created the array using EVMS...
Thanks,
James

[EMAIL PROTECTED]:~$ sudo mdadm --examine /dev/hde1
Password:
/dev/hde1:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : 33d5338b:d2d6baf0:424498ad:47d05087
  Creation Time : Sun Jan 15 16:47:51 2006
     Raid Level : raid5
    Device Size : 312496128 (298.02 GiB 320.00 GB)
     Array Size : 1249984512 (1192.08 GiB 1279.98 GB)
   Raid Devices : 5
  Total Devices : 5
Preferred Minor : 0

    Update Time : Sat Nov  4 16:29:06 2006
          State : clean
 Active Devices : 5
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 0
       Checksum : d628e17e - correct
         Events : 0.4232131

         Layout : left-asymmetric
     Chunk Size : 256K

      Number   Major   Minor   RaidDevice State
this     4     254        6        4      active sync

   0     0     254        2        0      active sync
   1     1     254        3        1      active sync
   2     2     254        4        2      active sync
   3     3     254        5        3      active sync
   4     4     254        6        4      active sync

[EMAIL PROTECTED]:~$ sudo mdadm --examine /dev/sda1
/dev/sda1:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : 33d5338b:d2d6baf0:424498ad:47d05087
  Creation Time : Sun Jan 15 16:47:51 2006
     Raid Level : raid5
    Device Size : 312496128 (298.02 GiB 320.00 GB)
     Array Size : 1249984512 (1192.08 GiB 1279.98 GB)
   Raid Devices : 5
  Total Devices : 4
Preferred Minor : 0

    Update Time : Sun Nov  5 11:56:29 2006
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 1
  Spare Devices : 0
       Checksum : d629ee25 - correct
         Events : 0.4232204

         Layout : left-asymmetric
     Chunk Size : 256K

      Number   Major   Minor   RaidDevice State
this     1       8        1        1      active sync   /dev/sda1

   0     0       8       17        0      active sync   /dev/sdb1
   1     1       8        1        1      active sync   /dev/sda1
   2     2       8       33        2      active sync   /dev/sdc1
   3     3       8       49        3      active sync   /dev/sdd1
   4     4       0        0        4      faulty removed

[EMAIL PROTECTED]:~$ sudo mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Sun Jan 15 16:47:51 2006
     Raid Level : raid5
     Array Size : 1249984512 (1192.08 GiB 1279.98 GB)
    Device Size : 312496128 (298.02 GiB 320.00 GB)
   Raid Devices : 5
  Total Devices : 4
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Sun Nov  5 11:56:29 2006
          State : clean, degraded
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-asymmetric
     Chunk Size : 256K

           UUID : 33d5338b:d2d6baf0:424498ad:47d05087
         Events : 0.4232204

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8        1        1      active sync   /dev/sda1
       2       8       33        2      active sync   /dev/sdc1
       3       8       49        3      active sync   /dev/sdd1
       4       0        0        4      removed
[EMAIL PROTECTED]:~$
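The two Events counters in the outputs above explain the "kicking non-fresh hde1" message seen earlier in the thread: hde1's superblock stopped being updated 73 events before the other members', so md treats it as stale and refuses to assemble it in.  A small illustration of that comparison (simplified; the kernel compares the raw event counters in the 0.90 superblocks):

```shell
# Event counts copied from the two --examine outputs above.
EV_HDE1=4232131      # from /dev/hde1 (last Update Time: Sat Nov 4)
EV_SDA1=4232204      # from /dev/sda1 (last Update Time: Sun Nov 5)

if [ "$EV_HDE1" -lt "$EV_SDA1" ]; then
    echo "hde1 is $(( EV_SDA1 - EV_HDE1 )) events behind: md kicks it as non-fresh"
fi
```

Note also the Major column in the hde1 table: 254 is the device-mapper major, which supports dean's guess that the array was originally assembled through EVMS-mapped device names rather than the raw disk devices.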