Re: RAID5 array showing as degraded after motherboard replacement

2006-11-08 Thread Bill Davidsen

James Lee wrote:


Hi there,

I'm running a 5-drive software RAID5 array across two controllers.
The motherboard in that PC recently died - I sent the board back for
RMA.  When I refitted the motherboard, connected up all the drives,
and booted up I found that the array was being reported as degraded
(though all the data on it is intact).  I have 4 drives on the
onboard controller and 1 drive on an XFX Revo 64 SATA controller card.
The drive which is being reported as not being in the array is the one
connected to the XFX controller.

The OS can see that drive fine, and "mdadm --examine" on that drive
shows that it is part of the array and that there are 5 active devices
in the array.  Doing "mdadm --examine" on one of the other four drives
shows that the array has 4 active drives and one failed.  "mdadm
--detail" for the array also shows 4 active and one failed.

Now I haven't lost any data here and I know I can just force a resync
of the array which is fine.  However I'm concerned about how this has
happened.  One worry is that the XFX SATA controller is doing
something funny to the drive.  I've noticed that its BIOS has
defaulted to RAID0 mode (even though there's only one drive on it) - I
can't see how this would cause any particular problems here though.  I
guess it's possible that some data on the drive got corrupted when the
motherboard failed... 


I notice in your later post that the driver thinks this is a JBOD setup, 
can you either tell the controller to JBOD or force the driver to 
consider this a RAID0 single disk setup? I don't know what RAID0 on one 
drive means, but I suspect that having the controller in the mode you 
want is desirable. That might have been changed in the hardware failure.


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979


-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID5 array showing as degraded after motherboard replacement

2006-11-07 Thread dean gaudet
On Wed, 8 Nov 2006, James Lee wrote:

> > However I'm still seeing the error messages in my dmesg (the ones I
> > posted earlier), and they suggest that there is some kind of hardware
> > fault (based on a quick Google of the error codes).  So I'm a little
> > confused.

the fact that the error is in a geometry command really makes me wonder...

did you compare the number of blocks on the device vs. what seems to be 
available when it's on the weird raid card?
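A sketch of that comparison (not from the original thread; `blockdev` from util-linux and the 512-byte sector size are my assumptions, and `/dev/hde` is the device name used elsewhere in the thread):

```shell
# Sketch: compare the size the kernel sees now against what dmesg reported
# when the disk was behaving (625134827 sectors, from the thread's dmesg).
EXPECTED_SECTORS=625134827

ACTUAL_BYTES=$(blockdev --getsize64 /dev/hde)   # device size in bytes
ACTUAL_SECTORS=$((ACTUAL_BYTES / 512))

echo "expected: $EXPECTED_SECTORS sectors ($((EXPECTED_SECTORS * 512)) bytes)"
echo "actual:   $ACTUAL_SECTORS sectors"

# If the RAID card reserves sectors for its own metadata, the count
# comes up short by exactly that reservation.
if [ "$ACTUAL_SECTORS" -lt "$EXPECTED_SECTORS" ]; then
    echo "short by $((EXPECTED_SECTORS - ACTUAL_SECTORS)) sectors"
fi
```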

-dean


Re: RAID5 array showing as degraded after motherboard replacement

2006-11-07 Thread James Lee

On 08/11/06, James Lee <[EMAIL PROTECTED]> wrote:

On 07/11/06, James Lee <[EMAIL PROTECTED]> wrote:
> On 06/11/06, dean gaudet <[EMAIL PROTECTED]> wrote:
> >
> >
> > On Mon, 6 Nov 2006, James Lee wrote:
> >
> > > Thanks for the reply Dean.  I looked through dmesg output from the
> > > boot up, to check whether this was just an ordering issue during the
> > > system start up (since both evms and mdadm attempt to activate the
> > > array, which could cause things to go wrong...).
> > >
> > > Looking through the dmesg output though, it looks like the 'missing'
> > > disk is being detected before the array is assembled, but that the
> > > disk is throwing up errors.  I've attached the full output of dmesg;
> > > grepping it for "hde" gives the following:
> > >
> > > [17179574.084000] ide2: BM-DMA at 0xd400-0xd407, BIOS settings:
> > > hde:DMA, hdf:DMA
> > > [17179574.38] hde: NetCell SyncRAID(TM) SR5000 JBOD, ATA DISK drive
> > > [17179575.312000] hde: max request size: 512KiB
> > > [17179575.312000] hde: 625134827 sectors (320069 MB), CHS=38912/255/63, (U)DMA
> > > [17179575.312000] hde: set_geometry_intr: status=0x51 { DriveReady
> > > SeekComplete Error }
> > > [17179575.312000] hde: set_geometry_intr: error=0x04 { DriveStatusError }
> > > [17179575.312000] hde: cache flushes supported
> >
> > is it possible that the "NetCell SyncRAID" implementation is stealing some
> > of the sectors (even though it's marked JBOD)?  anyhow it could be the
> > disk is bad, but i'd still be tempted to see if the problem stays with the
> > controller if you swap the disk with another in the array.
> >
> > -dean
> >
>
> Looks like you might be right.  I removed one of the other drives from
> the onboard controller, and moved the 'faulty' drive from the NetCell
> controller to the onboard one.  Booted up the machine, and the
> drive is still not added to the array correctly (so the array now
> fails to assemble, as there's only 3 out of 5 drives).  I've run the
> Seagate diagnostics tools over the drive and they report successful
> when it's connected to the onboard controller and unsuccessful when
> it's connected to the NetCell controller (this may be a test tool
> issue though).
>
> I guess this indicates that either:
> 1) The NetCell controller is faulty and just not reading/writing data 
properly.
> 2) The NetCell controller's RAID implementation has somehow not been
> transparent to the OS and has overwritten/modified md's superblocks.
> 3) EVMS somehow messed the config up on that drive when trying to
> reassemble the array after the first time the controller came up.
>
> I'll test for 1) by attaching another drive (not one of the ones in
> the array!) to the NetCell controller and seeing if it passes
> diagnostics tests.  3) seems pretty unlikely.
>
> I bought the NetCell card mainly for its Linux compatibility - do they
> have known issues with mdadm?
>
> Thanks,
> James
>

Well I'm still a little unsure what might have happened here.  I've
reconnected the 'bad' drive to the NetCell controller, and run
badblocks over that device.  It isn't reporting any bad blocks at all,
which I guess pretty much indicates that neither the hard drive nor
the controller is faulty, right?
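For reference, a read-only scan of the kind described might look like this (a sketch; the 4 KiB block size and the `blockdev` size lookup are my choices, not from the thread):

```shell
# Sketch: read-only surface scan with badblocks (e2fsprogs).  Without the
# -w flag nothing is written, so it is safe on a drive holding array data.
# /dev/hde is the device name from the thread.
BYTES=$(blockdev --getsize64 /dev/hde)   # device size in bytes (util-linux)
LAST_BLOCK=$(( BYTES / 4096 - 1 ))       # last 4 KiB block number

# -b 4096: scan in 4 KiB blocks; -s: show progress; -v: verbose summary.
badblocks -b 4096 -sv /dev/hde "$LAST_BLOCK"
```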

However I'm still seeing the error messages in my dmesg (the ones I
posted earlier), and they suggest that there is some kind of hardware
fault (based on a quick Google of the error codes).  So I'm a little
confused.

If the hard drive and controller are not faulty, then how can I go
about figuring out whether the drive got messed up by the controller
going and overwriting some data due to its internal RAIDing (which
would seem unlikely - I'd assume this would have been reported and
fixed as it would not just be a Linux problem)?  I guess the other
possibility is that in the process of the motherboard dying, some data
on the drive corrupted - does this seem at all plausible?

Basically I'm just not sure how to move forward in a way that I can
feel confident that this won't happen again (possibly in a more
serious way that means losing all the data on the array).  Would
dumping the sectors at the start of the drive help at all to figure
out what's going on?
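Dumping those sectors is straightforward, though note that with 0.90-format metadata the md superblock sits near the *end* of the member partition, not the start. A sketch (the offset formula follows the 0.90 layout, rounding the partition's sector count down to a 128-sector boundary and backing off 128 sectors; device names are the thread's):

```shell
# Sketch: inspect the start of the member partition and the md 0.90
# superblock near its end.  /dev/hde1 is the partition from the thread.

# First 32 KiB of the partition:
dd if=/dev/hde1 bs=512 count=64 2>/dev/null | hexdump -C | head -n 40

# The 0.90 superblock lives 128 sectors (64 KiB) before the end of the
# partition, rounded down to a 128-sector boundary:
SECTORS=$(blockdev --getsz /dev/hde1)        # partition size in 512-byte sectors
SB_SECTOR=$(( (SECTORS & ~127) - 128 ))
dd if=/dev/hde1 bs=512 skip="$SB_SECTOR" count=8 2>/dev/null | hexdump -C | head
# A valid superblock begins with the magic a92b4efc
# (on disk, little-endian: fc 4e 2b a9).
```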



[Sorry for the double mail - forgot to CC the list]


Re: RAID5 array showing as degraded after motherboard replacement

2006-11-06 Thread James Lee

On 06/11/06, dean gaudet <[EMAIL PROTECTED]> wrote:



On Mon, 6 Nov 2006, James Lee wrote:

> Thanks for the reply Dean.  I looked through dmesg output from the
> boot up, to check whether this was just an ordering issue during the
> system start up (since both evms and mdadm attempt to activate the
> array, which could cause things to go wrong...).
>
> Looking through the dmesg output though, it looks like the 'missing'
> disk is being detected before the array is assembled, but that the
> disk is throwing up errors.  I've attached the full output of dmesg;
> grepping it for "hde" gives the following:
>
> [17179574.084000] ide2: BM-DMA at 0xd400-0xd407, BIOS settings:
> hde:DMA, hdf:DMA
> [17179574.38] hde: NetCell SyncRAID(TM) SR5000 JBOD, ATA DISK drive
> [17179575.312000] hde: max request size: 512KiB
> [17179575.312000] hde: 625134827 sectors (320069 MB), CHS=38912/255/63, (U)DMA
> [17179575.312000] hde: set_geometry_intr: status=0x51 { DriveReady
> SeekComplete Error }
> [17179575.312000] hde: set_geometry_intr: error=0x04 { DriveStatusError }
> [17179575.312000] hde: cache flushes supported

is it possible that the "NetCell SyncRAID" implementation is stealing some
of the sectors (even though it's marked JBOD)?  anyhow it could be the
disk is bad, but i'd still be tempted to see if the problem stays with the
controller if you swap the disk with another in the array.

-dean



Looks like you might be right.  I removed one of the other drives from
the onboard controller, and moved the 'faulty' drive from the NetCell
controller to the onboard one.  Booted up the machine, and the
drive is still not added to the array correctly (so the array now
fails to assemble, as there's only 3 out of 5 drives).  I've run the
Seagate diagnostics tools over the drive and they report successful
when it's connected to the onboard controller and unsuccessful when
it's connected to the NetCell controller (this may be a test tool
issue though).

I guess this indicates that either:
1) The NetCell controller is faulty and just not reading/writing data properly.
2) The NetCell controller's RAID implementation has somehow not been
transparent to the OS and has overwritten/modified md's superblocks.
3) EVMS somehow messed the config up on that drive when trying to
reassemble the array after the first time the controller came up.

I'll test for 1) by attaching another drive (not one of the ones in
the array!) to the NetCell controller and seeing if it passes
diagnostics tests.  3) seems pretty unlikely.

I bought the NetCell card mainly for its Linux compatibility - do they
have known issues with mdadm?

Thanks,
James


Re: RAID5 array showing as degraded after motherboard replacement

2006-11-06 Thread dean gaudet


On Mon, 6 Nov 2006, James Lee wrote:

> Thanks for the reply Dean.  I looked through dmesg output from the
> boot up, to check whether this was just an ordering issue during the
> system start up (since both evms and mdadm attempt to activate the
> array, which could cause things to go wrong...).
> 
> Looking through the dmesg output though, it looks like the 'missing'
> disk is being detected before the array is assembled, but that the
> disk is throwing up errors.  I've attached the full output of dmesg;
> grepping it for "hde" gives the following:
> 
> [17179574.084000] ide2: BM-DMA at 0xd400-0xd407, BIOS settings:
> hde:DMA, hdf:DMA
> [17179574.38] hde: NetCell SyncRAID(TM) SR5000 JBOD, ATA DISK drive
> [17179575.312000] hde: max request size: 512KiB
> [17179575.312000] hde: 625134827 sectors (320069 MB), CHS=38912/255/63, (U)DMA
> [17179575.312000] hde: set_geometry_intr: status=0x51 { DriveReady
> SeekComplete Error }
> [17179575.312000] hde: set_geometry_intr: error=0x04 { DriveStatusError }
> [17179575.312000] hde: cache flushes supported

is it possible that the "NetCell SyncRAID" implementation is stealing some 
of the sectors (even though it's marked JBOD)?  anyhow it could be the 
disk is bad, but i'd still be tempted to see if the problem stays with the 
controller if you swap the disk with another in the array.

-dean


Re: RAID5 array showing as degraded after motherboard replacement

2006-11-05 Thread James Lee

Thanks for the reply Dean.  I looked through dmesg output from the
boot up, to check whether this was just an ordering issue during the
system start up (since both evms and mdadm attempt to activate the
array, which could cause things to go wrong...).

Looking through the dmesg output though, it looks like the 'missing'
disk is being detected before the array is assembled, but that the
disk is throwing up errors.  I've attached the full output of dmesg;
grepping it for "hde" gives the following:

[17179574.084000] ide2: BM-DMA at 0xd400-0xd407, BIOS settings:
hde:DMA, hdf:DMA
[17179574.38] hde: NetCell SyncRAID(TM) SR5000 JBOD, ATA DISK drive
[17179575.312000] hde: max request size: 512KiB
[17179575.312000] hde: 625134827 sectors (320069 MB), CHS=38912/255/63, (U)DMA
[17179575.312000] hde: set_geometry_intr: status=0x51 { DriveReady
SeekComplete Error }
[17179575.312000] hde: set_geometry_intr: error=0x04 { DriveStatusError }
[17179575.312000] hde: cache flushes supported
[17179575.312000]  hde: hde1
[17179967.224000] md: bind<hde1>
[17179967.224000] md: kicking non-fresh hde1 from array!
[17179967.224000] md: unbind<hde1>
[17179967.224000] md: export_rdev(hde1)

Am I right in thinking that this looks like the drive is just bad (the
two set_geometry_intr errors, and the fact it gets kicked out of the
array by md)?  I'll run it through the Seagate diagnostics suite tomorrow to
see whether it's faulty or not...

James

On 05/11/06, dean gaudet <[EMAIL PROTECTED]> wrote:

On Sun, 5 Nov 2006, James Lee wrote:

> Hi there,
>
> I'm running a 5-drive software RAID5 array across two controllers.
> The motherboard in that PC recently died - I sent the board back for
> RMA.  When I refitted the motherboard, connected up all the drives,
> and booted up I found that the array was being reported as degraded
> (though all the data on it is intact).  I have 4 drives on the
> onboard controller and 1 drive on an XFX Revo 64 SATA controller card.
> The drive which is being reported as not being in the array is the one
> connected to the XFX controller.
>
> The OS can see that drive fine, and "mdadm --examine" on that drive
> shows that it is part of the array and that there are 5 active devices
> in the array.  Doing "mdadm --examine" on one of the other four drives
> shows that the array has 4 active drives and one failed.  "mdadm
> --detail" for the array also shows 4 active and one failed.

that means the array was assembled without the 5th disk and is currently
degraded.


> Now I haven't lost any data here and I know I can just force a resync
> of the array which is fine.  However I'm concerned about how this has
> happened.  One worry is that the XFX SATA controller is doing
something funny to the drive.  I've noticed that its BIOS has
> defaulted to RAID0 mode (even though there's only one drive on it) - I
> can't see how this would cause any particular problems here though.  I
> guess it's possible that some data on the drive got corrupted when the
> motherboard failed...

no it's more likely the devices were renamed or the 5th device didn't come
up before the array was assembled.

it's possible that a different bios setting led to the device using a
different driver than is in your initrd... but i'm just guessing.

> Any ideas what could cause mdadm to report as I've described above
> (I've attached the output of these three commands)?  I'm running
> Ubuntu Edgy, which runs a 2.6.17.x kernel, and mdadm 2.4.1.  In case it's
> relevant here, I created the array using EVMS...

i've never created an array with evms... but my guess is that it may have
used "mapped" device names instead of the normal device names.  take a
look at /proc/mdstat and see what devices are in the array and use those
as a template to find the name of the missing device.  below i'll use
/dev/sde1 as the example missing device and /dev/md0 as the example array.

first thing i'd try is something like this:

mdadm /dev/md0 -a /dev/sde1

which hotadds the device into the array... which will start a resync.

when the resync is done (cat /proc/mdstat) do this.

mdadm -Gb internal /dev/md0

which will add write-intent bitmaps to your device... which will avoid
another long wait for a resync after the next reboot if the fix below
doesn't help.

then do this:

dpkg-reconfigure linux-image-`uname -r`

which will rebuild the initrd for your kernel ... and if it was a driver
change this should include the new driver into the initrd.

then reboot and see if it comes up fine.  if it doesn't, you can repeat
the "-a /dev/sde1" command above... the resync will be quick this time due
to the bitmap... and we'll have to investigate further.

-dean

[17179569.184000] Linux version 2.6.17-10-generic ([EMAIL PROTECTED]) (gcc version 4.1.2 20060928 (prerelease) (Ubuntu 4.1.

Re: RAID5 array showing as degraded after motherboard replacement

2006-11-05 Thread dean gaudet
On Sun, 5 Nov 2006, James Lee wrote:

> Hi there,
> 
> I'm running a 5-drive software RAID5 array across two controllers.
> The motherboard in that PC recently died - I sent the board back for
> RMA.  When I refitted the motherboard, connected up all the drives,
> and booted up I found that the array was being reported as degraded
> (though all the data on it is intact).  I have 4 drives on the
> onboard controller and 1 drive on an XFX Revo 64 SATA controller card.
> The drive which is being reported as not being in the array is the one
> connected to the XFX controller.
> 
> The OS can see that drive fine, and "mdadm --examine" on that drive
> shows that it is part of the array and that there are 5 active devices
> in the array.  Doing "mdadm --examine" on one of the other four drives
> shows that the array has 4 active drives and one failed.  "mdadm
> --detail" for the array also shows 4 active and one failed.

that means the array was assembled without the 5th disk and is currently 
degraded.


> Now I haven't lost any data here and I know I can just force a resync
> of the array which is fine.  However I'm concerned about how this has
> happened.  One worry is that the XFX SATA controller is doing
> something funny to the drive.  I've noticed that its BIOS has
> defaulted to RAID0 mode (even though there's only one drive on it) - I
> can't see how this would cause any particular problems here though.  I
> guess it's possible that some data on the drive got corrupted when the
> motherboard failed...

no it's more likely the devices were renamed or the 5th device didn't come 
up before the array was assembled.

it's possible that a different bios setting led to the device using a 
different driver than is in your initrd... but i'm just guessing.

> Any ideas what could cause mdadm to report as I've described above
> (I've attached the output of these three commands)?  I'm running
> Ubuntu Edgy, which runs a 2.6.17.x kernel, and mdadm 2.4.1.  In case it's
> relevant here, I created the array using EVMS...

i've never created an array with evms... but my guess is that it may have 
used "mapped" device names instead of the normal device names.  take a 
look at /proc/mdstat and see what devices are in the array and use those 
as a template to find the name of the missing device.  below i'll use 
/dev/sde1 as the example missing device and /dev/md0 as the example array.

first thing i'd try is something like this:

mdadm /dev/md0 -a /dev/sde1

which hotadds the device into the array... which will start a resync.

when the resync is done (cat /proc/mdstat) do this.

mdadm -Gb internal /dev/md0

which will add write-intent bitmaps to your device... which will avoid 
another long wait for a resync after the next reboot if the fix below 
doesn't help.

then do this:

dpkg-reconfigure linux-image-`uname -r`

which will rebuild the initrd for your kernel ... and if it was a driver 
change this should include the new driver into the initrd.

then reboot and see if it comes up fine.  if it doesn't, you can repeat 
the "-a /dev/sde1" command above... the resync will be quick this time due 
to the bitmap... and we'll have to investigate further.

-dean


RAID5 array showing as degraded after motherboard replacement

2006-11-05 Thread James Lee

Hi there,

I'm running a 5-drive software RAID5 array across two controllers.
The motherboard in that PC recently died - I sent the board back for
RMA.  When I refitted the motherboard, connected up all the drives,
and booted up I found that the array was being reported as degraded
(though all the data on it is intact).  I have 4 drives on the
onboard controller and 1 drive on an XFX Revo 64 SATA controller card.
The drive which is being reported as not being in the array is the one
connected to the XFX controller.

The OS can see that drive fine, and "mdadm --examine" on that drive
shows that it is part of the array and that there are 5 active devices
in the array.  Doing "mdadm --examine" on one of the other four drives
shows that the array has 4 active drives and one failed.  "mdadm
--detail" for the array also shows 4 active and one failed.

Now I haven't lost any data here and I know I can just force a resync
of the array which is fine.  However I'm concerned about how this has
happened.  One worry is that the XFX SATA controller is doing
something funny to the drive.  I've noticed that its BIOS has
defaulted to RAID0 mode (even though there's only one drive on it) - I
can't see how this would cause any particular problems here though.  I
guess it's possible that some data on the drive got corrupted when the
motherboard failed...

Any ideas what could cause mdadm to report as I've described above
(I've attached the output of these three commands)?  I'm running
Ubuntu Edgy, which runs a 2.6.17.x kernel, and mdadm 2.4.1.  In case it's
relevant here, I created the array using EVMS...

Thanks,
James
[EMAIL PROTECTED]:~$ sudo mdadm --examine /dev/hde1
Password:
/dev/hde1:
  Magic : a92b4efc
Version : 00.90.00
   UUID : 33d5338b:d2d6baf0:424498ad:47d05087
  Creation Time : Sun Jan 15 16:47:51 2006
 Raid Level : raid5
Device Size : 312496128 (298.02 GiB 320.00 GB)
 Array Size : 1249984512 (1192.08 GiB 1279.98 GB)
   Raid Devices : 5
  Total Devices : 5
Preferred Minor : 0

Update Time : Sat Nov  4 16:29:06 2006
  State : clean
 Active Devices : 5
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 0
   Checksum : d628e17e - correct
 Events : 0.4232131

 Layout : left-asymmetric
 Chunk Size : 256K

      Number   Major   Minor   RaidDevice State
this     4     254        6        4      active sync

   0     0     254        2        0      active sync
   1     1     254        3        1      active sync
   2     2     254        4        2      active sync
   3     3     254        5        3      active sync
   4     4     254        6        4      active sync
[EMAIL PROTECTED]:~$ 
[EMAIL PROTECTED]:~$ 
[EMAIL PROTECTED]:~$ 
[EMAIL PROTECTED]:~$ sudo mdadm --examine /dev/sda1
/dev/sda1:
  Magic : a92b4efc
Version : 00.90.00
   UUID : 33d5338b:d2d6baf0:424498ad:47d05087
  Creation Time : Sun Jan 15 16:47:51 2006
 Raid Level : raid5
Device Size : 312496128 (298.02 GiB 320.00 GB)
 Array Size : 1249984512 (1192.08 GiB 1279.98 GB)
   Raid Devices : 5
  Total Devices : 4
Preferred Minor : 0

Update Time : Sun Nov  5 11:56:29 2006
  State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 1
  Spare Devices : 0
   Checksum : d629ee25 - correct
 Events : 0.4232204

 Layout : left-asymmetric
 Chunk Size : 256K

      Number   Major   Minor   RaidDevice State
this     1       8        1        1      active sync   /dev/sda1

   0     0       8       17        0      active sync   /dev/sdb1
   1     1       8        1        1      active sync   /dev/sda1
   2     2       8       33        2      active sync   /dev/sdc1
   3     3       8       49        3      active sync   /dev/sdd1
   4     4       0        0        4      faulty removed
[EMAIL PROTECTED]:~$ 
[EMAIL PROTECTED]:~$ 
[EMAIL PROTECTED]:~$ 
[EMAIL PROTECTED]:~$ sudo mdadm --detail /dev/md0 
/dev/md0:
Version : 00.90.03
  Creation Time : Sun Jan 15 16:47:51 2006
 Raid Level : raid5
 Array Size : 1249984512 (1192.08 GiB 1279.98 GB)
Device Size : 312496128 (298.02 GiB 320.00 GB)
   Raid Devices : 5
  Total Devices : 4
Preferred Minor : 0
Persistence : Superblock is persistent

Update Time : Sun Nov  5 11:56:29 2006
  State : clean, degraded
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

 Layout : left-asymmetric
 Chunk Size : 256K

   UUID : 33d5338b:d2d6baf0:424498ad:47d05087
 Events : 0.4232204

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8        1        1      active sync   /dev/sda1
       2       8       33        2      active sync   /dev/sdc1
       3       8       49        3      active sync   /dev/sdd1
       4       0        0        4      removed
[EMAIL PROTECTED]:~$