Re: degraded raid5 refuses to start

2006-07-01 Thread Jason Lunz
[EMAIL PROTECTED] said:
> How do I get this array going again?  Am I doing something wrong?
> Reading the list archives indicates that there could be bugs in this
> area, or that I may need to recreate the array with -C (though that
> seems heavy-handed to me).

This is what I ended up doing. I made backups of the three
superblocks, then recreated the array with:

# mdadm -C /dev/md2 -n4 -l5 /dev/sda3 missing /dev/hda1 /dev/hdc1

(I knew the chunk size and layout would be the same, since I just use
the defaults).
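
For the backups I just dd'd out the superblock area of each member.
Roughly what I ran (a sketch from memory, not verbatim; the
64K-from-the-end offset is how I understand the 0.90 format, and the
filenames are just examples):

DEV=/dev/sda3
SECTORS=$(blockdev --getsz $DEV)        # device size in 512-byte sectors
SB=$(( (SECTORS & ~127) - 128 ))        # 0.90 sb: last 64K-aligned 64K
dd if=$DEV of=sb-sda3.img bs=512 skip=$SB count=128

With the array running again, adding the replacement disk should be
the usual "mdadm /dev/md2 -a /dev/sdb3" followed by a resync.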

After this, the array works again. I have before and after images of the
three superblocks if anyone wants to look into how they got into this
state.
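
(To see what changed, a hex diff of a before/after pair is enough;
the names here are just examples:

diff <(xxd sb-sda3.before.img) <(xxd sb-sda3.after.img)

)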

As far as I can see, the problem was that the broken array got into a
state where the superblock counts were like this:

   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 2

    Update Time : Mon Jun 26 22:51:12 2006
          State : active
 Active Devices : 3
Working Devices : 3
 Failed Devices : 2
  Spare Devices : 0

Notice that Working (3) + Failed (2) = 5, which exceeds the four disks
actually in the array. Maybe there's a bug to be fixed here that lets
these counters get out of whack somehow?

After reconstructing the array, the Failed count went back down to 1,
and everything started working normally again. I wonder whether simply
decrementing that one value in each superblock would have been enough
to get the array going again, rather than rewriting all the
superblocks. If so, maybe that could safely be built into mdadm?
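
Related: if I'm reading the mdadm man page right, recent versions
already have an --update=summaries assembly mode that recomputes
exactly these counts. I don't know whether it would have coped with
this particular breakage, but something like

mdadm -A --update=summaries /dev/md2 /dev/sda3 /dev/hda1 /dev/hdc1

might be worth trying before resorting to -C.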

Either that, or the trigger was having two disks marked "State :
active" while the third was marked "clean" in the degraded array.

Anyway, I have a dead disk but kept all my data, so thanks.

Jason


degraded raid5 refuses to start

2006-07-01 Thread Jason Lunz
I have a 4-disk raid5 (sda3, sdb3, hda1, hdc1). sda and sdb share a
Silicon Image SATA card. sdb died completely; then, 20 minutes later,
the sata_sil driver became fatally confused and the machine locked up.
I shut down the machine and waited until I had a replacement for sdb.

I've got a replacement for sdb now, but I can't get the array to start
so that I can add it and resync. When I try to assemble the degraded
array, I get this:

[EMAIL PROTECTED]:~# mdadm -Af /dev/md2 /dev/sda3 /dev/hda1 /dev/hdc1
mdadm: failed to RUN_ARRAY /dev/md2: Input/output error

[EMAIL PROTECTED]:~# dmesg | tail -n 15
md: bind<sda3>
md: bind<hda1>
md: bind<hdc1>
md: md2: raid array is not clean -- starting background reconstruction
raid5: device sda3 operational as raid disk 0
raid5: device hdc1 operational as raid disk 3
raid5: device hda1 operational as raid disk 2
raid5: cannot start dirty degraded array for md2
RAID5 conf printout:
 --- rd:4 wd:3 fd:1
 disk 0, o:1, dev:sda3
 disk 2, o:1, dev:hda1
 disk 3, o:1, dev:hdc1
raid5: failed to run raid set md2
md: pers->run() failed ...

How do I convince the array to start? I can add the new disk to the
array, but it simply becomes a spare and the raid5 remains inactive.
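
The "cannot start dirty degraded array" line looks like a deliberate
kernel safety check rather than corruption. If I understand the md
docs right, kernels from around 2.6.14 expose an override; I haven't
tried this here, so treat it as a sketch:

echo 1 > /sys/module/md_mod/parameters/start_dirty_degraded
mdadm -Af /dev/md2 /dev/sda3 /dev/hda1 /dev/hdc1

The same thing is apparently available at boot time as
md-mod.start_dirty_degraded=1.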

The superblock on one of the three drives is a little different from
the other two:

[EMAIL PROTECTED]:~# mdadm -E /dev/hda1 > sb-hda1
[EMAIL PROTECTED]:~# mdadm -E /dev/hdc1 > sb-hdc1
[EMAIL PROTECTED]:~# mdadm -E /dev/sda3 > sb-sda3
[EMAIL PROTECTED]:~# diff -u sb-hda1 sb-hdc1
--- sb-hda1     2006-07-01 17:17:36.0 -0400
+++ sb-hdc1     2006-07-01 17:17:41.0 -0400
@@ -1,4 +1,4 @@
-/dev/hda1:
+/dev/hdc1:
           Magic : a92b4efc
         Version : 00.90.00
            UUID : 6b8b4567:327b23c6:643c9869:66334873
@@ -16,14 +16,14 @@
 Working Devices : 3
  Failed Devices : 2
   Spare Devices : 0
-       Checksum : a2163da6 - correct
+       Checksum : a2163dbb - correct
          Events : 0.47575379
 
          Layout : left-symmetric
      Chunk Size : 64K
 
       Number   Major   Minor   RaidDevice State
-this     2       3        1        2      active sync   /dev/hda1
+this     3      22        1        3      active sync   /dev/hdc1
 
    0     0       8        3        0      active sync   /dev/sda3
    1     1       0        0        1      faulty removed
[EMAIL PROTECTED]:~# diff -u sb-hda1 sb-sda3
--- sb-hda1     2006-07-01 17:17:36.0 -0400
+++ sb-sda3     2006-07-01 17:17:43.0 -0400
@@ -1,4 +1,4 @@
-/dev/hda1:
+/dev/sda3:
           Magic : a92b4efc
         Version : 00.90.00
            UUID : 6b8b4567:327b23c6:643c9869:66334873
@@ -10,22 +10,22 @@
   Total Devices : 4
 Preferred Minor : 2
 
-    Update Time : Mon Jun 26 22:51:12 2006
-          State : active
+    Update Time : Mon Jun 26 22:51:06 2006
+          State : clean
  Active Devices : 3
 Working Devices : 3
  Failed Devices : 2
   Spare Devices : 0
-       Checksum : a2163da6 - correct
-         Events : 0.47575379
+       Checksum : a4ec2eec - correct
+         Events : 0.47575378
 
          Layout : left-symmetric
      Chunk Size : 64K
 
       Number   Major   Minor   RaidDevice State
-this     2       3        1        2      active sync   /dev/hda1
+this     0       8        3        0      active sync   /dev/sda3
 
    0     0       8        3        0      active sync   /dev/sda3
-   1     1       0        0        1      faulty removed
+   1     1       0        0        1      spare
    2     2       3        1        2      active sync   /dev/hda1
    3     3      22        1        3      active sync   /dev/hdc1

How do I get this array going again?  Am I doing something wrong?
Reading the list archives indicates that there could be bugs in this
area, or that I may need to recreate the array with -C (though that
seems heavy-handed to me).

thanks,

Jason
