Re: Strange intermittent errors + RAID doesn't fail the disk.

2006-07-01 Thread Christian Pernegger

> Looks very much like a problem with the SATA controller.
> If the repeating loop you have shown there is an infinite loop, then
> presumably some failure is not being handled properly.


I agree, even though the AHCI driver was supposed to be stable. The
loop is not quite infinite, by the way; it does time out after a few minutes.


> I suggest you find a SATA-related mailing list to post this to (look
> in the MAINTAINERS file maybe) or post it to linux-kernel.


Will do. linux-ide should do the trick.


> I doubt this is directly related to the raid code at all.


The only problem I see with the RAID code is that it does not fail the
disk when it hangs in this way. How is this possible? The libata
driver shows lots of errors, and even if md does not react to those,
there should be a (short) timeout for the request somewhere.

The disks do have time-limited built-in error recovery, because RAID
controllers like to handle that themselves - could that have something
to do with it? The marketing blurb:

[...] Inside a RAID system, where the RAID controller handles error
recovery, the drive needn't pause for extended periods to recover
data. In fact, heroic error recovery attempts can cause a RAID system
to drop a drive out of the array. WD RE2 is engineered to prevent hard
drive error recovery fallout by limiting the drive's error recovery
time. With error recovery factory set to seven seconds, the drive has
time to attempt a recovery, allow the RAID controller to log the
error, and still stay online. [...]
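
(As an aside: on drives that expose SCT Error Recovery Control, that
recovery time limit can be read and adjusted from Linux with smartmontools.
A sketch, assuming a smartctl version with scterc support and a drive that
implements the feature - the device name is just an example:

smartctl -l scterc /dev/sda          # show the current read/write ERC timeouts
smartctl -l scterc,70,70 /dev/sda    # set both to 7.0 seconds (units of 100 ms)

Whether the RE2's factory seven-second setting is adjustable this way is
not something the blurb says.)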


> Good luck :-)


Thanks :)

C.


Re: raid issues after power failure

2006-07-01 Thread Ákos Maróy
Neil Brown wrote:
> Try adding '--force' to the -A line.
> That tells mdadm to try really hard to assemble the array.

thanks, this seems to have solved the issue...
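
(For the archives, the forced assembly takes roughly the form below; the
array and member names are made up, so substitute the real devices or use
--scan with a config file:

# mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1

With --force, mdadm is willing to update the event count on a slightly
stale member so that the array can be started; an out-of-date member is
then brought back in line by the usual resync.)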


Akos



degraded raid5 refuses to start

2006-07-01 Thread Jason Lunz
I have a 4-disk raid5 (sda3, sdb3, hda1, hdc1). sda and sdb share a
Silicon Image SATA card.  sdb died completely, then 20 minutes later,
the sata_sil driver became fatally confused and the machine locked up.
I shut down the machine and waited until I had a replacement for sdb.

I've got a replacement for sdb now, but I can't get the array to start
so that I can add it and resync. When I try to assemble the degraded
array, I get this:

[EMAIL PROTECTED]:~# mdadm -Af /dev/md2 /dev/sda3 /dev/hda1 /dev/hdc1
mdadm: failed to RUN_ARRAY /dev/md2: Input/output error

[EMAIL PROTECTED]:~# dmesg | tail -n 15
md: bind<hda1>
md: bind<hdc1>
md: bind<sda3>
md: md2: raid array is not clean -- starting background reconstruction
raid5: device sda3 operational as raid disk 0
raid5: device hdc1 operational as raid disk 3
raid5: device hda1 operational as raid disk 2
raid5: cannot start dirty degraded array for md2
RAID5 conf printout:
 --- rd:4 wd:3 fd:1
 disk 0, o:1, dev:sda3
 disk 2, o:1, dev:hda1
 disk 3, o:1, dev:hdc1
raid5: failed to run raid set md2
md: pers->run() failed ...

How do I convince the array to start? I can add the new disk to the
array, but it simply becomes a spare and the raid5 remains inactive.
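
(For reference, the "cannot start dirty degraded array" check seen in the
dmesg output above can also be relaxed through an md module parameter,
assuming a kernel new enough to have it; whether that would have helped
here is unclear:

echo 1 > /sys/module/md_mod/parameters/start_dirty_degraded

or boot with md-mod.start_dirty_degraded=1 on the kernel command line.)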

The superblock on one of the three drives is a little different from the
other two:

[EMAIL PROTECTED]:~# mdadm -E /dev/hda1 > sb-hda1
[EMAIL PROTECTED]:~# mdadm -E /dev/hdc1 > sb-hdc1
[EMAIL PROTECTED]:~# mdadm -E /dev/sda3 > sb-sda3
[EMAIL PROTECTED]:~# diff -u sb-hda1 sb-hdc1
--- sb-hda1 2006-07-01 17:17:36.0 -0400
+++ sb-hdc1 2006-07-01 17:17:41.0 -0400
@@ -1,4 +1,4 @@
-/dev/hda1:
+/dev/hdc1:
   Magic : a92b4efc
 Version : 00.90.00
UUID : 6b8b4567:327b23c6:643c9869:66334873
@@ -16,14 +16,14 @@
 Working Devices : 3
  Failed Devices : 2
   Spare Devices : 0
-   Checksum : a2163da6 - correct
+   Checksum : a2163dbb - correct
  Events : 0.47575379

  Layout : left-symmetric
  Chunk Size : 64K

    Number   Major   Minor   RaidDevice State
-this     2       3        1        2      active sync   /dev/hda1
+this     3      22        1        3      active sync   /dev/hdc1

    0     0       8        3        0      active sync   /dev/sda3
    1     1       0        0        1      faulty removed
[EMAIL PROTECTED]:~# diff -u sb-hda1 sb-sda3
--- sb-hda1 2006-07-01 17:17:36.0 -0400
+++ sb-sda3 2006-07-01 17:17:43.0 -0400
@@ -1,4 +1,4 @@
-/dev/hda1:
+/dev/sda3:
   Magic : a92b4efc
 Version : 00.90.00
UUID : 6b8b4567:327b23c6:643c9869:66334873
@@ -10,22 +10,22 @@
   Total Devices : 4
 Preferred Minor : 2

-Update Time : Mon Jun 26 22:51:12 2006
-  State : active
+Update Time : Mon Jun 26 22:51:06 2006
+  State : clean
  Active Devices : 3
 Working Devices : 3
  Failed Devices : 2
   Spare Devices : 0
-   Checksum : a2163da6 - correct
- Events : 0.47575379
+   Checksum : a4ec2eec - correct
+ Events : 0.47575378

  Layout : left-symmetric
  Chunk Size : 64K

    Number   Major   Minor   RaidDevice State
-this     2       3        1        2      active sync   /dev/hda1
+this     0       8        3        0      active sync   /dev/sda3

    0     0       8        3        0      active sync   /dev/sda3
-   1     1       0        0        1      faulty removed
+   1     1       0        0        1      spare
    2     2       3        1        2      active sync   /dev/hda1
    3     3      22        1        3      active sync   /dev/hdc1

How do I get this array going again?  Am I doing something wrong?
Reading the list archives indicates that there could be bugs in this
area, or that I may need to recreate the array with -C (though that
seems heavy-handed to me).

thanks,

Jason



changing MD device names

2006-07-01 Thread Jon Lewis
I have a system which was running several raid1 devices (md0 - md2) using
2 physical drives (hde and hdg).  I wanted to swap out these drives for
two different ones, so I did the following:


1) swap out hdg for a new drive
2) create degraded raid1's (md3 and md4) using partitions on new hdg
3) format md3 and md4 and copy data from md0-2 to md3-4
4) install grub on new hdg
5) pull hde

Now, after a bit of fixing in the grub menu and fstab, I have a system
that boots up using just one of the new drives, but the md devices are md3
and md4.  What's the easiest way to change the preferred minor # and get
these to be md0 and md1?  Will just booting from a rescue or live CD and
assembling the new drives as md0 and md1 automatically update the preferred
minor in their superblocks?


The system is running Centos 4 (2.6.9-34.0.1.EL kernel).
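
(The value currently stored can be checked on a member partition - hdg1 is
only an example name, use whichever partitions the new raid1s are built on:

# mdadm -E /dev/hdg1 | grep 'Preferred Minor'

Neil Brown's reply further down covers how to change it.)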

--
 Jon Lewis   |  I route
 Senior Network Engineer |  therefore you are
 Atlantic Net|
_ http://www.lewis.org/~jlewis/pgp for PGP public key_


Re: degraded raid5 refuses to start

2006-07-01 Thread Jason Lunz
[EMAIL PROTECTED] said:
> How do I get this array going again?  Am I doing something wrong?
> Reading the list archives indicates that there could be bugs in this
> area, or that I may need to recreate the array with -C (though that
> seems heavy-handed to me).

This is what I ended up doing. I made backups of the three superblocks,
then recreated the array with:

# mdadm -C /dev/md2 -n4 -l5 /dev/sda3 missing /dev/hda1 /dev/hdc1

(I knew the chunk size and layout would be the same, since I just use
the defaults).
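
(The superblock backups themselves can be taken as raw copies of the
metadata area. A sketch for v0.90 superblocks, which sit in the last
64 KiB-aligned 64 KiB of each member device - the device name is an
example, and the arithmetic is worth double-checking before relying on it:

DEV=/dev/hda1
SECTORS=$(blockdev --getsz $DEV)            # device size in 512-byte sectors
OFFSET=$(( (SECTORS / 128 - 1) * 128 ))     # start of the 0.90 superblock, in sectors
dd if=$DEV of=sb-hda1.bin bs=512 skip=$OFFSET count=8    # the superblock itself is 4 KiB

The mdadm -E dumps captured earlier are also a handy plain-text record,
though they can't be written back verbatim.)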

After this, the array works again. I have before and after images of the
three superblocks if anyone wants to look into how they got into this
state.

As far as I can see, the problem was that the broken array got into a
state where the superblock counts were like this:

   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 2

Update Time : Mon Jun 26 22:51:12 2006
  State : active
 Active Devices : 3
Working Devices : 3
 Failed Devices : 2
  Spare Devices : 0

Notice how the total of Working + Failed devices (3 + 2 = 5) exceeds the
number of disks in the array (4). Maybe there's a bug to be fixed here that
lets these counters get out of whack somehow?

After reconstructing the array, the Failed count went back down to 1,
and everything started working normally again. I wonder if simply
decrementing that one value in each superblock would have been enough to 
get the array going again, rather than rewriting all the superblocks. If
so, maybe that can be safely built into mdadm? 

Either that, or the problem was having two disks marked "State: active" and
one marked "clean" in the degraded array.

Anyway, I have a dead disk but kept all my data, so thanks.

Jason



Re: changing MD device names

2006-07-01 Thread Neil Brown
On Saturday July 1, [EMAIL PROTECTED] wrote:
> I have a system which was running several raid1 devices (md0 - md2) using
> 2 physical drives (hde and hdg).  I wanted to swap out these drives for
> two different ones, so I did the following:
>
> 1) swap out hdg for a new drive
> 2) create degraded raid1's (md3 and md4) using partitions on new hdg
> 3) format md3 and md4 and copy data from md0-2 to md3-4
> 4) install grub on new hdg
> 5) pull hde
>
> Now, after a bit of fixing in the grub menu and fstab, I have a system
> that boots up using just one of the new drives, but the md devices are md3
> and md4.  What's the easiest way to change the preferred minor # and get
> these to be md0 and md1?  Will just booting from a rescue or live CD and
> assembling the new drives as md0 and md1 automatically update the preferred
> minor in their superblocks?
>
> The system is running Centos 4 (2.6.9-34.0.1.EL kernel).

You need to do a tiny bit more than assemble the new drives as md0 and
md1.  You also need to cause some write activity so that md bothers to
update the superblock.  Mounting and unmounting the filesystem should
do it.
Or you could assemble with --update=super-minor
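
(A sketch of that second option, with made-up member names - substitute the
actual hdg partitions. --update=super-minor rewrites the preferred minor
stored in each superblock to match the md device being assembled:

# mdadm --assemble /dev/md0 --update=super-minor /dev/hdg1 /dev/hdg2
# mdadm --assemble /dev/md1 --update=super-minor /dev/hdg3 /dev/hdg5

After that the arrays should keep coming up as md0 and md1.)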

NeilBrown