Re: Trying to start dirty, degraded RAID6 array

2006-04-26 Thread Neil Brown
On Thursday April 27, [EMAIL PROTECTED] wrote:
> Neil Brown wrote:
> > The '-f' is meant to make this work.  However it seems there is a bug.
> > 
> > Could you please test this patch?  It isn't exactly the right fix, but
> > it definitely won't hurt.
> 
> Thanks, Neil, I'll give this a go when I get home tonight.
> 
> Is there any way to start an array without kicking off a rebuild ?

echo 1 > /sys/module/md_mod/parameters/start_ro 

If you do this, then arrays will be read-only when they are started,
and so will not do a rebuild.  The first write request to the array
(e.g. if you mount a filesystem) will cause a switch to read/write and
any required rebuild will start. 

echo 0 > /sys/module/md_mod/parameters/start_ro
will revert the effect.

This requires a reasonably recent kernel.
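In case a worked example helps, the whole sequence would look something
like this (using the device names from your report; the exact
/proc/mdstat wording may differ between kernel versions):

  echo 1 > /sys/module/md_mod/parameters/start_ro

  mdadm -Af /dev/md0 /dev/sd[abcdefgijkl]1   # array assembles read-only
  cat /proc/mdstat                           # md0 should appear as
                                             # (auto-read-only), no resync

  # The first write request (e.g. mounting a filesystem read/write)
  # switches the array to read/write, and any required resync starts then.

  echo 0 > /sys/module/md_mod/parameters/start_ro   # back to normal behaviour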

NeilBrown


Re: Trying to start dirty, degraded RAID6 array

2006-04-26 Thread Christopher Smith

Neil Brown wrote:

> The '-f' is meant to make this work.  However it seems there is a bug.
>
> Could you please test this patch?  It isn't exactly the right fix, but
> it definitely won't hurt.

Thanks, Neil, I'll give this a go when I get home tonight.

Is there any way to start an array without kicking off a rebuild ?

CS


Re: Trying to start dirty, degraded RAID6 array

2006-04-26 Thread Neil Brown
On Thursday April 27, [EMAIL PROTECTED] wrote:
> The short version:
> 
> I have a 12-disk RAID6 array that has lost a device and now whenever I 
> try to start it with:
> 
> mdadm -Af /dev/md0 /dev/sd[abcdefgijkl]1
> 
> I get:
> 
> mdadm: failed to RUN_ARRAY /dev/md0: Input/output error
> 
...
> raid6: cannot start dirty degraded array for md0

The '-f' is meant to make this work.  However it seems there is a bug.

Could you please test this patch?  It isn't exactly the right fix, but
it definitely won't hurt.

Thanks,
NeilBrown

Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./super0.c |    1 +
 1 file changed, 1 insertion(+)

diff ./super0.c~current~ ./super0.c
--- ./super0.c~current~ 2006-03-28 17:10:51.000000000 +1100
+++ ./super0.c  2006-04-27 10:03:40.000000000 +1000
@@ -372,6 +372,7 @@ static int update_super0(struct mdinfo *
         if (sb->level == 5 || sb->level == 4 || sb->level == 6)
                 /* need to force clean */
                 sb->state |= (1 << MD_SB_CLEAN);
+        rv = 1;
     }
     if (strcmp(update, "assemble")==0) {
         int d = info->disk.number;
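If it helps, applying and testing this against an mdadm source tree is
roughly the following (the source path and patch file name are just
examples):

  cd /path/to/mdadm-source
  patch -p0 < force-clean.patch    # the diff above, saved to a file
  make
  ./mdadm -Af /dev/md0 /dev/sd[abcdefgijkl]1   # retry the forced assemble
                                               # with the freshly built mdadm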


Trying to start dirty, degraded RAID6 array

2006-04-26 Thread Christopher Smith

The short version:

I have a 12-disk RAID6 array that has lost a device and now whenever I 
try to start it with:


mdadm -Af /dev/md0 /dev/sd[abcdefgijkl]1

I get:

mdadm: failed to RUN_ARRAY /dev/md0: Input/output error

And in dmesg:

md: bind
md: bind
md: bind
md: bind
md: bind
md: bind
md: bind
md: bind
md: bind
md: bind
md: bind
md: md0: raid array is not clean -- starting background reconstruction
raid6: device sdl1 operational as raid disk 0
raid6: device sdc1 operational as raid disk 11
raid6: device sda1 operational as raid disk 10
raid6: device sdd1 operational as raid disk 9
raid6: device sdb1 operational as raid disk 8
raid6: device sdg1 operational as raid disk 6
raid6: device sdf1 operational as raid disk 5
raid6: device sde1 operational as raid disk 4
raid6: device sdj1 operational as raid disk 3
raid6: device sdi1 operational as raid disk 2
raid6: device sdk1 operational as raid disk 1
raid6: cannot start dirty degraded array for md0
RAID6 conf printout:
 --- rd:12 wd:11 fd:1
 disk 0, o:1, dev:sdl1
 disk 1, o:1, dev:sdk1
 disk 2, o:1, dev:sdi1
 disk 3, o:1, dev:sdj1
 disk 4, o:1, dev:sde1
 disk 5, o:1, dev:sdf1
 disk 6, o:1, dev:sdg1
 disk 8, o:1, dev:sdb1
 disk 9, o:1, dev:sdd1
 disk 10, o:1, dev:sda1
 disk 11, o:1, dev:sdc1
raid6: failed to run raid set md0
md: pers->run() failed ...


I'm 99% sure the data is ok and I'd like to know how to force the array 
online.




Longer version:

A couple of days ago I started having troubles with my fileserver 
mysteriously hanging during boot (I was messing with trying to get Xen 
running at the time, so lots of reboots were involved).  I finally 
nailed it down to the autostarting of the RAID array.


After several hours of pulling CPUs, SATA cards, RAM (not to mention 
some scary problems with memtest86+ that turned out to be because "USB 
Legacy" was enabled) I finally managed to figure out that one of my 
drives would simply stop transferring data after about the first gig 
(tested with dd, monitoring with iostat).  About 30 seconds after the 
drive "stops", the rest of the machine also hangs.


Interestingly, there are no error messages anywhere I could find 
indicating the drive was having problems.  Even its SMART test (smartctl 
-t long) says it's ok.  This made the problem substantially more 
difficult to figure out.
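For reference, the SMART check amounted to running the long self-test and
then reading the results back, roughly (exact commands approximate):

  smartctl -t long /dev/sdh      # start the extended offline self-test
  smartctl -l selftest /dev/sdh  # once it finishes, the self-test log
                                 # shows it completed without error
  smartctl -a /dev/sdh           # overall health and attributes for good measure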


I then tried to start the array without the broken disk and had the 
problem mentioned in the short version above - the array wouldn't start, 
presumably because its rebuild had been started and (uncleanly) stopped 
about a dozen times since it last succeeded.  I finally managed to get 
the array online by starting it with all the disks, then immediately 
knocking the one I knew to be bad offline with 'mdadm /dev/md0 -f 
/dev/sdh1' before it hit the point where it would hang.  After that the 
rebuild completed without error (I didn't touch the machine at all while 
it was rebuilding).
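In other words, roughly this (commands approximate, from memory):

  mdadm -A /dev/md0 /dev/sd[a-l]1   # assemble with all 12 disks, sdh included
  mdadm /dev/md0 -f /dev/sdh1       # immediately fail the known-bad drive,
                                    # before it reaches the point where it hangs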


However, a few hours after the rebuild completed, a power failure killed 
the machine again and now I can't start the array, as outlined in the 
"short version" above.  I must admit I find it a bit weird that the 
array is "dirty and degraded" after it had successfully completed a rebuild.


Unfortunately the original failed drive (/dev/sdh) is no longer 
available, so I can't do my original trick again.  I'm pretty sure - 
based on the rebuild completing previously - that the data will be fine 
if I can just get the array back online.  Is there some sort of 
--really-force switch to mdadm ?  Can the array be brought back online 
*without* triggering a rebuild, so I can get as much data as possible 
off and then start from scratch again ?


CS

Here is the 'mdadm --examine /dev/sdX' output for each of the remaining 
drives, if it is helpful:


/dev/sda1:
          Magic : a92b4efc
        Version : 00.90.02
           UUID : 78ddbb47:e4dfcf9e:5f24461a:19104298
  Creation Time : Wed Feb  1 01:09:11 2006
     Raid Level : raid6
    Device Size : 244195904 (232.88 GiB 250.06 GB)
     Array Size : 2441959040 (2328.83 GiB 2500.57 GB)
   Raid Devices : 12
  Total Devices : 11
Preferred Minor : 0

    Update Time : Wed Apr 26 22:30:01 2006
          State : active
 Active Devices : 11
Working Devices : 11
 Failed Devices : 1
  Spare Devices : 0
       Checksum : 1685ebfc - correct
         Events : 0.11176511


      Number   Major   Minor   RaidDevice State
this    10       8        1       10      active sync   /dev/sda1

   0     0       8      177        0      active sync   /dev/sdl1
   1     1       8      161        1      active sync   /dev/sdk1
   2     2       8      129        2      active sync   /dev/sdi1
   3     3       8      145        3      active sync   /dev/sdj1
   4     4       8       65        4      active sync   /dev/sde1
   5     5       8       81        5      active sync   /dev/sdf1
   6     6       8       97        6      active sync   /dev/sdg1
   7     7       0        0        7      faulty removed
   8     8       8       17        8      active sync   /dev/sdb1