Re: raid issues after power failure

2006-07-03 Thread Ákos Maróy
Francois Barre wrote:
> AFAIK, mdadm -A  will use /etc/mdadm.conf to know what
> underlying partitions you mean with your /dev/md0.
> 
> So, try
> 
> # mdadm --stop /dev/md0
> # mdadm -A /dev/md0 /dev/sd[abcd]1
> 
> And then have a look at your /etc/mdadm.conf, especially the line
> starting with
> ARRAY /dev/md0 ...

yes, I already found the problem in /etc/mdadm.conf, and it was totally
my mistake...

sorry to bother you with such lame problems.


Akos


Re: raid issues after power failure

2006-07-03 Thread Francois Barre

> # mdadm --stop /dev/md0
> # mdadm -A /dev/md0
>
> will result in the array being started with 3 drives out of 4 again. what am
> I doing wrong?
>
>
> Akos


AFAIK, mdadm -A  will use /etc/mdadm.conf to know what
underlying partitions you mean with your /dev/md0.

So, try

# mdadm --stop /dev/md0
# mdadm -A /dev/md0 /dev/sd[abcd]1

And then have a look at your /etc/mdadm.conf, especially the line starting with
ARRAY /dev/md0 ...
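
For reference, a known-good ARRAY line can be generated from the running array
and compared against (or pasted into) /etc/mdadm.conf. This is only a hedged
sketch: the exact fields printed vary between mdadm versions, and the output
line shown here is approximate, reusing the UUID reported elsewhere in this
thread:

# mdadm --detail --scan
ARRAY /dev/md0 level=raid5 num-devices=4 UUID=8a66d568:0be5b0a0:93b729eb:6f23c014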

Regards,


Re: raid issues after power failure

2006-07-03 Thread Ákos Maróy
Francois Barre wrote:
> Well, Neil, I'm wondering,
> It seemed to me that Akos' description of the problem was that
> re-adding the drive (with mdadm not complaining about anything) would
> trigger a resync that would not even start.
> But as your '--force' does the trick, it implies that the resync was
> not really triggered after all without it... Or did I miss a bit of
> log Akos provided that did say so ?
> Could there be a place here for an error message ?

well, the thing is that it's still not totally OK. after doing

# mdadm -A --force /dev/md0
# mdadm /dev/md0 -a /dev/sdb1

it starts to rebuild the array. it takes a lot of time (about 4
hours), which is OK. after the rebuild finishes, all seems fine:

# cat /proc/mdstat
Personalities : [raid5] [raid4]
md0 : active raid5 sdb1[1] sda1[0] sdd1[3] sdc1[2]
  1172126208 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

unused devices: <none>
# mdadm --query /dev/md0
/dev/md0: 1117.83GiB raid5 4 devices, 0 spares. Use mdadm --detail for
more detail.
# mdadm --query --detail /dev/md0
/dev/md0:
Version : 00.90.03
  Creation Time : Tue Apr 25 16:17:14 2006
 Raid Level : raid5
 Array Size : 1172126208 (1117.83 GiB 1200.26 GB)
Device Size : 390708736 (372.61 GiB 400.09 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0
Persistence : Superblock is persistent

Update Time : Mon Jul  3 00:16:39 2006
  State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

 Layout : left-symmetric
 Chunk Size : 64K

   UUID : 8a66d568:0be5b0a0:93b729eb:6f23c014
 Events : 0.2701837

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       2       8       33        2      active sync   /dev/sdc1
       3       8       49        3      active sync   /dev/sdd1
#

right? checking with

# mdadm -E /dev/sd[a-d]1

will also show that all drives have the same event count, etc.
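
For instance, a quick way to compare just those fields side by side is simply
to grep the -E output (a generic sketch, nothing specific to this setup):

# mdadm -E /dev/sd[abcd]1 | grep -E 'Update Time|Events'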

but, just doing a

# mdadm --stop /dev/md0
# mdadm -A /dev/md0

will result in the array being started with 3 drives out of 4 again. what am
I doing wrong?


Akos


Re: raid issues after power failure

2006-07-02 Thread David Greaves
Francois Barre wrote:
> 2006/7/1, Ákos Maróy <[EMAIL PROTECTED]>:
>> Neil Brown wrote:
>> > Try adding '--force' to the -A line.
>> > That tells mdadm to try really hard to assemble the array.
>>
>> thanks, this seems to have solved the issue...
>>
>>
>> Akos
>>
>>
> 
> Well, Neil, I'm wondering,
> It seemed to me that Akos' description of the problem was that
> re-adding the drive (with mdadm not complaining about anything) would
> trigger a resync that would not even start.
> But as your '--force' does the trick, it implies that the resync was
> not really triggered after all without it... Or did I miss a bit of
> log Akos provided that did say so ?
> Could there be a place here for an error message ?
> 
> More generally, could it be useful to build up a recovery howto,
> based on the experiences on this list (I guess 90% of the posts are
> related to recoveries) ?
> Not in terms of a standard disk loss, but in terms of a power failure
> or a major disk problem. You know, re-creating the array, rolling the
> dice, and *tada !* your data is back again... I could not find a bit
> of doc about this.
> 

Francois,
I have started to put a wiki in place here:
  http://linux-raid.osdl.org/

My reasoning was *exactly* that - there is reference information for md
but sometimes the incantations need a little explanation and often the
diagnostics are not obvious...

I've been subscribed to linux-raid since the middle of last year and
I've been going through old messages looking for nuggets to base some
docs around.

I haven't had a huge amount of time recently so I've just scribbled on
it for now - I wanted to present something a little more polished to the
community - but since you're asking...

So don't consider this an official announcement of a usable work yet -
more a 'Please contact me if you would like to contribute' (just so I
can keep track of interested parties) and we can build something up...

David


Re: raid issues after power failure

2006-07-02 Thread Francois Barre

2006/7/1, Ákos Maróy <[EMAIL PROTECTED]>:
> Neil Brown wrote:
> > Try adding '--force' to the -A line.
> > That tells mdadm to try really hard to assemble the array.
>
> thanks, this seems to have solved the issue...
>
>
> Akos

Well, Neil, I'm wondering,
It seemed to me that Akos' description of the problem was that
re-adding the drive (with mdadm not complaining about anything) would
trigger a resync that would not even start.
But as your '--force' does the trick, it implies that the resync was
not really triggered after all without it... Or did I miss a bit of
log Akos provided that did say so ?
Could there be a place here for an error message ?

More generally, could it be useful to build up a recovery howto,
based on the experiences on this list (I guess 90% of the posts are
related to recoveries) ?
Not in terms of a standard disk loss, but in terms of a power failure
or a major disk problem. You know, re-creating the array, rolling the
dice, and *tada !* your data is back again... I could not find a bit
of doc about this.

Regards,


Re: raid issues after power failure

2006-07-01 Thread Ákos Maróy
Neil Brown wrote:
> Try adding '--force' to the -A line.
> That tells mdadm to try really hard to assemble the array.

thanks, this seems to have solved the issue...


Akos



Re: raid issues after power failure

2006-06-30 Thread Neil Brown
On Friday June 30, [EMAIL PROTECTED] wrote:
> On Fri, 30 Jun 2006, Francois Barre wrote:
> > Did you try upgrading mdadm yet ?
> 
> yes, I have version 2.5 now, and it produces the same results.
> 

Try adding '--force' to the -A line.
That tells mdadm to try really hard to assemble the array.

You should be aware that when a degraded array has an unclean
shutdown it is possible that data corruption could result, possibly in
files that have not been changed for a long time.  It is also quite
possible that there is no data corruption, or that it is only on parts of
the array that are not actually in use.

I recommend at least a full 'fsck' in this situation.
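
Since in this particular setup the filesystem lives inside a LUKS container on
top of /dev/md0, the fsck would be run against the decrypted device-mapper
mapping rather than against /dev/md0 itself. A rough sketch, assuming a mapping
name of md0crypt (that name, and the use of plain fsck -f, are guesses and not
taken from this thread):

# cryptsetup luksOpen /dev/md0 md0crypt
# fsck -f /dev/mapper/md0crypt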

NeilBrown


Re: raid issues after power failure

2006-06-30 Thread Francois Barre

2006/6/30, Akos Maroy <[EMAIL PROTECTED]>:
> On Fri, 30 Jun 2006, Francois Barre wrote:
> > Did you try upgrading mdadm yet ?
>
> yes, I have version 2.5 now, and it produces the same results.
>
>
> Akos


And I suppose there is no change in the various outputs mdadm is able
to produce (e.g. -D or -E).
Can you still access your md drive while it's resyncing ? There's
nothing showing up in the dmesg ? Did you try to play with the sysfs
access to md (forcing a resync, ...) ?
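
For what it's worth, on kernels of that era the md sysfs hooks (when present)
can be poked roughly along these lines; this is only a sketch, and exactly
which attributes exist depends on the kernel version:

# cat /sys/block/md0/md/sync_action
# echo repair > /sys/block/md0/md/sync_action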

I have not seen any deadlock-related patch in md from 2.6.16 to
current 2.6.17.2, but, if you have nothing to do, you could give it a
try I guess..


Re: raid issues after power failure

2006-06-30 Thread Akos Maroy

On Fri, 30 Jun 2006, Francois Barre wrote:
> Did you try upgrading mdadm yet ?


yes, I have version 2.5 now, and it produces the same results.


Akos



Re: raid issues after power failure

2006-06-30 Thread Francois Barre

> so, it seems that my array was already degraded when the power
> failure happened, and then it got into the dirty state. what can one do
> about such a situation?


Did you try upgrading mdadm yet ?


Re: raid issues after power failure

2006-06-30 Thread Akos Maroy

On Fri, 30 Jun 2006, Francois Barre wrote:

> I'm wondering :
> sd[acd] has an Update Time : Thu Jun 29 09:10:39 2006
> sdb has an Update Time : Mon Jun 26 20:27:44 2006
>
> When did your power failure happen ?


yesterday (29th). so it seems that /dev/sdb1 failed out on the 26th, and I 
just didn't take notice? :(



> When did you run your mdadm -A /dev/md0 ?


after the power failure was over, and I had a chance to restart the 
machine. late on the 29th...


so, it seems that my array was already degraded when the power
failure happened, and then it got into the dirty state. what can one do
about such a situation?



Akos



Re: raid issues after power failure

2006-06-30 Thread Francois Barre

(replying to myself is one of my favourite hobbies)


> Yep, this looks like it.
> The events difference is quite big : 0.2701790 vs. 0.2607131... Could
> it be that the sdb1 was marked faulty a couple of seconds before the
> power failure ?



I'm wondering :
sd[acd] has an Update Time : Thu Jun 29 09:10:39 2006
sdb has an Update Time : Mon Jun 26 20:27:44 2006

When did your power failure happen ?
When did you run your mdadm -A /dev/md0 ?


Re: raid issues after power failure

2006-06-30 Thread Francois Barre

2006/6/30, Ákos Maróy <[EMAIL PROTECTED]>:
> Francois,
>
> Thank you for the very swift response.
>
> > First, what is your mdadm version ?
>
> # mdadm --version
> mdadm - v1.12.0 - 14 June 2005

Rather an old version, isn't it ?
The freshest meat is 2.5.2, and can be grabbed here :
http://www.cse.unsw.edu.au/~neilb/source/mdadm/

It's quite strange that you have such an old mdadm with a 2.6.16 kernel.
It could be a good idea to upgrade...


> what I see is that /dev/sdb1 is signaled as faulty (though it was known
> not to be faulty before the power failure), and that the whole array is dirty
> because of the power failure - thus it can't resync.


Yep, this looks like it.
The events difference is quite big : 0.2701790 vs. 0.2607131... Could
it be that the sdb1 was marked faulty a couple of seconds before the
power failure ?



> but even if I hot-add /dev/sdb1 after starting the array, it will say that
> it's resyncing, but actually nothing will happen (no disk activity
> according to vmstat, no CPU load, etc.)



That's strange. Once again, maybe try upgrading your mdadm.

If that does not fix it, I guess you'll have to re-construct your array.
That does not necessarily mean you'll lose data, just playing with -C
and miscellaneous stuff, but I'm not good at those games... Neil is,
definitely.
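
To make that last point concrete: the usual last-resort trick is to re-create
the array over the existing disks with exactly the same geometry and
--assume-clean, so that no initial resync rewrites anything. The sketch below
only reuses the parameters visible elsewhere in this thread (raid5, 4 devices,
64k chunk, left-symmetric layout); the device order must match the original
RaidDevice numbering shown by mdadm -E, and getting any of this wrong can
destroy the data, so treat it as an illustration rather than a recipe:

# mdadm --stop /dev/md0
# mdadm -C /dev/md0 --level=5 --raid-devices=4 --chunk=64 \
        --layout=left-symmetric --assume-clean \
        /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1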

Good luck.


Re: raid issues after power failure

2006-06-30 Thread Ákos Maróy
Francois,

Thank you for the very swift response.

> First, what is your mdadm version ?

# mdadm --version
mdadm - v1.12.0 - 14 June 2005

>
> Then, could you please show us the result of :
>
> mdadm -E /dev/sd[abcd]1


# mdadm -E /dev/sd[abcd]1
/dev/sda1:
  Magic : a92b4efc
Version : 00.90.03
   UUID : 8a66d568:0be5b0a0:93b729eb:6f23c014
  Creation Time : Tue Apr 25 16:17:14 2006
 Raid Level : raid5
   Raid Devices : 4
  Total Devices : 3
Preferred Minor : 0

Update Time : Thu Jun 29 09:10:39 2006
  State : active
 Active Devices : 3
Working Devices : 3
 Failed Devices : 1
  Spare Devices : 0
   Checksum : e2b8644f - correct
 Events : 0.2701790

 Layout : left-symmetric
 Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     0       8        1        0      active sync   /dev/sda1

   0     0       8        1        0      active sync   /dev/sda1
   1     1       0        0        1      faulty removed
   2     2       8       33        2      active sync   /dev/sdc1
   3     3       8       49        3      active sync   /dev/sdd1
/dev/sdb1:
  Magic : a92b4efc
Version : 00.90.03
   UUID : 8a66d568:0be5b0a0:93b729eb:6f23c014
  Creation Time : Tue Apr 25 16:17:14 2006
 Raid Level : raid5
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0

Update Time : Mon Jun 26 20:27:44 2006
  State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0
   Checksum : e2db6503 - correct
 Events : 0.2607131

 Layout : left-symmetric
 Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     1       8       17        1      active sync   /dev/sdb1

   0     0       8        1        0      active sync   /dev/sda1
   1     1       8       17        1      active sync   /dev/sdb1
   2     2       8       33        2      active sync   /dev/sdc1
   3     3       8       49        3      active sync   /dev/sdd1
/dev/sdc1:
  Magic : a92b4efc
Version : 00.90.03
   UUID : 8a66d568:0be5b0a0:93b729eb:6f23c014
  Creation Time : Tue Apr 25 16:17:14 2006
 Raid Level : raid5
   Raid Devices : 4
  Total Devices : 3
Preferred Minor : 0

Update Time : Thu Jun 29 09:10:39 2006
  State : active
 Active Devices : 3
Working Devices : 3
 Failed Devices : 1
  Spare Devices : 0
   Checksum : e2b86473 - correct
 Events : 0.2701790

 Layout : left-symmetric
 Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     2       8       33        2      active sync   /dev/sdc1

   0     0       8        1        0      active sync   /dev/sda1
   1     1       0        0        1      faulty removed
   2     2       8       33        2      active sync   /dev/sdc1
   3     3       8       49        3      active sync   /dev/sdd1
/dev/sdd1:
  Magic : a92b4efc
Version : 00.90.03
   UUID : 8a66d568:0be5b0a0:93b729eb:6f23c014
  Creation Time : Tue Apr 25 16:17:14 2006
 Raid Level : raid5
   Raid Devices : 4
  Total Devices : 3
Preferred Minor : 0

Update Time : Thu Jun 29 09:10:39 2006
  State : active
 Active Devices : 3
Working Devices : 3
 Failed Devices : 1
  Spare Devices : 0
   Checksum : e2b86485 - correct
 Events : 0.2701790

 Layout : left-symmetric
 Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     3       8       49        3      active sync   /dev/sdd1

   0     0       8        1        0      active sync   /dev/sda1
   1     1       0        0        1      faulty removed
   2     2       8       33        2      active sync   /dev/sdc1
   3     3       8       49        3      active sync   /dev/sdd1




what I see is that /dev/sdb1 is signaled as faulty (though it was known
not to be faulty before the power failure), and that the whole array is dirty
because of the power failure - thus it can't resync.

but even if I hot-add /dev/sdb1 after starting the array, it will say that
it's resyncing, but actually nothing will happen (no disk activity
according to vmstat, no CPU load, etc.)
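
A quick way to tell whether a rebuild is really progressing, generic and not
specific to this setup: while a recovery is running, /proc/mdstat shows a
progress bar and a speed figure, so watching it for a minute makes it obvious
whether anything is moving:

# watch -n 10 cat /proc/mdstat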


Akos



Re: raid issues after power failure

2006-06-30 Thread Francois Barre

2006/6/30, Ákos Maróy <[EMAIL PROTECTED]>:
> Hi,

Hi,

> I have some issues reviving my raid5 array after a power failure.
> [...]
> strange - why wouldn't it take all four disks (it's omitting /dev/sdb1)?


First, what is your mdadm version ?

Then, could you please show us the result of :

mdadm -E /dev/sd[abcd]1

It would show whether sdb is accessible or not, and what the
superblocks look like...


raid issues after power failure

2006-06-30 Thread Ákos Maróy
Hi,

I have some issues reviving my raid5 array after a power failure. I'm
running Gentoo Linux with a 2.6.16 kernel, and I have a raid5 array /dev/md0
of 4 disks, /dev/sd[a-d]1. On top of this, I have a crypto devmap with LUKS.

After the power failure, the array sort of starts up and doesn't at the
same time:

# cat /proc/mdstat
Personalities : [raid5] [raid4]
unused devices: <none>
# mdadm -A /dev/md0
mdadm: failed to RUN_ARRAY /dev/md0: Input/output error
# cat /proc/mdstat
Personalities : [raid5] [raid4]
md0 : inactive sda1[0] sdd1[3] sdc1[2]
  1172126208 blocks

unused devices: <none>
# mdadm --query /dev/md0
/dev/md0: 0.00KiB raid5 4 devices, 0 spares. Use mdadm --detail for more detail.
/dev/md0: is too small to be an md component.
# mdadm --query --detail /dev/md0
/dev/md0:
Version : 00.90.03
  Creation Time : Tue Apr 25 16:17:14 2006
 Raid Level : raid5
Device Size : 390708736 (372.61 GiB 400.09 GB)
   Raid Devices : 4
  Total Devices : 3
Preferred Minor : 0
Persistence : Superblock is persistent

Update Time : Thu Jun 29 09:10:39 2006
  State : active, degraded
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

 Layout : left-symmetric
 Chunk Size : 64K

   UUID : 8a66d568:0be5b0a0:93b729eb:6f23c014
 Events : 0.2701790

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       0        0        -      removed
       2       8       33        2      active sync   /dev/sdc1
       3       8       49        3      active sync   /dev/sdd1
#


so it's sort of strange that on one hand (in /proc/mdstat) it's inactive,
but according to mdadm --detail it's active? also, mdadm --query says it's
0.00KiB in size?


also, on the mdadm -A /dev/md0 call, the following is written to the syslog:


Jun 30 11:10:22 tower md: md0 stopped.
Jun 30 11:10:22 tower md: bind
Jun 30 11:10:22 tower md: bind
Jun 30 11:10:22 tower md: bind
Jun 30 11:10:22 tower md: md0: raid array is not clean -- starting
background reconstruction
Jun 30 11:10:22 tower raid5: device sda1 operational as raid disk 0
Jun 30 11:10:22 tower raid5: device sdd1 operational as raid disk 3
Jun 30 11:10:22 tower raid5: device sdc1 operational as raid disk 2
Jun 30 11:10:22 tower raid5: cannot start dirty degraded array for md0
Jun 30 11:10:22 tower RAID5 conf printout:
Jun 30 11:10:22 tower --- rd:4 wd:3 fd:1
Jun 30 11:10:22 tower disk 0, o:1, dev:sda1
Jun 30 11:10:22 tower disk 2, o:1, dev:sdc1
Jun 30 11:10:22 tower disk 3, o:1, dev:sdd1
Jun 30 11:10:22 tower raid5: failed to run raid set md0
Jun 30 11:10:22 tower md: pers->run() failed ...


strange - why wouldn't it take all four disks (it's omitting /dev/sdb1)?


Though probably these are very lame questions, I'd still appreciate any
help...


Akos
