Re: PROBLEM: RAID5 reshape data corruption

2008-01-06 Thread Nagilum

- Message from [EMAIL PROTECTED] -
Date: Sun, 06 Jan 2008 00:31:46 +0100
From: Nagilum [EMAIL PROTECTED]


At the moment I'm thinking about writing a small perl program that
will generate a shell script or makefile containing dd commands to
copy the chunks from the drive to /dev/md0. I don't care if that is
dog slow as long as I get most of my data back. (I'd probably go
forward instead of backward to take advantage of the readahead, once
I've determined the exact start chunk.)
For that I need to know one more thing.
Used Dev Size is 488308672k for md0 as well as for the disk, with a 16k chunk size.
488308672k / 16k = 30519292 chunks, so the last chunk index is 30519291
and the first dd (starting from the end) would look like:
 dd if=/dev/sdg of=/dev/md0 bs=16k count=1 skip=30519291 seek=X

The big question now is how to calculate X.
Since I have a working testcase I can do a lot of testing before
touching the real thing. The formula for X will probably involve a
5 for the 5(+1) devices the raid spans now, a 4 for the 4(+1) devices
it spanned before the reshape, a 3 for the device number of the
disk that failed, and of course the skip/current chunk number.
Can you help me come up with it?
Thanks again for looking into the whole issue.

- End message from [EMAIL PROTECTED] -

Ok, the spare time over the weekend allowed me to make some headway.
I'm not sure whether the attachment will make it through to the ML, so I
have also uploaded the perl script to: http://www.nagilum.de/md/rdrep.pl
First tests already show promising results, although I still seem to miss
the start of the corruption. Anyway, unlike with the testcase, on the real
array I have to start after the area that is unreadable; I already
determined that point last Friday.

Anyway, I would appreciate it if someone could have a look over the script.
I'll probably change it a little and make every other dd run via exec
instead of system to get some parallelism. (I guess the overhead of
running dd takes about as much time as the transfer itself.)
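
Something along these lines is what I have in mind for the parallelism
(just a rough sketch, not necessarily how rdrep.pl will end up doing it;
it assumes the generated dd lines are fed in on stdin and keeps two of
them in flight at a time):

#!/usr/bin/perl
# Rough sketch: read the generated dd commands on stdin and keep two of
# them running at a time, so the fork/exec overhead overlaps with the copy.
use strict;
use warnings;

my $previous = 0;                       # pid of the dd from the last iteration
while (my $line = <STDIN>) {
    chomp $line;
    next unless $line =~ /^dd\s/;       # ignore anything that isn't a dd command
    my @cmd = split ' ', $line;         # the dd arguments contain no whitespace
    my $pid = fork();
    die "fork: $!" unless defined $pid;
    if ($pid == 0) {                    # child: become the dd process
        exec(@cmd) or die "exec: $!";
    }
    waitpid($previous, 0) if $previous; # reap the previous dd; at most two run at once
    $previous = $pid;
}
waitpid($previous, 0) if $previous;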

Thanks again,
Alex


#_  __  _ __ http://www.nagilum.org/ \n icq://69646724 #
#   / |/ /__  _(_) /_  _  [EMAIL PROTECTED] \n +491776461165 #
#  // _ `/ _ `/ / / // /  ' \  Amiga (68k/PPC): AOS/NetBSD/Linux   #
# /_/|_/\_,_/\_, /_/_/\_,_/_/_/_/   Mac (PPC): MacOS-X / NetBSD /Linux #
#   /___/ x86: FreeBSD/Linux/Solaris/Win2k  ARM9: EPOC EV6 #




cakebox.homeunix.net - all the machine one needs..



rdrep.pl
Description: Perl program




Re: PROBLEM: RAID5 reshape data corruption

2008-01-05 Thread Nagilum

- Message from [EMAIL PROTECTED] -
    Date: Fri, 4 Jan 2008 09:37:24 +1100
    From: Neil Brown [EMAIL PROTECTED]
Reply-To: Neil Brown [EMAIL PROTECTED]
 Subject: Re: PROBLEM: RAID5 reshape data corruption
      To: Nagilum [EMAIL PROTECTED]
      Cc: linux-raid@vger.kernel.org, Dan Williams [EMAIL PROTECTED],
          H. Peter Anvin [EMAIL PROTECTED]



I'm not just interested in a simple behaviour fix; I'm also interested
in what actually happens and, if possible, a repair program for that
kind of data corruption.


What happens is that when a reshape runs while a device is missing,
the data on that device should be computed from the other data devices
and parity.  However, because of the above bug, the data is copied into
the new layout before the compute is complete.  This means that the
data that was on that device is really lost beyond recovery.

I'm really sorry about that, but there is nothing that can be done to
recover the lost data.


Thanks a lot, Neil!
I can confirm your findings: the data in the corrupt chunks is the data
from the broken device. Now to my particular case:

I still have the old disk and I haven't touched the array since.
I just ran dd_rescue -r (reverse) on the old disk and, as I expected,
most of it (99%) is still readable. So what I want to do is read the
chunks from that disk - starting at the end and going down to the 4%
point where the reshape was interrupted by the disk read error - and
replace the corresponding chunks on md0.

That should restore most of the data.
Now in order to do that I need to know how to calculate the
corresponding chunk positions.

So for the old disk I have:
nas:~# mdadm -E /dev/sdg
/dev/sdg:
          Magic : a92b4efc
        Version : 00.91.00
           UUID : 25da80a6:d56eb9d6:0d7656f3:2f233380
  Creation Time : Sat Sep 15 21:11:41 2007
     Raid Level : raid5
  Used Dev Size : 488308672 (465.69 GiB 500.03 GB)
     Array Size : 2441543360 (2328.44 GiB 2500.14 GB)
   Raid Devices : 6
  Total Devices : 7
Preferred Minor : 0

  Reshape pos'n : 118360960 (112.88 GiB 121.20 GB)
  Delta Devices : 1 (5->6)

    Update Time : Fri Nov 23 20:05:50 2007
          State : active
 Active Devices : 6
Working Devices : 7
 Failed Devices : 0
  Spare Devices : 1
       Checksum : 9a8358c4 - correct
         Events : 0.677965

         Layout : left-symmetric
     Chunk Size : 16K

      Number   Major   Minor   RaidDevice State
this     3       8       96        3      active sync   /dev/sdg

   0     0       8        0        0      active sync   /dev/sda
   1     1       8       16        1      active sync   /dev/sdb
   2     2       8       32        2      active sync   /dev/sdc
   3     3       8       96        3      active sync   /dev/sdg
   4     4       8       64        4      active sync   /dev/sde
   5     5       8       80        5      active sync   /dev/sdf
   6     6       8       48        6      spare   /dev/sdd

the current array is:

nas:~# mdadm -Q --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Sat Sep 15 21:11:41 2007
     Raid Level : raid5
     Array Size : 2441543360 (2328.44 GiB 2500.14 GB)
  Used Dev Size : 488308672 (465.69 GiB 500.03 GB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Sat Jan  5 17:53:54 2008
          State : clean
 Active Devices : 6
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 16K

           UUID : 25da80a6:d56eb9d6:0d7656f3:2f233380
         Events : 0.986918

    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync   /dev/sda
       1       8       16        1      active sync   /dev/sdb
       2       8       32        2      active sync   /dev/sdc
       3       8       48        3      active sync   /dev/sdd
       4       8       64        4      active sync   /dev/sde
       5       8       80        5      active sync   /dev/sdf

At the moment I'm thinking about writing a small perl program that
will generate a shell script or makefile containing dd commands to
copy the chunks from the drive to /dev/md0. I don't care if that is
dog slow as long as I get most of my data back. (I'd probably go
forward instead of backward to take advantage of the readahead, once
I've determined the exact start chunk.)

For that I need to know one more thing.
Used Dev Size is 488308672k for md0 as well as for the disk, with a 16k chunk size.
488308672k / 16k = 30519292 chunks, so the last chunk index is 30519291
and the first dd (starting from the end) would look like:
 dd if=/dev/sdg of=/dev/md0 bs=16k count=1 skip=30519291 seek=X

The big question now is how to calculate X.
Since I have a working testcase I can do a lot of testing before
touching the real thing. The formula for X will probably involve a
5 for the 5(+1) devices the raid spans now, a 4 for the 4(+1) devices
it spanned before the reshape, a 3 for the device number of the
disk that failed, and of course the skip/current chunk number.
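
To make the question more concrete, this is the kind of mapping I am
guessing at (completely untested, the helper and variable names are
mine; it assumes the standard left-symmetric layout for the old 5-disk
geometry, that the 0.90 superblock means data starts at offset 0 of the
member disk, and that seek on /dev/md0 is simply the logical chunk
number, so please correct me if it is off):

#!/usr/bin/perl
# Guess at the mapping: chunk offset $s on the failed disk (old 5-device
# left-symmetric layout, 16k chunks) -> logical chunk number X on /dev/md0.
use strict;
use warnings;

my $n_old  = 5;    # devices the array spanned before the reshape (4+1)
my $d_fail = 3;    # RaidDevice number of the failed disk (/dev/sdg)

sub seek_for_chunk {
    my ($s) = @_;                                # $s = skip value, i.e. the stripe number
    my $parity = ($n_old - 1) - ($s % $n_old);   # left-symmetric: parity walks down from the last disk
    return undef if $d_fail == $parity;          # this chunk held parity, nothing to restore
    my $k = ($d_fail - $parity - 1) % $n_old;    # data index of the failed disk within the stripe
    return $s * ($n_old - 1) + $k;               # logical chunk number on md0
}

# Example: emit dd commands for a few chunks near the end, skipping parity chunks.
for my $s (30519287 .. 30519291) {
    my $x = seek_for_chunk($s);
    next unless defined $x;
    print "dd if=/dev/sdg of=/dev/md0 bs=16k count=1 skip=$s seek=$x\n";
}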

Re: PROBLEM: RAID5 reshape data corruption

2008-01-03 Thread Neil Brown
On Monday December 31, [EMAIL PROTECTED] wrote:
 Ok, since my previous thread didn't seem to attract much attention,
 let me try again.

Thank you for your report and your patience.

 An interrupted RAID5 reshape will cause the md device in question to
 contain one corrupt chunk per stripe if resumed in the wrong manner.
 A testcase can be found at http://www.nagilum.de/md/ .
 The first testcase can be initialized with start.sh; the real test
 can then be run with test.sh. The first testcase also uses dm-crypt
 and xfs to show the corruption.

It looks like this can be fixed with the patch:

Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/raid5.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c    2008-01-04 09:20:54.000000000 +1100
+++ ./drivers/md/raid5.c        2008-01-04 09:21:05.000000000 +1100
@@ -2865,7 +2865,7 @@ static void handle_stripe5(struct stripe
                         md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
                 }
 
-        if (s.expanding && s.locked == 0)
+        if (s.expanding && s.locked == 0 && s.req_compute == 0)
                 handle_stripe_expansion(conf, sh, NULL);
 
         if (sh->ops.count)


With this patch in place, the v2 test only reports errors after the end
of the original array, as you would expect (the new space is
initialised to 0).

 I'm not just interested in a simple behaviour fix; I'm also interested
 in what actually happens and, if possible, a repair program for that
 kind of data corruption.

What happens is that when a reshape runs while a device is missing,
the data on that device should be computed from the other data devices
and parity.  However, because of the above bug, the data is copied into
the new layout before the compute is complete.  This means that the
data that was on that device is really lost beyond recovery.

I'm really sorry about that, but there is nothing that can be done to
recover the lost data.

NeilBrown


PROBLEM: RAID5 reshape data corruption

2007-12-31 Thread Nagilum

Ok, since my previous thread didn't seem to attract much attention,
let me try again.
An interrupted RAID5 reshape will cause the md device in question to
contain one corrupt chunk per stripe if resumed in the wrong manner.
A testcase can be found at http://www.nagilum.de/md/ .
The first testcase can be initialized with start.sh; the real test
can then be run with test.sh. The first testcase also uses dm-crypt
and xfs to show the corruption.
The second testcase uses nothing but mdadm and testpat - a small
program to write and verify a simple testpattern designed to find
block data corruption. Use v2_start.sh & v2_test.sh to run it.
At the end it will point out all the wrong bytes on the md device.
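
(testpat itself is not reproduced here; a minimal stand-in, not the
actual program and only meant to show the idea, could look like this:)

#!/usr/bin/perl
# Minimal stand-in for testpat: write the block number into every 16k
# block of a device, or verify it and report blocks that no longer match.
# Usage: testpat-sketch.pl write|verify /dev/mdX
use strict;
use warnings;
use Fcntl qw(O_RDONLY O_WRONLY);

my ($mode, $dev) = @ARGV;
die "usage: $0 write|verify <device>\n"
    unless $mode && $dev && $mode =~ /^(write|verify)$/;

my $bs = 16 * 1024;
sysopen(my $fh, $dev, $mode eq 'write' ? O_WRONLY : O_RDONLY)
    or die "$dev: $!";

my $block = 0;
while (1) {
    # pattern: the block number, repeated to fill the whole 16k block
    my $expect = sprintf("%015d\n", $block) x ($bs / 16);
    if ($mode eq 'write') {
        my $n = syswrite($fh, $expect);
        last unless defined $n && $n == $bs;    # stop at end of device
    } else {
        my $n = sysread($fh, my $got, $bs);
        last unless defined $n && $n == $bs;
        print "block $block differs\n" if $got ne $expect;
    }
    $block++;
}
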
I'm not just interested in a simple behaviour fix; I'm also interested
in what actually happens and, if possible, a repair program for that
kind of data corruption.
The bug is architecture-agnostic: I first came across it using
2.6.23.8 on amd64, but I have verified it on 2.6.23.[8-12] and
2.6.24-rc[5,6] on ppc, always using mdadm 2.6.4.

The situation in which the bug first showed up was as follows:
1. A RAID5 reshape from 5 to 6 devices was started.
2. After about 4% one disk failed; the machine appeared unresponsive
and was rebooted.
3. A spare disk was added to the array.
4. The bad drive was re-added to the array in a different bay and the
reshape resumed.
5. The drive failed again but the reshape continued.
6. The reshape finished, followed by the resync. The data after about
the 4% point on the md device is broken as described above.


Kind regards,
Alex.



#_  __  _ __ http://www.nagilum.org/ \n icq://69646724 #
#   / |/ /__  _(_) /_  _  [EMAIL PROTECTED] \n +491776461165 #
#  // _ `/ _ `/ / / // /  ' \  Amiga (68k/PPC): AOS/NetBSD/Linux   #
# /_/|_/\_,_/\_, /_/_/\_,_/_/_/_/   Mac (PPC): MacOS-X / NetBSD /Linux #
#   /___/ x86: FreeBSD/Linux/Solaris/Win2k  ARM9: EPOC EV6 #






cakebox.homeunix.net - all the machine one needs..


