Re: Bad drive discovered during raid5 reshape

2007-10-30 Thread Neil Brown
On Tuesday October 30, [EMAIL PROTECTED] wrote:
 Neil Brown wrote:
  On Monday October 29, [EMAIL PROTECTED] wrote:
  Hi,
  I bought two new hard drives to expand my raid array today and
  unfortunately one of them appears to be bad. The problem didn't arise
 
  Looks like you are in real trouble.  Both the drives seem bad in some
  way.  If it was just sdc that was failing it would have picked up
  after the -Af, but when it tried, sdb gave errors.
 
 Humble enquiry :)
 
 I'm not sure that's right?
 He *removed* sdb and sdc when the failure occurred so sdc would indeed be 
 non-fresh.

I'm not sure what point you are making here.
In any case, removing two drives from a raid5 is always a bad thing.
Part of the array was striped over 8 drives by this time.  With only
six still in the array, some data will be missing.

 
 The key question I think is: will md continue to grow an array even if it
 enters degraded mode during the grow?
 ie grow from a 6 drive array to a 7-of-8 degraded array?
 
 Technically I guess it should be able to.

Yes, md can grow to a degraded array.  If you get a single failure I
would expect it to abort the growth process, then restart where it
left off (after checking that that made sense).

 
 In which case should he be able to re-add /dev/sdc and allow md to retry the
 grow? (possibly losing some data due to the sdc staleness)

He only needs one of the two drives in there.  I got the impression
that both sdc and sdb had reported errors.  If not, and sdc really
seems OK, then --assemble --force listing all drives except sdb
should make it all work again.
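
Spelled out for this array (device names taken from the mdadm --detail
output later in the thread; this is only a sketch, so double-check the
names on your system before running it):

  # assemble from every member except the failing sdb1; --force lets
  # mdadm bump the event count on the stale sdc1 so it is accepted
  mdadm --assemble --force /dev/md1 \
      /dev/hdb1 /dev/sdd1 /dev/sdg1 /dev/sde1 /dev/sdf1 /dev/sda1 /dev/sdc1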

NeilBrown


Re: Bad drive discovered during raid5 reshape

2007-10-30 Thread Kyle Stuart
David Greaves wrote:
 I read that he aborted it, then removed both drives before giving md a chance
 to restart.

 He said:
 After several minutes dmesg indicated that mdadm gave up and
 the grow process stopped. After googling around I tried the solutions
 that seemed most likely to work, including removing the new drives with
 mdadm --remove --force /dev/md1 /dev/sd[bc]1 and rebooting

 and *then* he: ran mdadm -Af /dev/md1.
   
This is correct. I first removed sdb and sdc then rebooted and ran mdadm
-Af /dev/md1.

 Kyle - I think you need to clarify this as it may not be too bad. Apologies
 if I misread something and sdc is bad too :)

 It may be an idea to let us (Neil) know what you've done and if you've done
 any writes to any devices before trying this assemble.

 David
When I sent the first email I thought only sdb had failed. After digging
into the log files it appears sdc also reported several bad blocks
during the grow. This is what I get for not testing cheap refurbed
drives before trusting them with my data, but hindsight is 20/20.
Fortunately all of the important data is backed up so if I can't recover
anything using Neil's suggestions it's not a total loss.

Thank you both for the help.


Bad drive discovered during raid5 reshape

2007-10-29 Thread Kyle Stuart
Hi,
I bought two new hard drives to expand my raid array today and
unfortunately one of them appears to be bad. The problem didn't arise
until after I attempted to grow the raid array. I was trying to expand
the array from 6 to 8 drives. I added both drives using mdadm --add
/dev/md1 /dev/sdb1 which completed, then mdadm --add /dev/md1 /dev/sdc1
which also completed. I then ran mdadm --grow /dev/md1 --raid-devices=8.
It passed the critical section, then began the grow process.

After a few minutes I started to hear unusual sounds from within the
case. Fearing the worst I tried to cat /proc/mdstat which resulted in no
output so I checked dmesg which showed that /dev/sdb1 was not working
correctly. After several minutes dmesg indicated that mdadm gave up and
the grow process stopped. After googling around I tried the solutions
that seemed most likely to work, including removing the new drives with
mdadm --remove --force /dev/md1 /dev/sd[bc]1 and rebooting after which I
ran mdadm -Af /dev/md1. The grow process restarted then failed almost
immediately. Trying to mount the drive gives me a reiserfs replay
failure and suggests running fsck. I don't dare fsck the array since
I've already messed it up so badly. Is there any way to go back to the
original working 6 disc configuration with minimal data loss? Here's
where I'm at right now, please let me know if I need to include any
additional information.

# uname -a
Linux nas 2.6.22-gentoo-r5 #1 SMP Thu Aug 23 16:59:47 MDT 2007 x86_64
AMD Athlon(tm) 64 Processor 3500+ AuthenticAMD GNU/Linux

# mdadm --version
mdadm - v2.6.2 - 21st May 2007

# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md1 : active raid5 hdb1[0] sdb1[8](F) sda1[5] sdf1[4] sde1[3] sdg1[2]
sdd1[1]
  1220979520 blocks super 0.91 level 5, 64k chunk, algorithm 2 [8/6]
[UUUUUU__]

unused devices: <none>

# mdadm --detail --verbose /dev/md1
/dev/md1:
        Version : 00.91.03
  Creation Time : Sun Apr  8 19:48:01 2007
     Raid Level : raid5
     Array Size : 1220979520 (1164.42 GiB 1250.28 GB)
  Used Dev Size : 244195904 (232.88 GiB 250.06 GB)
   Raid Devices : 8
  Total Devices : 7
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Mon Oct 29 00:53:21 2007
          State : clean, degraded
 Active Devices : 6
Working Devices : 6
 Failed Devices : 1
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

  Delta Devices : 2, (6->8)

           UUID : 56e7724e:9a5d0949:ff52889f:ac229049
         Events : 0.487460

    Number   Major   Minor   RaidDevice State
       0       3       65        0      active sync   /dev/hdb1
       1       8       49        1      active sync   /dev/sdd1
       2       8       97        2      active sync   /dev/sdg1
       3       8       65        3      active sync   /dev/sde1
       4       8       81        4      active sync   /dev/sdf1
       5       8        1        5      active sync   /dev/sda1
       6       0        0        6      removed
       8       8       17        7      faulty spare rebuilding   /dev/sdb1

# dmesg
<snip>
md: md1 stopped.
md: unbind<hdb1>
md: export_rdev(hdb1)
md: unbind<sdc1>
md: export_rdev(sdc1)
md: unbind<sdb1>
md: export_rdev(sdb1)
md: unbind<sda1>
md: export_rdev(sda1)
md: unbind<sdf1>
md: export_rdev(sdf1)
md: unbind<sde1>
md: export_rdev(sde1)
md: unbind<sdg1>
md: export_rdev(sdg1)
md: unbind<sdd1>
md: export_rdev(sdd1)
md: bind<sdd1>
md: bind<sdg1>
md: bind<sde1>
md: bind<sdf1>
md: bind<sda1>
md: bind<sdb1>
md: bind<sdc1>
md: bind<hdb1>
md: md1 stopped.
md: unbind<hdb1>
md: export_rdev(hdb1)
md: unbind<sdc1>
md: export_rdev(sdc1)
md: unbind<sdb1>
md: export_rdev(sdb1)
md: unbind<sda1>
md: export_rdev(sda1)
md: unbind<sdf1>
md: export_rdev(sdf1)
md: unbind<sde1>
md: export_rdev(sde1)
md: unbind<sdg1>
md: export_rdev(sdg1)
md: unbind<sdd1>
md: export_rdev(sdd1)
md: bind<sdd1>
md: bind<sdg1>
md: bind<sde1>
md: bind<sdf1>
md: bind<sda1>
md: bind<sdb1>
md: bind<sdc1>
md: bind<hdb1>
md: kicking non-fresh sdc1 from array!
md: unbind<sdc1>
md: export_rdev(sdc1)
raid5: reshape will continue
raid5: device hdb1 operational as raid disk 0
raid5: device sdb1 operational as raid disk 7
raid5: device sda1 operational as raid disk 5
raid5: device sdf1 operational as raid disk 4
raid5: device sde1 operational as raid disk 3
raid5: device sdg1 operational as raid disk 2
raid5: device sdd1 operational as raid disk 1
raid5: allocated 8462kB for md1
raid5: raid level 5 set md1 active with 7 out of 8 devices, algorithm 2
RAID5 conf printout:
 --- rd:8 wd:7
 disk 0, o:1, dev:hdb1
 disk 1, o:1, dev:sdd1
 disk 2, o:1, dev:sdg1
 disk 3, o:1, dev:sde1
 disk 4, o:1, dev:sdf1
 disk 5, o:1, dev:sda1
 disk 7, o:1, dev:sdb1
...ok start reshape thread
md: reshape of RAID array md1
md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 20
KB/sec) for reshape.
md: using 128k window, over a total of 244195904 blocks.
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata2.00: cmd 

Re: Bad drive discovered during raid5 reshape

2007-10-29 Thread Neil Brown
On Monday October 29, [EMAIL PROTECTED] wrote:
 Hi,
 I bought two new hard drives to expand my raid array today and
 unfortunately one of them appears to be bad. The problem didn't arise
 until after I attempted to grow the raid array. I was trying to expand
 the array from 6 to 8 drives. I added both drives using mdadm --add
 /dev/md1 /dev/sdb1 which completed, then mdadm --add /dev/md1 /dev/sdc1
 which also completed. I then ran mdadm --grow /dev/md1 --raid-devices=8.
 It passed the critical section, then began the grow process.
 
 After a few minutes I started to hear unusual sounds from within the
 case. Fearing the worst I tried to cat /proc/mdstat which resulted in no
 output so I checked dmesg which showed that /dev/sdb1 was not working
 correctly. After several minutes dmesg indicated that mdadm gave up and
 the grow process stopped. After googling around I tried the solutions
 that seemed most likely to work, including removing the new drives with
 mdadm --remove --force /dev/md1 /dev/sd[bc]1 and rebooting after which I
 ran mdadm -Af /dev/md1. The grow process restarted then failed almost
 immediately. Trying to mount the drive gives me a reiserfs replay
 failure and suggests running fsck. I don't dare fsck the array since
 I've already messed it up so badly. Is there any way to go back to the
 original working 6 disc configuration with minimal data loss? Here's
 where I'm at right now, please let me know if I need to include any
 additional information.

Looks like you are in real trouble.  Both the drives seem bad in some
way.  If it was just sdc that was failing it would have picked up
after the -Af, but when it tried, sdb gave errors.

Having two failed devices in a RAID5 is not good!

Your best bet goes like this:

  The reshape has started and got up to some point.  The data
  before that point is spread over 8 drives.  The data after is over
  6.
  We need to restripe the 8-drive data back to 6 drives.  This can be
  done with the test_stripe tool that can be built from the mdadm
  source. 

  1/ Find out how far the reshape progressed, by using mdadm -E on
 one of the devices.
  2/ use something like
test_stripe save /some/file 8 $chunksize 5 2 0 $length  /dev/..

 If you get all the args right, this should copy the data from
 the array into /some/file.
 You could possibly do the same thing by assembling the array 
 read-only (set /sys/module/md_mod/parameters/start_ro to 1)
 and 'dd' from the array.  It might be worth doing both and
 checking you get the same result.

  3/ use something like
test_stripe restore /some/file 6 ..
 to restore the data to just 6 devices.

  4/ use mdadm -C to create the array anew on the 6 devices.  Make
 sure the order and the chunk size etc. are preserved (see the sketch
 below).

 Once you have done this, the start of the array should (again)
 look like the content of /some/file.  It wouldn't hurt to check.

   Then your data would be as much back together as possible.
   You will probably still need to do an fsck, but I think you did the
   right thing in holding off.  Don't do an fsck until you are sure
   the array is writable.
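
To make steps 1 and 4 concrete, here is a rough sketch for this particular
array, using the geometry from the mdadm --detail output above (64k chunk,
left-symmetric layout, original device order hdb1 sdd1 sdg1 sde1 sdf1 sda1).
Field names and exact arguments can differ between mdadm versions, so treat
it as illustrative only:

  # 1/ see how far the reshape progressed before it failed
  mdadm -E /dev/hdb1 | grep -i reshape

  # 4/ recreate the original 6-device array with the same geometry and
  #    device order; mdadm will warn that the devices already contain an
  #    array, which is expected here
  mdadm --create /dev/md1 --level=5 --raid-devices=6 \
      --chunk=64 --layout=left-symmetric \
      /dev/hdb1 /dev/sdd1 /dev/sdg1 /dev/sde1 /dev/sdf1 /dev/sda1

Whether to also pass --assume-clean (to skip the initial resync) is worth
confirming first; the command above is only meant to show the geometry flags.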

You can probably do the above without using test_stripe by using dd to
make a copy of the array before you recreate it, then using dd to put the
same data back.  Using test_stripe as well might give you extra
confidence. 
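
A hedged sketch of the dd variant ($RESHAPED_CHUNKS is a placeholder for
however many 64k blocks the reshape had already covered, per step 1):

  # set start_ro before assembling so the array comes up read-only
  echo 1 > /sys/module/md_mod/parameters/start_ro
  # with /dev/md1 assembled, copy out the already-reshaped region
  dd if=/dev/md1 of=/some/file bs=64k count=$RESHAPED_CHUNKS
  # after recreating the 6-device array, write the same data back
  dd if=/some/file of=/dev/md1 bs=64k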

Feel free to ask questions

NeilBrown