Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)

2007-12-10 Thread Tejun Heo
Bill Davidsen wrote:
 Jan Engelhardt wrote:
 On Dec 1 2007 06:26, Justin Piszcz wrote:
 I ran the following:

 dd if=/dev/zero of=/dev/sdc
 dd if=/dev/zero of=/dev/sdd
 dd if=/dev/zero of=/dev/sde

 (as it is always a very good idea to do this with any new disk)

 Why would you care about what's on the disk? fdisk, mkfs and
 the day-to-day operation will overwrite it _anyway_.

 (If you think the disk is not empty, you should look at it
 and copy off all usable warez beforehand :-)

 Do you not test your drives for minimum functionality before using them?

I personally don't.

 Also, if you have the tools to check for relocated sectors before and
 after doing this, that's a good idea as well. S.M.A.R.T. is your friend.
 And when writing /dev/zero to a drive, if it craps out, you have less
 emotional attachment to the data.
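 For instance (the device name here is just an example), you can snapshot
 the reallocation counters before and after with something like:

   smartctl -A /dev/sdc | grep -i reallocated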

Writing all zeros isn't too useful, though.  A drive failing reallocation
on write is a catastrophic failure.  It means that the drive wants to
relocate a sector but can't, because it has used up all its spare space,
which usually indicates something else is seriously wrong with the drive.
The drive will have to go to the trash can.  This is all serious and bad,
but the catch is that in such cases the problem usually sticks out like a
sore thumb, so either the vendor doesn't ship such a drive or you'll find
the failure very early.  I personally haven't seen any such failure yet.
Maybe I'm lucky.

Most data loss occurs when the drive fails to read what it thought it
wrote successfully, not the other way around - so reading and dumping the
whole disk to /dev/null periodically is probably much better than writing
zeros, as it allows the drive to find a deteriorating sector early, while
it's still readable, and relocate it.  But then again I think it's overkill.
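If you do want to do it, a read-only pass over the whole disk (device name
just an example here) is as simple as:

  dd if=/dev/sdc of=/dev/null bs=1M

On an md array, something like 'echo check > /sys/block/md0/md/sync_action'
(md0 being whatever your array is) does much the same and also verifies
the redundancy.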

Writing zeros to sectors is more useful as a cure than as prevention.
If your drive fails to read a sector, write any value to that sector.
The drive will then forget about the data on the damaged sector,
reallocate it, and write the new data there.  Of course, you lose the
data which was originally on the sector.
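With GNU dd that looks something like the following, where BAD_LBA is a
placeholder for the failing 512-byte sector number from the kernel log and
sdc is again just an example device:

  dd if=/dev/zero of=/dev/sdc bs=512 count=1 seek=BAD_LBA oflag=direct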

I personally think it's enough to just throw in an extra disk, make it
RAID1 or RAID5, and rebuild the array if a read fails on one of the disks.
If a write fails or the read failures continue, replace the disk.  Of
course, if you want to be extra cautious, good for you.  :-)
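For the record, swapping a disk in an md array is roughly (md0, sdX1 and
sdY1 being placeholders for your array, the failing member and the new
partition):

  mdadm /dev/md0 --fail /dev/sdX1 --remove /dev/sdX1
  mdadm /dev/md0 --add /dev/sdY1
  cat /proc/mdstat          # watch the rebuild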

-- 
tejun


Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)

2007-12-10 Thread Tejun Heo
Justin Piszcz wrote:
 The badblocks did not do anything; however, when I built a software
 RAID 5 and then performed a dd:
 
 /usr/bin/time dd if=/dev/zero of=fill_disk bs=1M
 
 [42332.936615] ata5.00: exception Emask 0x2 SAct 0x7000 SErr 0x0 action
 0x2 frozen
 [42332.936706] ata5.00: spurious completions during NCQ issue=0x0
 SAct=0x7000 FIS=004040a1:0800
 
 Next test, I will turn off NCQ and try to make the problem recur.
 Does anyone else have any thoughts here?
 I ran long SMART tests on all 3 disks; they all completed successfully.

 Perhaps these drives need to be NCQ BLACKLISTED with the P35 chipset?

That was me being stupid.  Patches for both the upstream and -stable
branches have been posted, so these warnings will go away.
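In the meantime, if you still want to test with NCQ off, capping the queue
depth to 1 via sysfs usually does it, e.g. (assuming the drive is sdc):

  echo 1 > /sys/block/sdc/device/queue_depth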

Thanks.

-- 
tejun


Re: unable to remove failed drive

2007-12-10 Thread Bill Davidsen

Jeff Breidenbach wrote:

... and all access to the array hangs indefinitely, resulting in unkillable
zombie processes. I have to hard-reboot the machine. Any thoughts on the matter?

===

# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sde1[6](F) sdg1[1] sdb1[4] sdd1[3] sdc1[2]
  488383936 blocks [6/4] [_UUUU_]

unused devices: <none>

# mdadm --fail /dev/md1 /dev/sde1
mdadm: set /dev/sde1 faulty in /dev/md1

# mdadm --remove /dev/md1 /dev/sde1
mdadm: hot remove failed for /dev/sde1: Device or resource busy

# mdadm -D /dev/md1
/dev/md1:
Version : 00.90.03
  Creation Time : Sun Dec 25 16:12:34 2005
 Raid Level : raid1
 Array Size : 488383936 (465.76 GiB 500.11 GB)
Device Size : 488383936 (465.76 GiB 500.11 GB)
   Raid Devices : 6
  Total Devices : 5
Preferred Minor : 1
Persistence : Superblock is persistent

Update Time : Fri Dec  7 11:37:46 2007
  State : active, degraded
 Active Devices : 4
Working Devices : 4
 Failed Devices : 1
  Spare Devices : 0

   UUID : f3ee6aa3:2f1d5767:f443dfc0:c23e80af
 Events : 0.22331500

    Number   Major   Minor   RaidDevice State
       0       0        0        -      removed
       1       8       97        1      active sync   /dev/sdg1
       2       8       33        2      active sync   /dev/sdc1
       3       8       49        3      active sync   /dev/sdd1
       4       8       17        4      active sync   /dev/sdb1
       5       0        0        -      removed

       6       8       65        0      faulty   /dev/sde1

  
This is without doubt really messed up! You have four active devices,
four working devices, five total devices, and six(!) raid devices. And
at the end of the output, seven(!!) devices: four active, two removed,
and one faulty. I wouldn't even be able to guess how you got to this
point, but I suspect that some system administration was involved.


If this is an array you can live without while still having a working
system, I do have a thought, however. If you can unmount everything on
this device and then stop it, you may be able to assemble (-A) it with
just the four working drives. If that succeeds you may be able to remove
sde1, although I suspect that the two removed drives shown are really the
result of a partial removal of sde1 in the past. Either that, or you have
a serious reliability problem...
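Roughly, and assuming nothing else is holding md1, something like:

  umount /dev/md1
  mdadm --stop /dev/md1
  mdadm --assemble /dev/md1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sdg1

using the four members shown as active sync in your -D output above.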


I'm sure others will have some ideas on this; if it were mine, a backup
would be my first order of business.


--
Bill Davidsen [EMAIL PROTECTED]
 "Woe unto the statesman who makes war without a reason that will still
  be valid when the war is over..." - Otto von Bismarck





Re: [PATCH 003 of 3] md: Update md bitmap during resync.

2007-12-10 Thread Mike Snitzer
On Dec 7, 2007 12:42 AM, NeilBrown [EMAIL PROTECTED] wrote:

 Currently an md array with a write-intent bitmap does not update
 that bitmap to reflect a successful partial resync.  Rather, the entire
 bitmap is updated when the resync completes.

 This is because there is no guarantee that resync requests will
 complete in order, and tracking each request individually is
 unnecessarily burdensome.

 However, there is value in regularly updating the bitmap, so add code
 to periodically pause while all pending sync requests complete, then
 update the bitmap.  Doing this only every few seconds (the same as the
 bitmap update time) does not noticeably affect resync performance.

 Signed-off-by: Neil Brown [EMAIL PROTECTED]

Hi Neil,

You forgot to export bitmap_cond_end_sync.  Please see the attached patch.

regards,
Mike
diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index f31ea4f..b596538 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -1566,3 +1566,4 @@ EXPORT_SYMBOL(bitmap_start_sync);
 EXPORT_SYMBOL(bitmap_end_sync);
 EXPORT_SYMBOL(bitmap_unplug);
 EXPORT_SYMBOL(bitmap_close_sync);
+EXPORT_SYMBOL(bitmap_cond_end_sync);