Re: raid6 check/repair

2007-11-21 Thread Thiemo Nagel

Dear Neil,


I have been looking a bit at the check/repair functionality in the
raid6 personality.

It seems that if an inconsistent stripe is found during repair, md
does not try to determine which block is corrupt (using e.g. the
method in section 4 of HPA's raid6 paper), but just recomputes the
parity blocks - i.e. the same way as inconsistent raid5 stripes are
handled.

Correct?


Correct!

The most likely cause of parity being incorrect is if a write to
data + P + Q was interrupted when one or two of those had been
written, but the other had not.

No matter which was or was not written, recomputing P and Q will produce
a 'correct' result, and it is simple.  I really don't see any
justification for being more clever.


My opinion about that is quite different.  Speaking just for myself:

a) When I put my data on a RAID running on Linux, I'd expect the 
software to do everything possible to protect and, when necessary, 
restore data integrity.  (This expectation was one of the reasons 
why I chose software RAID with Linux.)


b) As a consequence of a):  When I'm using a RAID level that has extra 
redundancy, I'd expect Linux to make use of that extra redundancy during 
a 'repair'.  (Otherwise I'd consider repair a misnomer and rather call 
it 'recalc parity'.)


c) Why should 'repair' be implemented in a way that only works in most 
cases, when there exists a solution that works in all cases?  (After all, 
the possibilities for corruption are many, e.g. bad RAM, bad cables, 
chipset bugs, driver bugs and, last but not least, human mistakes.  From 
all these errors I'd like to be able to recover gracefully without 
putting the array at risk by removing and re-adding a component device.)


Bottom line:  So far I have been talking about *my* expectations; is it 
reasonable to assume that they are shared by others?  Are there any 
arguments I'm not aware of that speak against an improved 
implementation of 'repair'?


BTW:  I just checked, it's the same for RAID 1:  When I intentionally 
corrupt a sector in the first device of a set of 16, 'repair' copies the 
corrupted data to the 15 remaining devices instead of restoring the 
correct sector from one of the other fifteen devices to the first.


Thank you for your time.

Kind regards,

Thiemo Nagel



RE: Offtopic: hardware advice for SAS RAID6

2007-11-21 Thread Daniel Korstad
I would be very interested to hear how that card works, and to hear other 
suggestions and discussion too.
 
I have a 10-disk RAID 6 all inside a large case, using onboard motherboard 
SATA and a couple of 4-port PCI SATA cards.  For one, my PCI bus is saturated 
and a bottleneck, and it is a bit of a pain to replace drives, since I have to 
open the case each time.
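
(For what it's worth, the rough arithmetic behind that saturation, assuming 
classic 32-bit/33 MHz PCI: the shared bus tops out around 133 MB/s, while ten 
drives that can each stream on the order of 60-80 MB/s would want several 
hundred MB/s in aggregate, so during a resync or a big sequential read the 
bus, not the disks, is the limit.)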
 
I have been running this for years without issues, but would like to upgrade 
the system at some point and would like to use mdadm software raid, with an 
external enclosure allowing easy drive swapping.
 
The most I can contribute to this thread is the SATA multilane enclosures I 
have been eyeing:
http://www.pc-pitstop.com/sata_enclosures/
 
But like you, I need a card that is Linux friendly and leaves RAID control of 
the drives to mdadm.
 
I'll follow this one and would like to hear about your experiences with the SAS 
card you choose.

- Original Message -
From: Richard Michael [EMAIL PROTECTED]
Sent: Tue, 11/20/2007 10:08am
To: linux-raid@vger.kernel.org
Subject: Offtopic: hardware advice for SAS RAID6

On the heels of last week's post asking about hardware recommendations,
I'd like to ask a few questions too. :)

I'm considering my first SAS purchase.  I'm planning to build a software
RAID6 array using a SAS JBOD attached to a linux box.  I haven't decided
on any of the hardware specifics.

I'm leaning toward this PCI express LSI 3801e controller:
http://www.lsi.com/storage_home/products_home/host_bus_adapters/sas_hbas/lsisas3801e/index.html

Although, Adaptec has a similar PCI-X model.

I'd probably purchase a cheap Dell rackmount 1U server (e.g. PowerEdge
860) for the controller.  It has dual Gb ethernet, which I'd channel
bond for decent network I/O performance.

The JBOD would ideally be 1U or 2U holding 8 or 10 disks.  If I
understand SAS correctly, I'd probably have a unit with 2 SFF-8088
miniSAS connectors (although, I believe these connectors only support 4
devices, so if the JBOD is 8 disks, I don't know what would happen).

I'm completely undecided on the JBOD itself; recommendations here would
be greatly appreciated.  It's a bit of a shot in the dark.

I'd appreciate feedback and suggestions on the hardware above, or a
discussion of the performance.  (E.g. a discussion of SAS
ports/bandwith, PCI express lanes/bandwidth, disks and network to
determine the through put of this setup.)
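
My own rough back-of-envelope, assuming PCIe 1.x, 3 Gb/s SAS and current
7200 rpm drives (please correct me if these numbers are off):

    PCIe x8 slot (1.x):   8 lanes x ~250 MB/s  ~= 2 GB/s per direction
    SFF-8088 wide port:   4 lanes x 3 Gb/s     ~= 1.2 GB/s after 8b/10b
    8-10 disks:           ~60-80 MB/s each     ~= 500-800 MB/s aggregate
    2x bonded GbE:        2 x ~115 MB/s        ~= 230 MB/s

so for streaming workloads the bonded network would be the bottleneck long
before the SAS link or the PCIe slot.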

Cheers,
Richard


Re: raid6 check/repair

2007-11-21 Thread Neil Brown
On Wednesday November 21, [EMAIL PROTECTED] wrote:
 Dear Neil,
 
  I have been looking a bit at the check/repair functionality in the
  raid6 personality.
  
  It seems that if an inconsistent stripe is found during repair, md
  does not try to determine which block is corrupt (using e.g. the
  method in section 4 of HPA's raid6 paper), but just recomputes the
  parity blocks - i.e. the same way as inconsistent raid5 stripes are
  handled.
  
  Correct?
  
  Correct!
  
  The most likely cause of parity being incorrect is if a write to
  data + P + Q was interrupted when one or two of those had been
  written, but the other had not.
  
  No matter which was or was not written, recomputing P and Q will produce
  a 'correct' result, and it is simple.  I really don't see any
  justification for being more clever.
 
 My opinion about that is quite different.  Speaking just for myself:
 
 a) When I put my data on a RAID running on Linux, I'd expect the 
 software to do everything possible to protect and, when necessary, 
 restore data integrity.  (This expectation was one of the reasons 
 why I chose software RAID with Linux.)

Yes, of course.  "Possible" is an important aspect of this.

 
 b) As a consequence of a):  When I'm using a RAID level that has extra 
 redundancy, I'd expect Linux to make use of that extra redundancy during 
 a 'repair'.  (Otherwise I'd consider repair a misnomer and rather call 
 it 'recalc parity'.)

The extra redundancy in RAID6 is there to enable you to survive two
drive failures.  Nothing more.

While it is possible to use the RAID6 P+Q information to deduce which
data block is wrong if it is known that at most one data block is
wrong, it is *not* possible to deduce which block or blocks are wrong
if more than one data block may be wrong.
As it is quite possible for a write to be aborted in the middle
(during an unexpected power down) with an unknown number of blocks in a
given stripe updated and others not, we do not know how many blocks
might be wrong, so we cannot try to recover any particular wrong block.
Doing so would quite possibly corrupt a block that is not wrong.
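
For reference, the single-bad-block case is exactly what section 4 of HPA's
paper covers.  A minimal byte-wise sketch of that computation, assuming the
GF(2^8) field with polynomial 0x11d and generator {02} that the RAID6 code
uses; this is my own illustration, not md's actual code:

    #include <stdint.h>

    static uint8_t gf_log[256];

    /* Build the log table for GF(2^8), polynomial
     * x^8 + x^4 + x^3 + x^2 + 1 (0x11d), generator {02}. */
    static void gf_init(void)
    {
            uint8_t x = 1;
            for (int i = 0; i < 255; i++) {
                    gf_log[x] = i;
                    x = (x << 1) ^ ((x & 0x80) ? 0x1d : 0);
            }
    }

    /*
     * For one byte position in a stripe, given
     *   p_xor = P_stored ^ P_recomputed  and
     *   q_xor = Q_stored ^ Q_recomputed,
     * both non-zero, and assuming exactly one data block is bad,
     * return the index z of that block; the corrected byte is then
     * bad_byte ^ p_xor.  A result >= the number of data disks means
     * the single-error assumption was wrong.  If only p_xor is
     * non-zero, P itself is bad; if only q_xor is, Q itself is bad.
     */
    static int raid6_locate_bad_data(uint8_t p_xor, uint8_t q_xor)
    {
            return (gf_log[q_xor] - gf_log[p_xor] + 255) % 255;
    }

But as above, this only helps when you already know that at most one block
is bad, which the interrupted-write case does not guarantee.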

The repair process repairs the parity (redundancy information).
It does not repair the data.  It cannot.

The only possible scenario that md/raid recognises for the parity
information being wrong is the case of an unexpected shutdown in the
middle of a stripe write, where some blocks have been written and some
have not.
Further (for RAID 4/5/6), it only supports this case when your array
is not degraded.  If you have a degraded array, then an unexpected
shutdown is potentially fatal to your data (the chance of it actually
being fatal is quite small, but the potential is still there).
There is nothing RAID can do about this.  It is not designed to
protect against power failure.  It is designed to protect against
drive failure.  It does that quite well.

If you have wrong data appearing on your device for some other reason,
then you have a serious hardware problem and RAID cannot help you.

The best approach to dealing with data on drives getting spontaneously
corrupted is for the filesystem to perform strong checksums on the
data blocks, and store the checksums in the indexing information.  This
provides detection, though not recovery, of course.
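
As a tiny illustration of that idea (my own sketch, not any particular
filesystem's on-disk format): store a CRC alongside each block pointer in
the index, and verify it whenever the block is read back.  A mismatch tells
you the block is bad, but not what the good contents were.

    #include <stdint.h>
    #include <zlib.h>               /* crc32() */

    #define BLOCK_SIZE 4096

    struct block_ptr {
            uint64_t blocknr;       /* where the data block lives */
            uint32_t crc;           /* checksum kept in the index,
                                       away from the data itself */
    };

    static void index_block(struct block_ptr *bp, uint64_t blocknr,
                            const uint8_t data[BLOCK_SIZE])
    {
            bp->blocknr = blocknr;
            bp->crc = crc32(0L, data, BLOCK_SIZE);
    }

    /* Returns 1 if the data read back matches the stored checksum. */
    static int verify_block(const struct block_ptr *bp,
                            const uint8_t data[BLOCK_SIZE])
    {
            return crc32(0L, data, BLOCK_SIZE) == bp->crc;
    }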

 
 c) Why should 'repair' be implemented in a way that only works in most 
 cases, when there exists a solution that works in all cases?  (After all, 
 the possibilities for corruption are many, e.g. bad RAM, bad cables, 
 chipset bugs, driver bugs and, last but not least, human mistakes.  From 
 all these errors I'd like to be able to recover gracefully without 
 putting the array at risk by removing and re-adding a component device.)

As I said above - there is no solution that works in all cases.  If
more than one block is corrupt, and you don't know which ones, then
you lose, and there is no way around that.
RAID is not designed to protect against bad RAM, bad cables, chipset
bugs, driver bugs, etc.  It is only designed to protect against drive
failure, where the drive failure is apparent.  i.e. a read must return
either the same data that was last written, or a failure indication.
Anything else is beyond the design parameters for RAID.
It might be possible to design a data storage system that was
resilient to these sorts of errors.  It would be much more
sophisticated than RAID though.

NeilBrown


 
 Bottom line:  So far I have been talking about *my* expectations; is it 
 reasonable to assume that they are shared by others?  Are there any 
 arguments I'm not aware of that speak against an improved 
 implementation of 'repair'?
 
 BTW:  I just checked, it's the same for RAID 1:  When I intentionally 
 corrupt a sector in the first device of a set of 16, 'repair' copies the 
 corrupted data to the 15 remaining devices instead of restoring the 
 correct sector from one of the other fifteen devices to the first.
 
 Thank you for your time.
 

Re: BUG: soft lockup detected on CPU#1! (was Re: raid6 resync blocks the entire system)

2007-11-21 Thread Neil Brown
On Tuesday November 20, [EMAIL PROTECTED] wrote:
 
 My personal (wild) guess for this problem is that there is a global lock 
 somewhere, preventing all other CPUs from doing anything.  At 100% sync 
 speed (about 80 MB/s) there is probably no time window left in which to 
 wake up the other CPUs, or it is small enough that only high-priority 
 kernel threads get to run.
 When I limit the sync to 40 MB/s, each resync CPU has to wait long enough 
 to allow the other CPUs to wake up.
 
 

md doesn't hold any locks that would prevent other parts of the
kernel from working.

I cannot imagine what would be causing your problems.  The resync
thread makes a point of calling cond_resched() periodically so that it
will let other processes run even if it constantly has work to do.
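
The pattern is roughly the following (a simplified sketch of the idea, with
placeholder helpers; not the actual md_do_sync() code):

    /* Simplified sketch of a long-running resync-style kernel loop that
     * yields the CPU voluntarily; the helpers are placeholders. */
    static void resync_loop_sketch(void)
    {
            while (more_work_to_do()) {     /* placeholder condition */
                    handle_next_stripe();   /* placeholder work item */
                    /* Let any other runnable task in if it is waiting,
                     * so the loop never hogs a CPU outright. */
                    cond_resched();
            }
    }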

If you have nothing that could write to the RAID6 arrays, then I
cannot see how the resync could affect the rest of the system except
to reduce the amount of available CPU time.  And as CPU is normally
much faster than drives, you wouldn't expect that effect to be very
great.

Very strange.

Can you do 'alt-sysrq-T' when it is frozen and get the process traces
from the kernel logs?

Can you send me the output of 'cat /proc/mdstat' after the resync has
started, but before the system has locked up?

I'm sorry that I cannot suggest anything more useful.

NeilBrown
