Re: raid6 check/repair
Dear Neil,

>> I have been looking a bit at the check/repair functionality in the
>> raid6 personality.  It seems that if an inconsistent stripe is found
>> during repair, md does not try to determine which block is corrupt
>> (using e.g. the method in section 4 of HPA's raid6 paper), but just
>> recomputes the parity blocks - i.e. the same way as inconsistent
>> raid5 stripes are handled.  Correct?
>
> Correct!  The most likely cause of parity being incorrect is a write
> to data + P + Q that was interrupted when one or two of those had been
> written but the others had not.  No matter which was or was not
> written, recomputing P and Q will produce a 'correct' result, and it
> is simple.  I really don't see any justification for being more
> clever.

My opinion about that is quite different.  Speaking just for myself:

a) When I put my data on a RAID running on Linux, I'd expect the
   software to do everything possible to protect, and when necessary to
   restore, data integrity.  (This expectation was one of the reasons
   why I chose software RAID with Linux.)

b) As a consequence of a): when I'm using a RAID level that has extra
   redundancy, I'd expect Linux to make use of that extra redundancy
   during a 'repair'.  (Otherwise I'd consider 'repair' a misnomer and
   rather call it 'recalc parity'.)

c) Why should 'repair' be implemented in a way that only works in most
   cases, when there exists a solution that works in all cases?  (After
   all, possibilities for corruption are many, e.g. bad RAM, bad cables,
   chipset bugs, driver bugs, and last but not least human mistake.
   From all these errors I'd like to be able to recover gracefully
   without putting the array at risk by removing and re-adding a
   component device.)

Bottom line: so far I have been talking about *my* expectations.  Is it
reasonable to assume that they are shared by others?  Are there any
arguments that I'm not aware of speaking against an improved
implementation of 'repair'?
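For readers unfamiliar with the method referred to above: a minimal sketch of the section-4 idea from HPA's raid6 paper follows.  This is illustrative code, not md's implementation; function names are made up for this sketch.  If exactly one data block is corrupt, the bytewise differences between the stored and recomputed P and Q syndromes reveal which block it is, since dQ = g^z * dP in GF(2^8) for corrupt block z.

```python
# Sketch: locating a single corrupt data block in a RAID6 stripe, per
# section 4 of HPA's raid6 paper.  Arithmetic is in GF(2^8) with the
# polynomial 0x11d and generator g = 0x02 (the field RAID6 uses).

def build_tables():
    exp, log = [0] * 512, [0] * 256
    x = 1
    for i in range(255):
        exp[i] = x
        log[x] = i
        x <<= 1
        if x & 0x100:
            x ^= 0x11d
    for i in range(255, 512):           # doubled table avoids a mod 255
        exp[i] = exp[i - 255]
    return exp, log

EXP, LOG = build_tables()

def gf_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]

def syndromes(data):
    """P = XOR of all blocks; Q = XOR of g^i * D_i, both bytewise."""
    n = len(data[0])
    p, q = bytearray(n), bytearray(n)
    for i, block in enumerate(data):
        for j, byte in enumerate(block):
            p[j] ^= byte
            q[j] ^= gf_mul(EXP[i], byte)   # g^i = EXP[i]
    return bytes(p), bytes(q)

def find_corrupt_block(data, p, q):
    """If exactly one data block is corrupt, return its index, else None."""
    p2, q2 = syndromes(data)
    candidates = set()
    for j in range(len(p)):
        dp, dq = p[j] ^ p2[j], q[j] ^ q2[j]
        if dp == 0 and dq == 0:
            continue                       # this byte position is consistent
        if dp == 0 or dq == 0:
            return None                    # P-only or Q-only mismatch
        # dq = g^z * dp  =>  z = log(dq) - log(dp)  (mod 255)
        candidates.add((LOG[dq] - LOG[dp]) % 255)
    return candidates.pop() if len(candidates) == 1 else None

def repair(data, p, z):
    """XOR the P-syndrome difference back into block z, bytewise."""
    p2, _ = syndromes(data)
    data[z] = bytes(b ^ p[j] ^ p2[j] for j, b in enumerate(data[z]))
```

E.g. with `data = [b'AAAA', b'BBBB', b'CCCC', b'DDDD']` and stored `p, q = syndromes(data)`, corrupting `data[2]` makes `find_corrupt_block(data, p, q)` return 2, and `repair(data, p, 2)` restores the original block.  Note the precondition Neil raises below: this only works if at most one data block is wrong, which md cannot assume after an interrupted write.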
BTW: I just checked, and it's the same for RAID 1: when I intentionally
corrupt a sector on the first device of a set of 16, 'repair' copies the
corrupted data to the 15 remaining devices instead of restoring the
correct sector to the first device from one of the other fifteen.

Thank you for your time.

Kind regards,
Thiemo Nagel
Dipl. Phys., Physik Department E18, Technische Universität München
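The repair behaviour Thiemo is asking for in the RAID1 case would amount to majority voting across the mirrors.  A hypothetical sketch (this is not md's actual behaviour, and the function name is made up here):

```python
# Sketch: repairing a RAID1 mirror set by majority vote rather than by
# unconditionally copying the first device's data to the others.
from collections import Counter

def majority_repair(copies):
    """Given the same sector as read from every mirror, return the value
    held by a strict majority of devices, plus the indices of the
    devices that disagree with it."""
    winner, votes = Counter(copies).most_common(1)[0]
    if votes <= len(copies) // 2:
        raise ValueError("no strict majority - cannot decide which copy is right")
    bad = [i for i, c in enumerate(copies) if c != winner]
    return winner, bad
```

With 16 mirrors and one corrupted copy, `majority_repair` would identify device 0 as the odd one out and return the other fifteen devices' data as the value to rewrite.  The catch, as with RAID6, is that a vote is only trustworthy under an assumption about how many copies can be wrong at once.
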
RE: Offtopic: hardware advice for SAS RAID6
I would be very interested to hear how that card works out, and other
suggestions and discussion too.  I have a 10 disk RAID 6 all inside a
large case, using on-board motherboard SATA and a couple of 4 port PCI
SATA cards.  For one, my PCI bus is saturated and a bottleneck, and it
is a bit of a pain to replace drives by having to open the case each
time.  I have been running this for years without issues, but would
like to upgrade the system at some point, still using mdadm software
RAID, with an external enclosure allowing easy drive swapping.  The
most I can contribute to this thread are the SATA multilane enclosures
I have been eyeing: http://www.pc-pitstop.com/sata_enclosures/  But
like you, I need a card that is Linux friendly and leaves RAID control
of the drives to mdadm.  I'll follow this one, and would like to hear
about your experiences with the SAS card you choose.

----- Original Message -----
From: Richard Michael [EMAIL PROTECTED]
Sent: Tue, 11/20/2007 10:08am
To: linux-raid@vger.kernel.org
Subject: Offtopic: hardware advice for SAS RAID6

On the heels of last week's post asking about hardware recommendations,
I'd like to ask a few questions too. :)

I'm considering my first SAS purchase.  I'm planning to build a
software RAID6 array using a SAS JBOD attached to a Linux box.  I
haven't decided on any of the hardware specifics.  I'm leaning toward
this PCI Express LSI 3801e controller:
http://www.lsi.com/storage_home/products_home/host_bus_adapters/sas_hbas/lsisas3801e/index.html
although Adaptec has a similar PCI-X model.  I'd probably purchase a
cheap Dell rackmount 1U server (e.g. PowerEdge 860) for the controller.
It has dual Gb ethernet, which I'd channel bond for decent network I/O
performance.  The JBOD would ideally be 1U or 2U, holding 8 or 10
disks.  If I understand SAS correctly, I'd probably have a unit with 2
SFF-8088 miniSAS connectors (although I believe these connectors only
support 4 devices, so if the JBOD is 8 disks, I don't know what would
happen).
I'm completely undecided on the JBOD itself; recommendations here would
be greatly appreciated.  It's a bit of a shot in the dark.  I'd
appreciate feedback and suggestions on the hardware above, or a
discussion of the performance (e.g. a discussion of SAS
ports/bandwidth, PCI Express lanes/bandwidth, disks and network, to
determine the throughput of this setup).

Cheers,
Richard

-
To unsubscribe from this list: send the line "unsubscribe linux-raid"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: raid6 check/repair
On Wednesday November 21, [EMAIL PROTECTED] wrote:
>>> I have been looking a bit at the check/repair functionality in the
>>> raid6 personality.  It seems that if an inconsistent stripe is
>>> found during repair, md does not try to determine which block is
>>> corrupt (using e.g. the method in section 4 of HPA's raid6 paper),
>>> but just recomputes the parity blocks - i.e. the same way as
>>> inconsistent raid5 stripes are handled.  Correct?
>>
>> Correct!  The most likely cause of parity being incorrect is a write
>> to data + P + Q that was interrupted when one or two of those had
>> been written but the others had not.  No matter which was or was not
>> written, recomputing P and Q will produce a 'correct' result, and it
>> is simple.  I really don't see any justification for being more
>> clever.
>
> My opinion about that is quite different.  Speaking just for myself:
>
> a) When I put my data on a RAID running on Linux, I'd expect the
>    software to do everything possible to protect, and when necessary
>    to restore, data integrity.  (This expectation was one of the
>    reasons why I chose software RAID with Linux.)

Yes, of course.  "Possible" is an important aspect of this.

> b) As a consequence of a): when I'm using a RAID level that has extra
>    redundancy, I'd expect Linux to make use of that extra redundancy
>    during a 'repair'.  (Otherwise I'd consider 'repair' a misnomer
>    and rather call it 'recalc parity'.)

The extra redundancy in RAID6 is there to enable you to survive two
drive failures.  Nothing more.

While it is possible to use the RAID6 P+Q information to deduce which
data block is wrong if it is known that either 0 or 1 data blocks are
wrong, it is *not* possible to deduce which block or blocks are wrong
if more than 1 data block may be wrong.  As it is quite possible for a
write to be aborted in the middle (during an unexpected power down)
with an unknown number of blocks in a given stripe updated but others
not, we do not know how many blocks might be wrong, so we cannot try to
recover some wrong block.
Doing so would quite possibly corrupt a block that is not wrong.

The repair process repairs the parity (redundancy information).  It
does not repair the data.  It cannot.

The only scenario that md/raid recognises for the parity information
being wrong is an unexpected shutdown in the middle of a stripe write,
where some blocks have been written and some have not.  Further (for
raid 4/5/6), it only supports this case when your array is not
degraded.  If you have a degraded array, then an unexpected shutdown is
potentially fatal to your data (the chance of it actually being fatal
is quite small, but the potential is still there).  There is nothing
RAID can do about this.  It is not designed to protect against power
failure.  It is designed to protect against drive failure.  It does
that quite well.

If you have wrong data appearing on your device for some other reason,
then you have a serious hardware problem and RAID cannot help you.  The
best approach to dealing with data on drives getting spontaneously
corrupted is for the filesystem to perform strong checksums on the data
blocks, and store the checksums in the indexing information.  This
provides detection, though not recovery, of course.

> c) Why should 'repair' be implemented in a way that only works in
>    most cases, when there exists a solution that works in all cases?
>    (After all, possibilities for corruption are many, e.g. bad RAM,
>    bad cables, chipset bugs, driver bugs, and last but not least
>    human mistake.  From all these errors I'd like to be able to
>    recover gracefully without putting the array at risk by removing
>    and re-adding a component device.)

As I said above, there is no solution that works in all cases.  If more
than one block is corrupt, and you don't know which ones, then you
lose, and there is no way around that.

RAID is not designed to protect against bad RAM, bad cables, chipset
bugs, driver bugs, etc.  It is only designed to protect against drive
failure, where the drive failure is apparent, i.e. a read must return
either the same data that was last written, or a failure indication.
Anything else is beyond the design parameters for RAID.

It might be possible to design a data storage system that was resilient
to these sorts of errors.  It would be much more sophisticated than
RAID though.

NeilBrown

> Bottom line: so far I was talking about *my* expectations.  Is it
> reasonable to assume that they are shared by others?  Are there any
> arguments that I'm not aware of speaking against an improved
> implementation of 'repair'?
>
> BTW: I just checked, it's the same for RAID 1: when I intentionally
> corrupt a sector in the first device of a set of 16, 'repair' copies
> the corrupted data to the 15 remaining devices instead of restoring
> the correct sector from one of the other fifteen devices to the
> first.
>
> Thank you for your time.
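The filesystem-level checksumming Neil suggests can be sketched in a few lines.  This is an illustrative toy, not any real filesystem's code; the dict-backed "store" and "index" and the function names are inventions for this sketch.  The point is that the checksum lives with the indexing metadata, not with the block, so a corrupted block cannot vouch for itself:

```python
# Sketch: per-block strong checksums stored in the indexing metadata,
# verified on every read.  Detects silent corruption; does not repair it.
import hashlib

def write_block(store, index, number, data):
    store[number] = data
    index[number] = hashlib.sha256(data).hexdigest()   # checksum kept in metadata

def read_block(store, index, number):
    data = store[number]
    if hashlib.sha256(data).hexdigest() != index[number]:
        raise IOError(f"block {number}: checksum mismatch (corruption detected)")
    return data
```

Corrupting `store[n]` behind the filesystem's back makes the next `read_block` raise instead of silently returning bad data, which is exactly the detection-without-recovery property described above.
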
Re: BUG: soft lockup detected on CPU#1! (was Re: raid6 resync blocks the entire system)
On Tuesday November 20, [EMAIL PROTECTED] wrote:
> My personal (wild) guess for this problem is that there is a global
> lock somewhere, preventing all other CPUs from doing anything.  At
> 100% (at 80 MB/s) there's probably no time left to wake up the other
> CPUs, or the window is sufficiently small that only high priority
> kernel threads can do something.  When I limit the sync to 40 MB/s,
> each resync CPU has to wait sufficiently long to allow the other CPUs
> to wake up.

md doesn't hold any locks that would prevent other parts of the kernel
from working.  I cannot imagine what would be causing your problems.
The resync thread makes a point of calling cond_resched() periodically
so that it will let other processes run even if it constantly has work
to do.

If you have nothing that could write to the RAID6 arrays, then I cannot
see how the resync could affect the rest of the system except to reduce
the amount of available CPU time.  And as CPU is normally much faster
than drives, you wouldn't expect that effect to be very great.

Very strange.

Can you do 'alt-sysrq-T' when it is frozen and get the process traces
from the kernel logs?  Can you send me the output of 'cat /proc/mdstat'
after the resync has started, but before the system has locked up?

I'm sorry that I cannot suggest anything more useful.

NeilBrown