Re: PATA/SATA Disk Reliability paper
Mark Hahn wrote:

> > In contrast, ever since these holes appeared, drive failures became the norm.
> wow, great conspiracy theory!

I think you misunderstand. I just meant plain old-fashioned mis-engineering.

> maybe the hole is plugged at the factory with a substance which evaporates
> at 1/warranty-period ;)

Actually it's plugged with a thin paper-like filter, which does not seem to evaporate easily. And it's got nothing to do with the warranty, although if you get lucky and the failure happens within the warranty period, you can probably demand a replacement drive to make you feel better. But remember, the Google report mentions a great number of drives failing for no apparent reason, not even a SMART warning, so failing within the warranty period is just pure luck.

> seriously, isn't it easy to imagine a bladder-like arrangement that permits
> equilibration without net flow? disk spec-sheets do limit this - I checked the
> seagate 7200.10: 10k feet operating, 40k max. amusingly -200 feet is the min
> either way...

Well, it looks like filtered net flow on WD's. What does it look like on Seagate's? Does anyone remember that you had to let your drives acclimate to your machine room for a day or so before you used them? The problem is, that's not enough; the room temperature/humidity has to be controlled too. In a desktop environment, that's not really feasible.

> 5-90% humidity, operating, 95% non-op, and 30%/hour. seems pretty easy to me.
> in fact, I frequently ask people to justify the assumption that a good
> machineroom needs tight control over humidity. (assuming, like most
> machinerooms, you aren't frequently handling the innards.)

I agree, but reality has a different opinion, and it may take down that drive, specs or no specs. A good way to deal with reality is to find the real reasons for failure. Once these reasons are known, engineering quality drives becomes, thank GOD, really rather easy.

Thanks!
--
Al

To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Linux Software RAID a bit of a weakness?
On Fri, 2007-02-23 at 14:55 -0500, Steve Cousins wrote:

> Yes, this is an important thing to keep on top of, both for hardware RAID and
> software RAID. For md:
>
>     echo check > /sys/block/md0/md/sync_action
>
> This should be done regularly. I have cron do it once a week. Check out:
> http://neil.brown.name/blog/20050727141521-002
>
> Good luck, Steve

Thanks for all the info. A further search around seems to reveal the seriousness of this issue. So-called disk/data scrubbing seems to be vital for keeping a modern large RAID healthy. I've found a few interesting links:

http://www.ashtech.net/~syntax/blog/archives/53-Data-Scrub-with-Linux-RAID-or-Die.html

The link of particular interest from the above is http://www.nber.org/sys-admin/linux-nas-raid.html. The really scary item, entitled "Why do drive failures come in pairs?", has the following:

===
Let's repeat the reliability calculation with our new knowledge of the situation. In our experience perhaps half of drives have at least one unreadable sector in the first year. Again assume a 6 percent chance of a single failure. The chance of at least one of the remaining two drives having a bad sector is 75% (1-(1-.5)^2). So the RAID 5 failure rate is about 4.5%/year, which is .5% MORE than the 4% failure rate one would expect from a two drive RAID 0 with the same capacity. Alternatively, if you just had two drives with a partition on each and no RAID of any kind, the chance of a failure would still be 4%/year but only half the data loss per incident, which is considerably better than the RAID 5 can even hope for under the current reconstruction policy even with the most expensive hardware.
===

That's got my attention! My RAID 5 is worse than a 2-disk RAID 0. The article goes on about a surface scan being used to mitigate this problem. It also talks about how, on reconstruction, perhaps the md driver should not just give up if it finds bad blocks on the disk but do something cleverer. I don't know if that's valid or not.
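The arithmetic in the quoted passage can be re-checked mechanically. A minimal sketch, using only the figures the post itself gives (6%/year whole-drive failure, 50% chance of a bad sector in year one, three drives):

```shell
# Re-derive the quoted numbers for a 3-drive RAID 5 (figures from the post).
p_fail=0.06   # chance of one whole-drive failure per year
p_bad=0.50    # chance a drive develops at least one unreadable sector in year 1

# chance that at least one of the two surviving drives has a bad sector
p_partner=$(awk -v p="$p_bad" 'BEGIN { printf "%.2f", 1 - (1 - p) ^ 2 }')

# chance per year of losing the array during reconstruction
p_loss=$(awk -v a="$p_fail" -v b="$p_partner" 'BEGIN { printf "%.3f", a * b }')

echo "P(surviving drive has bad sector) = $p_partner"   # 0.75
echo "P(RAID 5 data loss per year)      = $p_loss"      # 0.045
```

Which reproduces the article's 75% and 4.5%/year figures.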
But this all leaves me with a big problem. The systems I have software RAID running on are fully supported RHEL 4 ES systems (running the 2.6.9-42.0.8 kernel, which I can't really change without losing RH support), so they do not have the 'check' option in the kernel. Is there anything else I can do? Would forcing a resync achieve the same result, or is that downright dangerous, since the array is not considered consistent for a while? Any thoughts, apart from my own, which is to upgrade them to RH5 when that appears, with a probable 2.6.18 kernel (which will presumably have 'check')? Is this something that should be added to the Software-RAID-HOWTO?

Just for reference, the current Dell Perc 5i controllers have a thing called Patrol Read, which goes off and does a scrub in the background.

Thanks again

Colin
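For kernels that do have the 'check' action, Steve's weekly-cron suggestion can be sketched roughly as follows. The script name and schedule are invented for illustration; the /sys/block/mdX/md/sync_action files are the standard md sysfs interface:

```shell
# scrub-md.sh -- start a background consistency check on every md array.
# Run from cron, e.g. weekly:  30 4 * * 0  /usr/local/sbin/scrub-md.sh
scrub_all() {
    root=${1:-/sys/block}             # overridable sysfs root, for testing
    for action in "$root"/md*/md/sync_action; do
        [ -w "$action" ] || continue  # skip if no md arrays / not writable
        echo check > "$action"        # md reads every block; mismatch_cnt
    done                              # records any inconsistencies found
}
```

Calling `scrub_all` with no argument would kick off a check on every array under /sys/block; progress then shows up in /proc/mdstat like a resync.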
Linux Software RAID Bitmap Question
Anyone have a good explanation for the use of bitmaps? Anyone on the list use them?

http://gentoo-wiki.com/HOWTO_Gentoo_Install_on_Software_RAID#Data_Scrubbing

The page above provides an explanation. I believe Neil stated that using bitmaps does incur a 10% performance penalty. If one's box never (or rarely) crashes, is a bitmap needed?

The one question I had regarding a bitmap is as follows: the mismatch_cnt file. If I have bitmaps turned on for my RAID devices, is it possible that the mismatch_cnt will be updated when it finds a bad block? That would be VERY nice instead of running a check all the time.

Justin.
Re: PATA/SATA Disk Reliability paper
> > > In contrast, ever since these holes appeared, drive failures became the norm.
> > wow, great conspiracy theory!
> I think you misunderstand. I just meant plain old-fashioned mis-engineering.

I should have added a smilie. but I find it dubious that the whole industry would have made a major bungle if so many failures are due to the hole...

> But remember, the google report mentions a great number of drives failing for
> no apparent reason, not even a smart warning, so failing within the warranty
> period is just pure luck.

are we reading the same report? I look at it and see:

- lowest failures from medium-utilization drives, 30-35C.
- higher failures from young drives in general, but especially if cold or used hard.
- higher failures from end-of-life drives, especially 40C.
- scan errors, realloc counts, offline realloc and probation counts are all significant in drives which fail.

the paper seems unnecessarily gloomy about these results. to me, they're quite exciting, and provide good reason to pay a lot of attention to these factors. I hate to criticize such a valuable paper, but I think they've missed a lot by not considering the results in a fully factorial analysis as most medical/behavioral/social studies do. for instance, they bemoan a 56% false negative rate from only SMART signals, and mention that if 40C is added, the FN rate falls to 36%. also incorporating the low-young risk factor would help. I would guess that a full-on model, especially if it incorporated utilization, age, and performance, could bring the false-negative rate down to comfortable levels.

> The problem is, that's not enough; the room temperature/humidity has to be
> controlled too. In a desktop environment, that's not really feasible.

5-90% humidity, operating, 95% non-op, and 30%/hour. seems pretty easy to me. in fact, I frequently ask people to justify the assumption that a good machineroom needs tight control over humidity. (assuming, like most machinerooms, you aren't frequently handling the innards.)

> I agree, but reality has a different opinion, and it may take down that drive,
> specs or no specs.

why do you say this? I have my machineroom set for 35% (which appears to be its natural point), with a wide 20% margin on either side. I don't really want to waste cooling capacity on dehumidification, for instance, unless there's a good reason.

> A good way to deal with reality is to find the real reasons for failure. Once
> these reasons are known, engineering quality drives becomes, thank GOD, really
> rather easy.

that would be great, but it depends rather much on there being a relatively small number of variables, which are manifest, not hidden. there are billions of studies (in medical/behavioral/social fields) which assume large numbers of more or less hidden variables, and which still manage good success...

regards, mark hahn.
Re: PATA/SATA Disk Reliability paper
Mark Hahn wrote:

> this - I checked the seagate 7200.10: 10k feet operating, 40k max. amusingly
> -200 feet is the min either way...

Which means you could not use this drive on the shores of the Dead Sea, which is at about -1300 ft.

Regards, Richard
Re: nonzero mismatch_cnt with no earlier error
Sorry to hijack the thread a little, but I just noticed that the mismatch_cnt for my mirror is at 256. I'd always thought the monthly check done by the mdadm Debian package does a repair as well - apparently it doesn't. So I guess I should run repair, but I'm wondering ...

- is it safe / bugfree considering my oldish software? (mdadm 2.5.2 + linux 2.6.17.4)
- is there any way to check which files (if any) have been corrupted?
- I have grub installed by hand on both mirror components, but that shouldn't show up as a mismatch, should it?

The box in question is in production, so I'd rather not update mdadm and/or the kernel if possible.

Chris
Re: Linux Software RAID a bit of a weakness?
> You could configure smartd to do regular long selftests, which would notify
> you on failures and allow you to take the drive offline and dd, replace etc.

is it known what a long self-test does? for instance, ultimately you want the disk to be scrubbed over some fairly lengthy period of time. that is, not just read and checked, possibly with parity fixed, but all blocks read and rewritten (with verify, I suppose!) this starts to get a bit hair-raising to have entirely in the kernel - I wonder if anyone is thinking about how to pull some such activity out into user-space.

regards, mark hahn.
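As a sketch of the smartd suggestion above: smartd can schedule the self-tests itself via the `-s` regular-expression in smartd.conf. The device name, mail address, and times below are examples, not a recommendation:

```
# /etc/smartd.conf -- short self-test daily at 02:00, long self-test
# Saturdays at 03:00; -a enables the default health monitoring and
# -m mails a warning when an attribute or self-test fails.
/dev/sda -a -m root@localhost -s (S/../.././02|L/../../6/03)
```

The schedule fields are type/month/day/weekday/hour; `6` in the weekday slot is Saturday.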
Re: Linux Software RAID a bit of a weakness?
Mark Hahn wrote:

> is it known what a long self-test does? for instance, ultimately you want the
> disk to be scrubbed over some fairly lengthy period of time. that is, not just
> read and checked, possibly with parity fixed, but all blocks read and
> rewritten (with verify, I suppose!)

The smartctl man page is a little vague, but it looks like it does no writing. Paraphrasing somewhat:

short selftest - The self-tests check the electrical and mechanical performance as well as the read performance of the disk.

long selftest - This is a longer and more thorough version of the short self-test described above.

Richard
Re: PATA/SATA Disk Reliability paper
Mark Hahn wrote:

> - disks are very complicated, so their failure rates are a combination of
> conditional failure rates of many components. to take a fully reductionist
> approach would require knowing how each of ~1k parts responds to age, wear,
> temp, handling, etc. and none of those can be assumed to be independent. those
> are the real reasons, but most can't be measured directly outside a lab and
> the number of combinatorial interactions is huge.

It seems to me that the biggest problem is the 7.2k+ rpm platters themselves, especially with those heads flying closely on top of them. So we can probably forget the rest of the ~1k non-moving parts, as they have proven to be pretty reliable, most of the time.

> - factorial analysis of the data. temperature is a good example, because both
> low and high temperature affect AFR, and in ways that interact with age and/or
> utilization. this is a common issue in medical studies, which are strikingly
> similar in design (outcome is subject or disk dies...) there is a
> well-established body of practice for factorial analysis.

Agreed. We definitely need more sensors.

> - recognition that the relative results are actually quite good, even if the
> absolute results are not amazing. for instance, assume we have 1k drives, and
> a 10% overall failure rate. using all SMART but temp detects 64 of the 100
> failures and misses 36. essentially, the failure rate is now .036. I'm
> guessing that if utilization and temperature were included, the rate would be
> much lower. feedback from active testing (especially scrubbing) and
> performance under the normal workload would also help.

Are you saying you are content with premature disk failure, as long as there is a SMART warning sign? If so, then I don't think that is enough. I think the sensors should trigger some kind of shutdown mechanism as a protective measure when some threshold is reached, just like the protective measure you see for CPUs to prevent meltdown.

Thanks!
--
Al
Re: nonzero mismatch_cnt with no earlier error
Justin Piszcz wrote:

> On Sat, 24 Feb 2007, Michael Tokarev wrote:
>
> > Jason Rainforest wrote:
> >
> > > I tried doing a check, found a mismatch_cnt of 8 (7*250Gb SW RAID5,
> > > multiple controllers on Linux 2.6.19.2, SMP x86-64 on Athlon64 X2 4200+).
> > > I then ordered a resync. The mismatch_cnt returned to 0 at the start of
> > > the resync, but around the same time that it went up to 8 with the check,
> > > it went up to 8 in the resync. After the resync, it still is 8. I haven't
> > > ordered a check since the resync completed.
> >
> > As pointed out later, it was repair, not resync. As far as I understand,
> > repair will do the same as check does, but ALSO will try to fix the problems
> > found. So the number in mismatch_cnt after a repair will indicate the amount
> > of mismatches found _and fixed_ /mjt
>
> That is what I thought too (I will have to wait until I get another mismatch
> to verify), but FYI -- yesterday I had 512 mismatches for my swap partition
> (RAID1) after I ran the check. I ran repair. I catted the mismatch_cnt again,
> still 512. I re-ran the check, back to 0.

AFAIK the repair action will give you a count of the repairs it does, and will fail a drive if a read does not succeed after the sector is rewritten. That's the way I read it, and the way it seems to work.

--
bill davidsen [EMAIL PROTECTED]
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979
Re: nonzero mismatch_cnt with no earlier error
On Sun, 25 Feb 2007, Christian Pernegger wrote:

> Sorry to hijack the thread a little but I just noticed that the mismatch_cnt
> for my mirror is at 256. [...]
> - is there any way to check which files (if any) have been corrupted?

That is a very good question. Also, I hope you are not running XFS with 2.6.17.4 (corruption bug). Besides that, I wonder if it would be possible (with bitmaps, perhaps?) to have the kernel increment that and then post via ring buffer/dmesg something like:

kernel: md1: mismatch_cnt: 512, file corrupted: /etc/resolv.conf

I would take a performance hit for something like that :)

Justin.
Re: PATA/SATA Disk Reliability paper
> > and none of those can be assumed to be independent. those are the real
> > reasons, but most can't be measured directly outside a lab and the number of
> > combinatorial interactions is huge.
>
> It seems to me that the biggest problem is the 7.2k+ rpm platters themselves,
> especially with those heads flying closely on top of them. So we can probably
> forget the rest of the ~1k non-moving parts, as they have proven to be pretty
> reliable, most of the time.

dunno. non-moving parts probably have much higher reliability, but there are so many of them that they are a concern. if a discrete resistor has a 1e9 hour MTBF, 1k of them are 1e6, and that's starting to approach the claimed MTBF of a disk. any lower (or more components) and it takes over as a dominant failure mode... the Google paper doesn't really try to diagnose, but it does indicate that metrics related to media/head problems tend to promptly lead to failure (scan errors, reallocations, etc.) I guess that's circumstantial support for your theory that crashes of media/heads are the primary failure mode.

> > - factorial analysis of the data. [...] there is a well-established body of
> > practice for factorial analysis.
>
> Agreed. We definitely need more sensors.

just to be clear, I'm not saying we need more sensors, just that the existing metrics (including temp and utilization) need to be considered jointly, not independently. more metrics would be better as well, assuming they're direct readouts, not idiot-lights...

> > and performance under the normal workload would also help.
>
> Are you saying you are content with premature disk failure, as long as there
> is a SMART warning sign?

I'm saying that disk failures are inevitable. ways to reduce the chance of data loss are what we have to focus on. the Google paper shows that disks like to be at around 35C - not too cool or hot (though this is probably conflated with utilization.) the paper also shows that warning signs can indicate a majority of failures (though it doesn't present the factorial analysis necessary to tell which ones, how well, avoid false-positives, etc.)

> I think the sensors should trigger some kind of shutdown mechanism as a
> protective measure, when some threshold is reached. Just like the protective
> measure you see for CPUs to prevent meltdown.

but they already do. persistent bad reads or writes to a block will trigger its reallocation to spares, etc. for CPUs, the main threat is heat, and it's easy to throttle to cool down. for disks, the main threat is probably wear, which seems quite different - more catastrophic and less mitigatable once it starts. I'd love to hear from an actual drive engineer on the failure modes they worry about...

regards, mark hahn.
trouble creating array
Just built a new FC6 machine, with 5x 320Gb drives and 1x 300Gb drive. Made a 300Gb partition on all the drives: /dev/hd{c,d,e} and /dev/sd{a,b,c}... Trying to create an array gave me an error, since it thought there was already an array on some of the disks (and there was an old one). I decided to clear off the superblock on the drives with mdadm --zero-superblock. It worked fine on all drives, except for /dev/sd{b,c}1, which return an error:

mdadm: Couldn't open /dev/sdb1 for write - not zeroing

There doesn't seem to be a problem with the drive, as I've run a non-destructive badblocks on it, and also done a dd if=/dev/zero of=/dev/sdb1 on it, and I've written out 300Gb onto the partition. When I try and create an array using these 2 partitions, I get an error:

mdadm: Cannot open /dev/sdb1: Device or resource busy
mdadm: Cannot open /dev/sdc1: Device or resource busy

and it aborts. I've double-checked that the drives aren't mounted anywhere. There's nothing in /var/log/messages either... Any suggestions where to check next?

Graham
Re: trouble creating array
On Sun, 25 Feb 2007, jahammonds prost wrote:

> Just built a new FC6 machine, with 5x 320Gb drives and 1x 300Gb drive. [...]
> When I try and create an array using these 2 partitions, I get an error
>
> mdadm: Cannot open /dev/sdb1: Device or resource busy
> mdadm: Cannot open /dev/sdc1: Device or resource busy

Do you have an active md array?

mdadm -S /dev/md0
mdadm -S /dev/md1
mdadm -S /dev/md2
.. etc

lsof | egrep '(sdb|sdc)'

Something thinks it's in use; that is why you cannot format it/make it part of a new array. A reboot would also fix the problem.

Justin.
Re: PATA/SATA Disk Reliability paper
Mark Hahn wrote:

| if a discrete resistor has a 1e9 hour MTBF, 1k of them are 1e6

That's not actually true. As a (contrived) example, consider two cases. Case 1: failures occur at a constant rate from hours 0 through 2e9. Case 2: failures occur at a constant rate from 1e9-10 hours through 1e9+10 hours. Clearly in the former case, over 1000 components there will almost certainly be a failure by 1e8 hours. In the latter case, there will not be. Yet both have the same MTTF.

MTTF says nothing about the shape of the failure curve. It indicates only where its midpoint is. To compute the MTTF of 1000 devices, you'll need to know the probability distribution of failures over time of those 1000 devices, which can be computed from the distribution of failures over time for a single device. But, although MTTF is derived from this distribution, you cannot reconstruct the distribution knowing only MTTF.

In fact, the recent papers on disk failure indicate that common assumptions about the shape of that distribution (either a bathtub curve, or increasing failures due to wear-out after 3ish years) do not hold.

-Ben
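For reference, the 1e9 -> 1e6 step Mark used comes from the constant-failure-rate (exponential) assumption Ben is cautioning about: under that assumption failure rates simply add, so the combined MTBF divides by the part count. A quick sketch of just that arithmetic:

```shell
# Under the exponential assumption, n identical parts with MTBF m hours
# have a combined MTBF of m/n hours (rates add, so MTBF divides).
combined_mtbf() {
    awk -v m="$1" -v n="$2" 'BEGIN { printf "%.0f", m / n }'
}

echo "1000 resistors at 1e9 h each -> $(combined_mtbf 1e9 1000) h"   # 1000000
```

Ben's point stands: this shortcut is only valid when the per-part failure rate really is constant over time, which the disk-failure papers suggest it is not.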
Re: PATA/SATA Disk Reliability paper
> > if a discrete resistor has a 1e9 hour MTBF, 1k of them are 1e6
>
> That's not actually true. As a (contrived) example, consider two cases.

if you know nothing else, it's the best you can do. it's also a conservative estimate (where conservative means to expect a failure sooner).

> distribution knowing only MTTF. In fact, the recent papers on disk failure
> indicate that common assumptions about the shape of that distribution (either
> a bathtub curve, or increasing failures due to wear-out after 3ish years) do
> not hold.

the data in both the Google and Schroeder/Gibson papers are fairly noisy. yes, the strong bathtub hypothesis is apparently wrong (that infant mortality is an exp decreasing failure rate over the first year, that disks stay at a constant failure rate for the next 4-6 years, then have an exp increasing failure rate). both papers, though, show what you might call a swimming pool curve: a short period of high mortality (the clock starts when the drive leaves the factory) with a minimum failure rate at about 1 year. that's the deep end of the pool ;) then increasing failures out to the end of expected service life (warranty period). what happens after is probably too noisy to conclude much, since most people prefer not to use disks which have already seen the death of ~25% of their peers. (Google's paper has, hallelujah, error bars showing high variance at 3 years.)

both papers (and most people's experience, I think) agree that:

- there may be an infant mortality curve, but it depends on when you start counting, conditions and load in early life, etc.
- failure rates increase with age.
- failure rates in the prime of life are dramatically higher than the vendor spec sheets.
- failure rates in senescence (post warranty) are very bad.

after all, real bathtubs don't have flat bottoms! as for models and fits, well, it's complicated. consider that in a lot of environments, it takes a year or two for a new disk array to fill. so a wear-related process will initially be focused on a small area of disk, perhaps not even spread across individual disks. or consider that once the novelty of a new installation wears off, people get more worried about failures, perhaps altering their replacement strategy...
Re: trouble creating array
> Do you have an active md array?
>
> mdadm -S /dev/md0

Nothing was showing up in /proc/mdstat, but when I try and stop md0, I get this:

# mdadm -S /dev/md0
mdadm: stopped /dev/md0

> lsof | egrep '(sdb|sdc)'

I had tried that before, and nothing is showing. A reboot didn't help, but something is definitely keeping it open. I tried an mkfs:

# mkfs.ext3 /dev/sdb1
mke2fs 1.39 (29-May-2006)
/dev/sdb1 is apparently in use by the system; will not make a filesystem here!

Any ideas how to find out what has it open? I can happily write all over the disk with dd... I can create and delete the partition, and it's all good... I will try deleting the sd{b,c}1 partitions, reboot, and see what happens.

Graham
Re: Linux Software RAID Bitmap Question
On Sunday February 25, [EMAIL PROTECTED] wrote:

> Anyone have a good explanation for the use of bitmaps? Anyone on the list use
> them?
> http://gentoo-wiki.com/HOWTO_Gentoo_Install_on_Software_RAID#Data_Scrubbing
> provides an explanation. I believe Neil stated that using bitmaps does incur a
> 10% performance penalty. If one's box never (or rarely) crashes, is a bitmap
> needed?

I think I said it can incur such a penalty. The actual cost is very dependent on work-load.

> The one question I had regarding a bitmap is as follows: the mismatch_cnt
> file. If I have bitmaps turned on for my RAID devices, is it possible that the
> 'mismatch_cnt' will be updated when it finds a bad block? That would be VERY
> nice instead of running a check all the time.

When md finds a bad block (read failure) it either fixes it (by successfully over-writing the correct data) or fails the drive. The count of the times that this has happened is available via

/sys/block/mdX/md/errors

If you use version-1 superblocks, then this count is maintained throughout the life of the array. If you use v0.90, the count is zeroed whenever you assemble the array.

This count is completely separate from the 'mismatch_cnt'. 'mismatch_cnt' refers to md checking whether redundant information (copies or parity) is consistent or not. This does not happen at all during normal operation. It only happens when you ask for a 'check' or 'repair' operation. It might also happen when the array automatically performs a 'sync' after an unclean shutdown.

And all this has very little to do with bitmaps. So I'm afraid I don't understand your question.

NeilBrown
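The two separate counters Neil describes can be read side by side. A small sketch; the helper name is invented, the /sys/block/mdX/md/errors and mismatch_cnt files are the standard md sysfs paths, and the sysfs root is a parameter only so the function is testable:

```shell
# Print md's per-array counters: 'errors' (read errors md fixed, or failed a
# drive over) vs 'mismatch_cnt' (inconsistencies seen by check/repair/sync).
md_counters() {
    md=$1 root=${2:-/sys/block}
    echo "$md: errors=$(cat "$root/$md/md/errors")" \
         "mismatch_cnt=$(cat "$root/$md/md/mismatch_cnt")"
}
```

So `md_counters md0` on a healthy array would print something like `md0: errors=0 mismatch_cnt=0`; a nonzero first number means md already handled read failures, while the second only moves during check/repair/sync.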
Re: nonzero mismatch_cnt with no earlier error
On Saturday February 24, [EMAIL PROTECTED] wrote:

But is this not a good opportunity to repair the bad stripe at a very low cost (no complete resync required)?

In this case, md knew nothing about an error. The SCSI layer detected something and thought it had fixed it itself; nothing for md to do.

At the time of the error we actually know which disk failed and can re-write it, something we do not know at resync time, so I assume we always write to the parity disk.

md only knows of a 'problem' if the lower-level driver reports one. If it reports a problem for a write request, md will fail the device. If it reports a problem for a read request, md will try to over-write correct data on the failed block. But if the driver doesn't report the failure, there is nothing md can do.

When performing a check/repair, md looks for inconsistencies and fixes them 'arbitrarily'. For raid5/6, it just 'corrects' the parity. For raid1/10, it chooses one block and over-writes the other(s) with it. Mapping these corrections back to blocks in files in the filesystem is extremely non-trivial.

NeilBrown
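The check/repair operations discussed here are triggered through sysfs. A minimal sketch (md0 is an assumed array name; the bounded polling loop is just for illustration):

```shell
#!/bin/sh
# Kick off a consistency check on md0 (an assumed array name) via
# sysfs, wait for it to finish, then read the mismatch count.  Writing
# 'repair' instead of 'check' would also rewrite any inconsistencies,
# arbitrarily, as described above.
run_md_check() {
    MD=/sys/block/${1:-md0}/md
    if [ -w "$MD/sync_action" ]; then
        echo check > "$MD/sync_action"
        # Poll (bounded here, for illustration) until the check ends.
        i=0
        while grep -q check "$MD/sync_action" && [ "$i" -lt 120 ]; do
            sleep 5; i=$((i + 1))
        done
        echo "mismatch_cnt: $(cat "$MD/mismatch_cnt")"
    else
        echo "no writable $MD/sync_action here; run this on the array's host"
    fi
}
run_md_check md0
```

Note that 'check' only counts mismatches; nothing on disk changes unless you write 'repair'.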
Re: trouble creating array
On Sunday February 25, [EMAIL PROTECTED] wrote:

Any ideas how to find out what has it open? I can happily write all over the disk with dd... I can create and delete the partition, and it's all good... I will try deleting the sd{b,c}1 partitions, reboot, and see what happens.

ls -l /sys/block/*/holders/* ??

NeilBrown
Re: nonzero mismatch_cnt with no earlier error
Ok, so hearing all the excitement I ran a check on a multi-disk RAID-1. One of the RAID-1 disks failed out, maybe by coincidence but presumably due to the check. (I also have another disk in the array deliberately removed as a backup mechanism.) And of course there is a big mismatch count.

Questions: will repair do the right thing for multi-disk RAID-1, e.g. vote or something? Do I need a special version of mdadm to do this safely? What am I forgetting to ask?

Jeff

# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdf1[0] sdb1[4] sdd1[6](F) sdc1[2] sde1[1]
      488383936 blocks [6/4] [UUU_U_]
# cat /sys/block/md1/md/mismatch_cnt
128
# cat /proc/version
Linux version 2.6.17-2-amd64 (Debian 2.6.17-7) ([EMAIL PROTECTED]) (gcc version 4.1.2 20060814 (prerelease) (Debian 4.1.1-11)) #1 SMP Thu Aug 24 16:13:57 UTC 2006
# dpkg -l | grep mdadm
ii  mdadm  1.9.0-4sarge1  Manage MD devices aka Linux Software Raid
Re: end to end error recovery musings
On Friday February 23, [EMAIL PROTECTED] wrote:

On Fri, Feb 23, 2007 at 05:37:23PM -0700, Andreas Dilger wrote:

Probably the only sane thing to do is to remember the bad sectors and avoid attempting to read them; that would mean marking automatic versus explicitly requested requests to determine whether or not to filter them against a list of discovered bad blocks.

And clearing this list when the sector is overwritten, as it will almost certainly be relocated at the disk level.

For that matter, a huge win would be to have the MD RAID layer rewrite only the bad sector (in hopes of the disk relocating it) instead of failing the whole disk. Otherwise, a few read errors on different disks in a RAID set can take the whole system offline. Apologies if this is already done in recent kernels...

Yes, current md does this.

And having a way of making this list available to both the filesystem and to a userspace utility, so they can more easily deal with doing a forced rewrite of the bad sector, after determining which file is involved and perhaps doing something intelligent (up to and including automatically requesting a backup system to fetch a backup version of the file, and if it can be determined that the file shouldn't have been changed since the last backup, automatically fixing up the corrupted data block :-).

- Ted

So we want a clear path for media read errors from the device up to user-space. Stacked devices (like md) would do appropriate mappings, maybe (for raid0/linear at least; other levels wouldn't tolerate errors). There would need to be a limit on the number of 'bad blocks' that is recorded, and maybe a mechanism to clear old bad blocks from the list. Maybe if generic_make_request gets a request for a block which overlaps a 'bad block', it returns an error immediately.

Do we want a path in the other direction to handle write errors? The filesystem could say Don't worry too much if this block cannot be written, just return an error and I will write it somewhere else. This might allow md not to fail a whole drive if there is a single write error. Or is that completely unnecessary, as all modern devices do bad-block relocation for us? Is there any need for a bad-block-relocating layer in md or dm?

What about corrected-error counts? Drives provide them with SMART. The SCSI layer could provide some as well, and md can do a similar thing to some extent. Whether these are actually useful predictors of pending failure is unclear, but there could be some value: e.g. after a certain number of recovered errors, raid5 could trigger a background consistency check, or a filesystem could trigger a background fsck, should it support that.

Lots of interesting questions... not so many answers.

NeilBrown
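The SMART corrected-error counters mentioned above can already be read from userspace today. A sketch using smartctl from smartmontools (the device name and the attribute names grepped for are assumptions; attribute naming varies by vendor):

```shell
#!/bin/sh
# Read the corrected/pending error attributes SMART exposes, using
# smartctl from smartmontools.  Attribute names vary by vendor;
# Reallocated_Sector_Ct (5), Current_Pending_Sector (197) and
# Hardware_ECC_Recovered (195) are common examples.  /dev/sda is an
# example device.
smart_error_counts() {
    DEV=${1:-/dev/sda}
    if command -v smartctl >/dev/null 2>&1; then
        smartctl -A "$DEV" 2>/dev/null \
            | egrep -i 'realloc|ecc|pending|uncorrect' \
            || echo "no SMART attributes readable on $DEV"
    else
        echo "smartmontools not installed (provides smartctl)"
    fi
}
smart_error_counts
```

A monitoring daemon could watch these counters and, as Neil suggests, kick off an md 'check' once they climb past some threshold.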
Re: end to end error recovery musings
H. Peter Anvin wrote:

Ric Wheeler wrote: We still have the following challenges: (1) read-ahead often means that we will retry every bad sector at least twice from the file system level. The first time, the fs read-ahead request triggers a speculative read that includes the bad sector (triggering the error-handling mechanisms) right before the real application read does the same thing. Not sure what the answer is here, since read-ahead is obviously a huge win in the normal case.

Probably the only sane thing to do is to remember the bad sectors and avoid attempting to read them; that would mean marking automatic versus explicitly requested requests to determine whether or not to filter them against a list of discovered bad blocks.

Some disks do their own read-ahead in the form of a background media scan. Scans are done on request or periodically (e.g. once per day or once per week), and we have tools that can fetch the scan results from a disk (e.g. a list of unreadable sectors). What we don't have is any way to feed such information to a filesystem that may be impacted.

Doug Gilbert
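For SCSI disks, the tools Doug mentions live in sg3_utils; if I understand correctly, the Background Scan Results log page (0x15) is where a background media scan reports its findings. A hedged sketch (device name is an example, and not every disk implements background scans or this log page):

```shell
#!/bin/sh
# Fetch a disk's Background Scan Results log page with sg_logs from
# sg3_utils.  Page 0x15 is the Background Scan Results page; the
# device name is an example, and not every disk implements the page
# or background media scans at all.
bg_scan_results() {
    DEV=${1:-/dev/sdb}
    if command -v sg_logs >/dev/null 2>&1; then
        sg_logs --page=0x15 "$DEV" 2>/dev/null \
            || echo "$DEV does not report a background scan log page"
    else
        echo "sg3_utils not installed (provides sg_logs)"
    fi
}
bg_scan_results
```

Plumbing this output back to the filesystem layer is exactly the missing piece Doug describes.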