Re: PATA/SATA Disk Reliability paper

2007-02-25 Thread Al Boldi
Mark Hahn wrote:
  In contrast, ever since these holes appeared, drive failures became the
  norm.

 wow, great conspiracy theory!

I think you misunderstand.  I just meant plain old-fashioned mis-engineering.

 maybe the hole is plugged at
 the factory with a substance which evaporates at 1/warranty-period ;)

Actually it's plugged with a thin paper-like filter, which does not seem to 
evaporate easily.

And it's got nothing to do with warranty, although if you get lucky and the 
failure happens within the warranty period, you can probably demand a 
replacement drive to make you feel better.

But remember, the google report mentions a great number of drives failing for 
no apparent reason, not even a smart warning, so failing within the warranty 
period is just pure luck.

 seriously, isn't it easy to imagine a bladder-like arrangement that
 permits equilibration without net flow?  disk spec-sheets do limit
 this - I checked the seagate 7200.10: 10k feet operating, 40k max.
 amusingly -200 feet is the min either way...

Well, it looks like filtered net flow on wd's.

What's it look like on seagate?

 Does anyone remember that you had to let your drives acclimate to
  your machine room for a day or so before you used them?
 
  The problem is, that's not enough; the room temperature/humidity has to
  be controlled too.  In a desktop environment, that's not really
  feasible.

 5-90% humidity, operating, 95% non-op, and 30%/hour.  seems pretty easy
 to me.  in fact, I frequently ask people to justify the assumption that
 a good machineroom needs tight control over humidity.  (assuming, like
 most machinerooms, you aren't frequently handling the innards.)

I agree, but reality has a different opinion, and it may take down that 
drive, specs or no specs.

A good way to deal with reality is to find the real reasons for failure.  
Once these reasons are known, engineering quality drives becomes, thank GOD, 
really rather easy.


Thanks!

--
Al



Re: Linux Software RAID a bit of a weakness?

2007-02-25 Thread Colin Simpson
On Fri, 2007-02-23 at 14:55 -0500, Steve Cousins wrote:
 Yes, this is an important thing to keep on top of, both for hardware 
 RAID and software RAID.  For md:
 
   echo check > /sys/block/md0/md/sync_action
 
 This should be done regularly. I have cron do it once a week.
 
 Check out: http://neil.brown.name/blog/20050727141521-002
 
 Good luck,
 
 Steve

Thanks for all the info. 

A further search around seems to reveal the seriousness of this issue. 
So-called Disk/Data Scrubbing seems to be vital for keeping a modern
large RAID healthy.
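For reference, a weekly scrub can be wired up via cron along these lines
(a sketch only; the /etc/cron.d path, the schedule and the single /dev/md0
array are my assumptions):

    # /etc/cron.d/md-scrub -- hypothetical example
    # Start a background consistency check of md0 every Sunday at 04:00;
    # progress can be watched in /proc/mdstat.
    0 4 * * 0  root  [ -w /sys/block/md0/md/sync_action ] && echo check > /sys/block/md0/md/sync_action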

I've found a few interesting links. 

http://www.ashtech.net/~syntax/blog/archives/53-Data-Scrub-with-Linux-RAID-or-Die.html

The link of particular interest from the above is

http://www.nber.org/sys-admin/linux-nas-raid.html

The really scary item is entitled "Why do drive failures come in
pairs?"; it has the following:

===
Let's repeat the reliability calculation with our new knowledge of the
situation. In our experience perhaps half of drives have at least one
unreadable sector in the first year. Again assume a 6 percent chance of
a single failure. The chance of at least one of the remaining two drives
having a bad sector is 75% (1-(1-.5)^2). So the RAID 5 failure rate is
about 4.5%/year, which is .5% MORE than the 4% failure rate one would
expect from a two drive RAID 0 with the same capacity. Alternatively, if
you just had two drives with a partition on each and no RAID of any
kind, the chance of a failure would still be 4%/year but only half the
data loss per incident, which is considerably better than the RAID 5 can
even hope for under the current reconstruction policy even with the most
expensive hardware.
===
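A quick back-of-the-envelope check of those numbers (the 6% and 50% figures
are the article's assumptions, not mine):

    awk 'BEGIN {
        p_fail = 0.06;                # chance of a whole-drive failure per year
        p_bad  = 0.5;                 # chance a drive has >=1 unreadable sector
        p_rest = 1 - (1 - p_bad)^2;   # at least one of the other two drives: 0.75
        printf "RAID5 data-loss rate: %.1f%%/year\n", 100 * p_fail * p_rest
    }'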

That's got my attention! My RAID 5 is worse than a 2 disk RAID 0. It
goes on about a surface scan being used to mitigate this problem. The
article also talks about how on reconstruction perhaps the md driver
should not just give up if it finds bad blocks on the disk but do
something cleverer. I don't know if that's valid or not.

But this all leaves me with a big problem. The systems I have
Software RAID running on are fully supported RH 4 ES systems (running the
2.6.9-42.0.8 kernel), so I can't really change the kernel without losing RH
support.

They therefore do not have the check option in the kernel. Is there
anything else I can do? Would forcing a resync achieve the same result
(or is that downright dangerous, as the array is not considered
consistent for a while)? Any thoughts, apart from my own of upgrading
them to RH5 when that appears with a probably-2.6.18 kernel (which will
presumably have check)?
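One crude way to see whether a given kernel exposes the knob at all (a
sketch; testing for the sysfs file is only a heuristic):

    # If the attribute is missing, this md has no on-line check/repair support.
    if [ -f /sys/block/md0/md/sync_action ]; then
        echo "md0 supports sync_action; current state: $(cat /sys/block/md0/md/sync_action)"
    else
        echo "no sync_action attribute - kernel too old for md scrubbing" >&2
    fi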

Is this something that should be added to the Software-RAID-HOWTO? 

Just for reference, the current Dell Perc 5i controllers have a feature
called Patrol Read, which goes off and does a scrub in the background.

Thanks again

Colin




Linux Software RAID Bitmap Question

2007-02-25 Thread Justin Piszcz

Anyone have a good explanation for the use of bitmaps?

Anyone on the list use them?

http://gentoo-wiki.com/HOWTO_Gentoo_Install_on_Software_RAID#Data_Scrubbing

Provides an explanation on that page.

I believe Neil stated that using bitmaps does incur a 10% performance 
penalty.  If one's box never (or rarely) crashes, is a bitmap needed?
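For what it's worth, a write-intent bitmap can be added to or removed from
an existing array with mdadm, along these lines (a sketch; assumes a
reasonably recent mdadm/kernel and /dev/md0):

    mdadm --grow /dev/md0 --bitmap=internal   # add an internal write-intent bitmap
    mdadm --grow /dev/md0 --bitmap=none       # remove it again if the overhead hurts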


The one question I had regarding a bitmap is as follows:

The mismatch_cnt file.

If I have bitmaps turned on for my RAID DEVICES, is it possible that the 
'mismatch_cnt' will be updated when it finds a bad block?


That would be VERY nice instead of running a check all the time.

Justin.


Re: PATA/SATA Disk Reliability paper

2007-02-25 Thread Mark Hahn

In contrast, ever since these holes appeared, drive failures became the
norm.


wow, great conspiracy theory!


I think you misunderstand.  I just meant plain old-fashioned mis-engineering.


I should have added a smilie.  but I find it dubious that the whole 
industry would have made a major bungle if so many failures are due to 
the hole...



But remember, the google report mentions a great number of drives failing for
no apparent reason, not even a smart warning, so failing within the warranty
period is just pure luck.


are we reading the same report?  I look at it and see:

- lowest failures from medium-utilization drives, 30-35C.
- higher failures from young drives in general, but especially
if cold or used hard.
- higher failures from end-of-life drives, especially > 40C.
- scan errors, realloc counts, offline realloc and probation
counts are all significant in drives which fail.

the paper seems unnecessarily gloomy about these results.  to me, they're
quite exciting, and provide good reason to pay a lot of attention to these
factors.  I hate to criticize such a valuable paper, but I think they've
missed a lot by not considering the results in a fully factorial analysis
as most medical/behavioral/social studies do.  for instance, they bemoan
a 56% false negative rate from SMART signals alone, and mention that if
>40C is added, the FN rate falls to 36%.  also incorporating the cold/young
risk factor would help.  I would guess that a full-on model, especially
if it incorporated utilization, age and performance, could get the false
negative rate down to comfortable levels.


The problem is, that's not enough; the room temperature/humidity has to
be controlled too.  In a desktop environment, that's not really
feasible.


5-90% humidity, operating, 95% non-op, and 30%/hour.  seems pretty easy
to me.  in fact, I frequently ask people to justify the assumption that
a good machineroom needs tight control over humidity.  (assuming, like
most machinerooms, you aren't frequently handling the innards.)


I agree, but reality has a different opinion, and it may take down that
drive, specs or no specs.


why do you say this?  I have my machineroom set for 35% (which appears 
to be its natural point, with a wide 20% margin on either side).

I don't really want to waste cooling capacity on dehumidification,
for instance, unless there's a good reason.


A good way to deal with reality is to find the real reasons for failure.
Once these reasons are known, engineering quality drives becomes, thank GOD,
really rather easy.


that would be great, but it depends rather much on there being a relatively
small number of variables, which are manifest, not hidden.  there are billions
of studies (in medical/behavioral/social fields) which assume large numbers of
more or less hidden variables, and which still manage good success...

regards, mark hahn.


Re: PATA/SATA Disk Reliability paper

2007-02-25 Thread Richard Scobie

Mark Hahn wrote:


this - I checked the seagate 7200.10: 10k feet operating, 40k max.
amusingly -200 feet is the min either way...


Which means you could not use this drive on the shores of the Dead Sea, 
which is at about -1300ft.


Regards,

Richard


Re: nonzero mismatch_cnt with no earlier error

2007-02-25 Thread Christian Pernegger

Sorry to hijack the thread a little but I just noticed that the
mismatch_cnt for my mirror is at 256.

I'd always thought the monthly check done by the mdadm Debian package
does repair as well - apparently it doesn't.

So I guess I should run repair but I'm wondering ...
- is it safe / bugfree considering my oldish software? (mdadm 2.5.2 +
linux 2.6.17.4)
- is there any way to check which files (if any) have been corrupted?
- I have grub installed by hand on both mirror components, but that
shouldn't show up as mismatch, should it?

The box in question is in production so I'd rather not update mdadm
and/or kernel if possible.

Chris


Re: Linux Software RAID a bit of a weakness?

2007-02-25 Thread Mark Hahn
You could configure smartd to do regular long selftests, which would notify 
you on failures and allow you to take the drive offline and dd, replace etc.
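(For reference, the sort of smartd.conf entry being suggested, as a sketch
only, with device name, schedule and mail target as placeholders:

    # /etc/smartd.conf -- sketch only
    # -a: monitor all attributes; -s L/../../7/03: long self-test every Sunday ~03:00;
    # -m root: mail warnings to root
    /dev/sda -a -s L/../../7/03 -m root
)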


is it known what a long self-test does?  for instance, ultimately you
want the disk to be scrubbed over some fairly lengthy period of time.
that is, not just read and checked, possibly with parity fixed,
but all blocks read and rewritten (with verify, I suppose!)

this starts to get a bit hair-raising to have entirely in the kernel - 
I wonder if anyone is thinking about how to pull some such activity 
out into user-space.


regards, mark hahn.


Re: Linux Software RAID a bit of a weakness?

2007-02-25 Thread Richard Scobie

Mark Hahn wrote:


is it known what a long self-test does?  for instance, ultimately you
want the disk to be scrubbed over some fairly lengthy period of time.
that is, not just read and checked, possibly with parity fixed,
but all blocks read and rewritten (with verify, I suppose!)


The smartctl man page is a little vague, but it looks like it does no 
writing.



Paraphrasing somewhat:

short selftest - The Self tests check the electrical and
mechanical performance as well as the read performance of the disk.

long selftest - This is a longer and more thorough version of the
Short Self Test described above.
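The corresponding one-off commands, for anyone wanting to poke at it by
hand (a sketch; /dev/sda is a placeholder):

    smartctl -t long /dev/sda      # start a long (extended) self-test in the background
    smartctl -l selftest /dev/sda  # later: show the self-test log and results
    smartctl -H /dev/sda           # overall SMART health verdict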


Richard


Re: PATA/SATA Disk Reliability paper

2007-02-25 Thread Al Boldi
Mark Hahn wrote:
   - disks are very complicated, so their failure rates are a
   combination of conditional failure rates of many components.
   to take a fully reductionist approach would require knowing
   how each of ~1k parts responds to age, wear, temp, handling, etc.
   and none of those can be assumed to be independent.  those are the
   real reasons, but most can't be measured directly outside a lab
   and the number of combinatorial interactions is huge.

It seems to me that the biggest problem is the 7.2k+ rpm platters 
themselves, especially with those heads flying so closely over them.  So, 
we can probably forget the rest of the ~1k non-moving parts, as they have 
proven to be pretty reliable, most of the time.

   - factorial analysis of the data.  temperature is a good
   example, because both low and high temperature affect AFR,
   and in ways that interact with age and/or utilization.  this
   is a common issue in medical studies, which are strikingly
   similar in design (outcome is subject or disk dies...)  there
   is a well-established body of practice for factorial analysis.

Agreed.  We definitely need more sensors.

   - recognition that the relative results are actually quite good,
   even if the absolute results are not amazing.  for instance,
   assume we have 1k drives, and a 10% overall failure rate.  using
   all SMART but temp detects 64 of the 100 failures and misses 36.
   essentially, the failure rate is now .036.  I'm guessing that if
   utilization and temperature were included, the rate would be much
   lower.  feedback from active testing (especially scrubbing)
   and performance under the normal workload would also help.

Are you saying you are content with premature disk failure, as long as 
there is a SMART warning sign?

If so, then I don't think that is enough.

I think the sensors should trigger some kind of shutdown mechanism as a 
protective measure, when some threshold is reached.  Just like the 
protective measure you see for CPUs to prevent meltdown.

Thanks!

--
Al



Re: nonzero mismatch_cnt with no earlier error

2007-02-25 Thread Bill Davidsen

Justin Piszcz wrote:



On Sat, 24 Feb 2007, Michael Tokarev wrote:


Jason Rainforest wrote:

I tried doing a check, found a mismatch_cnt of 8 (7*250Gb SW RAID5,
multiple controllers on Linux 2.6.19.2, SMP x86-64 on Athlon64 X2 4200+).

I then ordered a resync. The mismatch_cnt returned to 0 at the start of


As pointed out later it was repair, not resync.


the resync, but around the same time that it went up to 8 with the
check, it went up to 8 in the resync.  After the resync, it still is 8.  I
haven't ordered a check since the resync completed.


As far as I understand, repair will do the same as check does, but ALSO
will try to fix the problems found.  So the number in mismatch_cnt after
a repair will indicate the number of mismatches found _and fixed_.

/mjt



That is what I thought too (I will have to wait until I get another 
mismatch to verify), but FYI--


Yesterday I had 512 mismatches for my swap partition (RAID1) after I 
ran the check.


I ran repair.

I catted the mismatch_cnt again, still 512.

I re-ran the check, back to 0. 


AFAIK the repair action will give you a count of the repairs it does, 
and will fail a drive if a read does not succeed after the sector is 
rewritten. That's the way I read it, and the way it seems to work.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979




Re: nonzero mismatch_cnt with no earlier error

2007-02-25 Thread Justin Piszcz



On Sun, 25 Feb 2007, Christian Pernegger wrote:


Sorry to hijack the thread a little but I just noticed that the
mismatch_cnt for my mirror is at 256.

I'd always thought the monthly check done by the mdadm Debian package
does repair as well - apparently it doesn't.

So I guess I should run repair but I'm wondering ...
- is it safe / bugfree considering my oldish software? (mdadm 2.5.2 +
linux 2.6.17.4)
- is there any way to check which files (if any) have been corrupted?
- I have grub installed by hand on both mirror components, but that
shouldn't show up as mismatch, should it?

The box in question is in production so I'd rather not update mdadm
and/or kernel if possible.

Chris



That is a very good question.. Also I hope you are not running XFS with 
2.6.17.4.  (corruption bug)


Besides that, I wonder if it would be possible (with bitmaps perhaps(?)) 
to have the kernel increment that and then post via ring buffer/dmesg, 
something like:


kernel: md1: mismatch_cnt: 512, file corrupted: /etc/resolv.conf

I would take a performance hit for something like that :)

Justin.



Re: PATA/SATA Disk Reliability paper

2007-02-25 Thread Mark Hahn

and none of those can be assumed to be independent.  those are the
real reasons, but most can't be measured directly outside a lab
and the number of combinatorial interactions is huge.


It seems to me that the biggest problem are the 7.2k+ rpm platters
themselves, especially with those heads flying closely on top of them.  So,
we can probably forget the rest of the ~1k non-moving parts, as they have
proven to be pretty reliable, most of the time.


dunno.  non-moving parts probably have much higher reliability, but
there are so many of them that they become a concern.  if a discrete resistor
has a 1e9 hour MTBF, 1k of them give 1e6, and that's starting to approach
the claimed MTBF of a disk.  any lower (or more components) and it
takes over as the dominant failure mode...
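(the arithmetic behind that, as a rough sketch; it assumes independent,
constant failure rates, which is itself a simplification:

    awk 'BEGIN {
        mtbf_part = 1e9;   # hours, per component
        n = 1000;          # components in series: any one failing fails the unit
        printf "combined MTBF: %.0f hours (~%.0f years)\n",
               mtbf_part / n, mtbf_part / n / 8766
    }'
)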

the Google paper doesn't really try to diagnose, but it does indicate
that metrics related to media/head problems tend to promptly lead to failure.
(scan errors, reallocations, etc.)  I guess that's circumstantial support
for your theory that crashes of media/heads are the primary failure mode.


- factorial analysis of the data.  temperature is a good
example, because both low and high temperature affect AFR,
and in ways that interact with age and/or utilization.  this
is a common issue in medical studies, which are strikingly
similar in design (outcome is subject or disk dies...)  there
is a well-established body of practice for factorial analysis.


Agreed.  We definitely need more sensors.


just to be clear, I'm not saying we need more sensors, just that the
existing metrics (including temp and utilization) need to be considered
jointly, not independently.  more metrics would be better as well,
assuming they're direct readouts, not idiot-lights...


and performance under the normal workload would also help.


Are you saying, you are content with pre-mature disk failure, as long as
there is a smart warning sign?


I'm saying that disk failures are inevitable.  ways to reduce the chance
of data loss are what we have to focus on.  the Google paper shows that 
disks like to be at around 35C - not too cool or hot (though this is probably
conflated with utilization.)  the paper also shows that warning signs can 
indicate a majority of failures (though it doesn't present the factorial 
analysis necessary to tell which ones, how well, avoid false-positives, etc.)



I think the sensors should trigger some kind of shutdown mechanism as a
protective measure, when some threshold is reached.  Just like the
protective measure you see for CPUs to prevent meltdown.


but they already do.  persistent bad reads or writes to a block will trigger
its reallocation to spares, etc.  for CPUs, the main threat is heat, and it's 
easy to throttle to cool down.  for disks, the main threat is probably wear, 
which seems quite different - more catastrophic and less mitigatable
once it starts.

I'd love to hear from an actual drive engineer on the failure modes 
they worry about...


regards, mark hahn.


trouble creating array

2007-02-25 Thread jahammonds prost
Just built a new FC6 machine, with 5x 320Gb drives and 1x 300Gb drive. Made a 
300Gb partition on all the drives /dev/hd{c,d,e} and /dev/sd{a,b,c}... Trying 
to create an array gave me an error, since it thought there was already an 
array on some of the disks (and there was an old one).

I decided to clear off the superblock on the drives with mdadm 
--zero-superblock on all the drives. It worked fine on all drives, except for 
/dev/sd{b,c}1, which returns an error "mdadm: Couldn't open /dev/sdb1 for write 
- not zeroing". There doesn't seem to be a problem with the drive, as I've run 
a non-destructive badblocks on it, and also done a dd if=/dev/zero of=/dev/sdb1 
on it, and I've written out 300Gb onto the partition.

When I try and create an array using these 2 partitions, I get an error

mdadm: Cannot open /dev/sdb1: Device or resource busy
mdadm: Cannot open /dev/sdc1: Device or resource busy

and it aborts. I've double checked that the drives aren't mounted anywhere. 
There's nothing in /var/log/messages either...

Any suggestions where to check next?



Graham





Re: trouble creating array

2007-02-25 Thread Justin Piszcz



On Sun, 25 Feb 2007, jahammonds prost wrote:


Just built a new FC6 machine, with 5x 320Gb drives and 1x 300Gb drive. Made a 
300Gb partition on all the drives /dev/hd{c,d,e} and /dev/sd{a,b,c}... Trying 
to create an array gave me an error, since it thought there was already an 
array on some of the disks (and there was an old one).

I decided to clear off the superblock on the drives with mdadm --zero-superblock on all 
the drives. It worked fine on all drives, except for /dev/sd{b,c)1, which returns an 
error mdadm: Couldn't open /dev/sdb1 for write - not zeroing. There doesn't 
seem to be a problem with the drive, as I've run a non destructive badblocks on it, and 
also done a dd if=/dev/zero of=/dev/sdb1 on it, and Ive written out 300Gb onto the 
partition.

When I try and create an array using these 2 partitions, I get an error

mdadm: Cannot open /dev/sdb1: Device or resource busy
mdadm: Cannot open /dev/sdc1: Device or resource busy

and it aborts. I've double checked that the drives aren't mounted anywhere. 
There's nothing in /var/log/messages either...

Any suggestions where to check next?



Graham






Do you have an active md array?

mdadm -S /dev/md0
mdadm -S /dev/md1
mdadm -S /dev/md2

.. etc

lsof | egrep '(sdb|sdc)'

Something thinks it's in use; that is why you cannot format it or make it
part of a new array.  A reboot would also fix the problem.


Justin.


Re: PATA/SATA Disk Reliability paper

2007-02-25 Thread Benjamin Davenport


Mark Hahn wrote:
| if a discrete resistor has a 1e9 hour MTBF, 1k of them are 1e6

That's not actually true.  As a (contrived) example, consider two cases.

Case 1: failures occur at constant rate from hours 0 through 2e9.
Case 2: failures occur at constant rate from 1e9-10 hours through 1e9+10 hours.

Clearly in the former case, over 1000 components there will almost certainly be
a failure by 1e8 hours.  In the latter case, there will not be.  Yet both have
the same MTTF.


MTTF says nothing about the shape of the failure curve.  It indicates only where
its mean lies.  To compute the MTTF of 1000 devices, you'll need to know the
probability distribution of failures over time of those 1000 devices, which can
be computed from the distribution of failures over time for a single device.
But, although MTTF is derived from this distribution, you cannot reconstruct the
distribution knowing only MTTF.  In fact, the recent papers on disk failure
indicate that common assumptions about the shape of that distribution (either a
bathtub curve, or increasing failures due to wear-out after 3ish years) do not 
hold.

- -Ben


Re: PATA/SATA Disk Reliability paper

2007-02-25 Thread Mark Hahn

| if a discrete resistor has a 1e9 hour MTBF, 1k of them are 1e6

That's not actually true.  As a (contrived) example, consider two cases.


if you know nothing else, it's the best you can do.  it's also a 
conservative estimate (where conservative means to expect a failure sooner).



distribution knowing only MTTF.  In fact, the recent papers on disk failure
indicate that common assumptions about the shape of that distribution (either a
bathtub curve, or increasing failures due to wear-out after 3ish years) do
not hold.


the data in both the Google and SchroederGibson papers are fairly noisy.
yes, the strong bathtub hypothesis is apparently wrong (that infant
mortality is an exp decreasing failure rate over the first year, that
disks stay at a constant failure rate for the next 4-6 years, then have 
an exp increasing failure rate).


both papers, though, show what you might call a swimming pool curve:
a short period of high mortality (clock starts when the drive leaves 
the factory) with a minimum failure rate at about 1 year.  that's the 
deep end of the pool ;)  then increasing failures out to the end of 
expected service life (warranty period).  what happens after is probably
too noisy to conclude much, since most people prefer not to use disks 
which have already seen the death of ~25% of their peers.  (Google's 
paper has, halleluiah, error bars showing high variance at 3 years.)


both papers (and most people's experience, I think) agree that:
- there may be an infant mortality curve, but it depends on
when you start counting, conditions and load in early life, etc.
- failure rates increase with age.
- failure rates in the prime of life are dramatically higher
than the vendor spec sheets.
- failure rates in senescence (post warranty) are very bad.

after all, real bathtubs don't have flat bottoms!

as for models and fits, well, it's complicated.  consider that in a lot
of environments, it takes a year or two for a new disk array to fill.
so a wear-related process will initially be focused on a small area of 
disk, perhaps not even spread across individual disks.  or consider that
once the novelty of a new installation wears off, people get more worried
about failures, perhaps altering their replacement strategy...


Re: trouble creating array

2007-02-25 Thread jahammonds prost
 Do you have an active md array?

 mdadm -S /dev/md0

Nothing was showing up in /proc/mdstat, but when I try and stop md0, I get this

# mdadm -S /dev/md0
mdadm: stopped /dev/md0

 lsof | egrep '(sdb|sdc)'

I had tried that before, and nothing is showing. A reboot didn't help, but 
something is definitely keeping it open. I tried an mkfs:

# mkfs.ext3 /dev/sdb1
mke2fs 1.39 (29-May-2006)
/dev/sdb1 is apparently in use by the system; will not make a filesystem here!


Any ideas how to find out what has it open? I can happily write all over the 
disk with dd... I can create and delete the partition, and it's all good... I 
will try deleting the sd{b,c}1 partitions, reboot, and see what happens.
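For the record, a few ways to ask the kernel who is holding a block device
(a sketch; one guess is that device-mapper/dmraid has claimed the disks on
FC6, but that is only a guess):

    ls -l /sys/block/sdb/sdb1/holders/   # md or device-mapper devices claiming the partition
    cat /proc/mdstat                     # any arrays, active or not, still referencing it?
    fuser -v /dev/sdb1                   # userspace processes holding the device node open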


Graham









Re: Linux Software RAID Bitmap Question

2007-02-25 Thread Neil Brown
On Sunday February 25, [EMAIL PROTECTED] wrote:
 Anyone have a good explanation for the use of bitmaps?
 
 Anyone on the list use them?
 
 http://gentoo-wiki.com/HOWTO_Gentoo_Install_on_Software_RAID#Data_Scrubbing
 
 Provides an explanation on that page.
 
 I believe Neil stated that using bitmaps does incur a 10% performance 
 penalty.  If one's box never (or rarely) crashes, is a bitmap needed?

I think I said it can incur such a penalty.  The actual cost is very
dependent on the workload.

 
 The one question I had regarding a bitmap is as follows:
 
 The mismatch_cnt file.
 
 If I have bitmaps turned on for my RAID DEVICES, is it possible that the 
 'mismatch_cnt' will be updated when it finds a bad block?
 
 That would be VERY nice instead of running a check all the time.

When md finds a bad block (read failure) it either fixes it (by
successfully over-writing it with the correct data) or fails the drive.

The count of the times that this has happened is available via
   /sys/block/mdX/md/errors

If you use version-1 superblocks, then this count is maintained
throughout the life of the array.  If you use v0.90, the count is
zeroed whenever you assemble the array.

This count is completely separate from the 'mismatch_cnt'.
'mismatch_cnt' refers to when md checks whether redundant information
(copies or parity) is consistent or not.  This does not happen at all
during normal operation.  It only happens when you ask for a 'check'
or 'repair' operation.  It might also happen when the array
automatically performs a 'sync' after an unclean shutdown.
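In other words, the two counters live side by side (trivial example,
/dev/md0 assumed):

    cat /sys/block/md0/md/errors        # read errors corrected over the array's life (v1 superblock)
    cat /sys/block/md0/md/mismatch_cnt  # inconsistencies seen by the last check/repair/resync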

And all this has very little to do with bitmaps.
So I'm afraid I don't understand your question.

NeilBrown


Re: nonzero mismatch_cnt with no earlier error

2007-02-25 Thread Neil Brown
On Saturday February 24, [EMAIL PROTECTED] wrote:
 But is this not a good opportunity to repair the bad stripe for a very
 low cost (no complete resync required)?

In this case, 'md' knew nothing about an error.  The SCSI layer
detected something and thought it had fixed it itself.  Nothing for md
to do.

 
 At time of error we actually know which disk failed and can re-write
 it, something we do not know at resync time, so I assume we always
 write to the parity disk.

md only knows of a 'problem' if the lower level driver reports one.
If it reports a problem for a write request, md will fail the device.
If it reports a problem for a read request, md will try to over-write
correct data on the failed block. 
But if the driver doesn't report the failure, there is nothing md can
do.

When performing a check/repair, md looks for inconsistencies and fixes
them 'arbitrarily'.  For raid5/6, it just 'corrects' the parity.  For
raid1/10, it chooses one block and over-writes the other(s) with it.
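As a concrete sequence (sketch, /dev/md0 assumed):

    echo check  > /sys/block/md0/md/sync_action   # read-only pass, just counts mismatches
    cat /sys/block/md0/md/mismatch_cnt            # how many mismatches were found
    echo repair > /sys/block/md0/md/sync_action   # same pass, but rewrites to restore consistency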

Mapping these corrections back to blocks in files in the filesystem is
extremely non-trivial.

NeilBrown


Re: trouble creating array

2007-02-25 Thread Neil Brown
On Sunday February 25, [EMAIL PROTECTED] wrote:
 
 
 Any ideas how to find out what has it open? I can happily write all over the 
 disk with dd... I can create and delete the partition, and it's all good... I 
 will try deleting the sd{b,c}1 partitions, reboot, and see what happens.
 

ls -l /sys/block/*/holders/* ??

NeilBrown


Re: nonzero mismatch_cnt with no earlier error

2007-02-25 Thread Jeff Breidenbach

Ok, so hearing all the excitement I ran a check on a multi-disk
RAID-1. One of the RAID-1 disks failed out, maybe by coincidence
but presumably due to the check. (I also have another disk in
the array deliberately removed as a backup mechanism.) And
of course there is a big mismatch count.

Questions: will repair do the right thing for multidisk RAID-1, e.g.
vote or something? Do I need a special version of mdadm to
do this safely? What am I forgetting to ask?

Jeff


# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdf1[0] sdb1[4] sdd1[6](F) sdc1[2] sde1[1]
 488383936 blocks [6/4] [UUU_U_]

# cat /sys/block/md1/md/mismatch_cnt
128

# cat /proc/version
Linux version 2.6.17-2-amd64 (Debian 2.6.17-7) ([EMAIL PROTECTED]) (gcc
version 4.1.2 20060814 (prerelease) (Debian 4.1.1-11)) #1 SMP Thu Aug
24 16:13:57 UTC 2006

# dpkg -l | grep  mdadm
ii  mdadm  1.9.0-4sarge1  Manage MD devices aka Linux Software Raid


Re: end to end error recovery musings

2007-02-25 Thread Neil Brown
On Friday February 23, [EMAIL PROTECTED] wrote:
 On Fri, Feb 23, 2007 at 05:37:23PM -0700, Andreas Dilger wrote:
   Probably the only sane thing to do is to remember the bad sectors and 
   avoid attempting reading them; that would mean marking automatic 
   versus explicitly requested requests to determine whether or not to 
   filter them against a list of discovered bad blocks.
  
  And clearing this list when the sector is overwritten, as it will almost
  certainly be relocated at the disk level.  For that matter, a huge win
  would be to have the MD RAID layer rewrite only the bad sector (in hopes
  of the disk relocating it) instead of failing the whole disk.  Otherwise,
  a few read errors on different disks in a RAID set can take the whole
  system offline.  Apologies if this is already done in recent kernels...

Yes, current md does this.

 
 And having a way of making this list available to both the filesystem
 and to a userspace utility, so they can more easily deal with doing a
 forced rewrite of the bad sector, after determining which file is
 involved and perhaps doing something intelligent (up to and including
 automatically requesting a backup system to fetch a backup version of
 the file, and if it can be determined that the file shouldn't have
 been changed since the last backup, automatically fixing up the
 corrupted data block :-).
 
   - Ted

So we want a clear path for media read errors from the device up to
user-space.  Stacked devices (like md) would do appropriate mappings
maybe (for raid0/linear at least.  Other levels wouldn't tolerate
errors).
There would need to be a limit on the number of 'bad blocks' that is
recorded.  Maybe a mechanism to clear old bad  blocks from the list is
needed.

Maybe if generic_make_request gets a request for a block which
overlaps a 'bad-block' it returns an error immediately.

Do we want a path in the other direction to handle write errors?  The
file system could say "Don't worry too much if this block cannot be
written, just return an error and I will write it somewhere else".
This might allow md not to fail a whole drive if there is a single
write error.
Or is that completely un-necessary as all modern devices do bad-block
relocation for us?
Is there any need for a bad-block-relocating layer in md or dm?

What about corrected-error counts?  Drives provide them with SMART.
The SCSI layer could provide some as well.  Md can do a similar thing
to some extent.  Where these are actually useful predictors of pending
failure is unclear, but there could be some value.
e.g. after a certain number of recovered errors raid5 could trigger a
background consistency check, or a filesystem could trigger a
background fsck should it support that.
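From userspace, something along these lines could approximate that today
(rough sketch; the device names, SMART attribute and threshold are
assumptions):

    #!/bin/sh
    # If a member drive reports reallocated sectors, kick off an md consistency check.
    REALLOC=$(smartctl -A /dev/sda | awk '/Reallocated_Sector_Ct/ {print $10}')
    if [ "${REALLOC:-0}" -gt 0 ]; then
        echo check > /sys/block/md0/md/sync_action
    fi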


Lots of interesting questions... not so many answers.

NeilBrown


Re: end to end error recovery musings

2007-02-25 Thread Douglas Gilbert
H. Peter Anvin wrote:
 Ric Wheeler wrote:

 We still have the following challenges:

(1) read-ahead often means that we will retry every bad sector at
 least twice from the file system level.  The first time, the fs read
 ahead request triggers a speculative read that includes the bad sector
 (triggering the error handling mechanisms), right before the real
 application read does the same thing.  Not sure what the
 answer is here since read-ahead is obviously a huge win in the normal
 case.

 
 Probably the only sane thing to do is to remember the bad sectors and
 avoid attempting reading them; that would mean marking automatic
 versus explicitly requested requests to determine whether or not to
 filter them against a list of discovered bad blocks.

Some disks are doing their own read-ahead in the form
of a background media scan. Scans are done on request or
periodically (e.g. once per day or once per week) and we
have tools that can fetch the scan results from a disk
(e.g. a list of unreadable sectors). What we don't have
is any way to feed such information to a file system
that may be impacted.

Doug Gilbert

