Yeah... interesting you should say that.

One of the things that MapR does is to monitor disk speeds and mark disks as
bad when they start to stand out from the background (in a bad way).  That
can be a bit pessimistic, but it is really best to get the bad apples out
early even at the cost of some false positives.
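As a rough sketch of that kind of background comparison (a guess at the
general shape of the approach, not MapR's actual algorithm; the disk names
and threshold below are made up), one could flag disks whose latency stands
out from the fleet using a robust z-score:

```python
def flag_slow_disks(latencies_ms, threshold=3.5):
    """latencies_ms: dict of disk id -> recent mean latency in ms.
    Returns disk ids whose modified z-score exceeds the threshold,
    i.e. disks markedly slower than the rest of the fleet."""
    values = sorted(latencies_ms.values())
    n = len(values)
    median = values[n // 2] if n % 2 else (values[n // 2 - 1] + values[n // 2]) / 2
    devs = sorted(abs(v - median) for v in values)
    mad = devs[n // 2] if n % 2 else (devs[n // 2 - 1] + devs[n // 2]) / 2
    if mad == 0:
        return set()  # fleet is uniform; nothing stands out
    # 0.6745 scales the MAD to be comparable to a standard deviation.
    return {d for d, v in latencies_ms.items()
            if 0.6745 * (v - median) / mad > threshold}

# One clearly slow disk among four healthy ones:
slow = flag_slow_disks({"sda": 5.0, "sdb": 5.2, "sdc": 4.9,
                        "sdd": 5.1, "sde": 40.0})
```

Note that only positive deviations are flagged, matching the slow-disk
failure mode; the pessimism mentioned above is tuned via the threshold.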

On Sun, Sep 11, 2011 at 12:00 AM, Dmitriy Lyubimov <[email protected]> wrote:

> Yep, I was at that presentation. I think they went on to say that at that
> scale it is just much more effective to get rid of these 2 percent than to
> keep sending them for triage and figure out what it is about them that
> makes their hardware fail.
>
> Very good presentation for any farm admin IMO.
>
> The essence is that while a Hadoop architect might say "let's assume
> there's a 4% chance a drive fails within a year", we should really assume
> that a drive will be painfully slow for some 1000 hours before it fails.
> That makes our life somewhat easier, since we should really concentrate on
> detecting hardware working below spec, which is hard on one hand, but if
> we succeed we avoid a lot of fatal failures for which we'd have zero
> warning.
>
> (Plus, Hadoop's speculative execution is a fairly weak hedge against
> slow-running hardware.)
>
> Sent from android tab
> On Sep 10, 2011 10:39 PM, "Ted Dunning" <[email protected]> wrote:
> > SMART indicators have limits. They generally have low false-positive
> > rates but very high false-negative rates. Put another way, if the alarm
> > goes off, a failure is imminent, but many failures come with no warning.
> >
> > The Google paper has good numbers on this.
> >
> > But server failures can be disk, other hardware, software, or
> > interconnect failures. It is difficult to predict truly sporadic
> > failures, but many failures are reasonably predictable. In the simplest
> > case, it is good to simply recognize a lemon machine when you see one. A
> > recent Facebook presentation (at the Hadoop Summit, I believe) claimed
> > that 30% or so of the trouble tickets came from 2% of the machines.
> >
> > On Sat, Sep 10, 2011 at 5:33 PM, Lance Norskog <[email protected]> wrote:
> >
> >> S.M.A.R.T. disks have gradual failure warnings.
> >>
> >> A disk failure in a RAID requires immediate attention: the numbers say
> >> that if you buy 3 disks from the same manufacturer's lot at the same
> >> time, keep them powered up at the same rate, and keep the disk heads
> >> wiggling at the same rate, they tend to die at the same time. Some
> >> people advocate RAID-6 (survives 2 failures) instead of RAID-5
> >> (survives 1 failure) for exactly this reason.
> >>
> >>
> >> On 9/10/11, Konstantin Shmakov <[email protected]> wrote:
> >> > "Prediction is very difficult, especially about the future" (Niels Bohr)
> >> >
> >> > I would first ask about evaluation technique: how would one verify
> >> > that the prediction "makes sense"? And second: how will the
> >> > prediction be used? One can predict that on average N servers will
> >> > fail within a month, or even narrow the prediction to a group of
> >> > servers with a higher probability of failure, but how will that
> >> > prediction be used? Future actions should affect what the model
> >> > predicts, so a model built on past "non-actionable" data may have
> >> > little predictive power for the future, if any.
> >> >
> >> > --Konstantin
> >> >
> >> > On Sat, Sep 10, 2011 at 1:38 AM, highpointe <[email protected]> wrote:
> >> >> I can't help but flog a dead horse but... Are you serious?
> >> >>
> >> >> The next server that goes down is the one your Zabbix alerts say,
> >> >> "Server X is down."
> >> >>
> >> >> Until then, do something productive dammit.
> >> >>
> >> >> Sent from my iPhone
> >> >>
> >> >>> On Sep 10, 2011, at 1:10 AM, Lance Norskog <[email protected]> wrote:
> >> >>
> >> >>> Ah! The Butter-Side-Down Predictor.
> >> >>>
> >> >>> On Fri, Sep 9, 2011 at 10:38 PM, Matt Pinner <[email protected]> wrote:
> >> >>>
> >> >>>> Easy. The most important, least redundant, and single points of
> >> >>>> failure will fail next.
> >> >>>> On Sep 9, 2011 8:33 PM, "Mike Nute" <[email protected]> wrote:
> >> >>>>> IMO, the best approach would depend on your beliefs about the
> >> >>>>> survival curve of the server. If you believe the general hazard
> >> >>>>> rate is relatively constant (i.e. time-since-startup is not a
> >> >>>>> huge factor), you could make it into a basic time-series logistic
> >> >>>>> regression problem: let Y_i_t be 1 if server i fails at time t
> >> >>>>> and 0 if it does not, and let X_i_(t-1) be the vector of
> >> >>>>> measurements on server i at time (t-1). Then do a logistic
> >> >>>>> regression of Y on X. You could then add X_i_(t-2) to your
> >> >>>>> predictors and see if it adds accuracy, and so on with previous
> >> >>>>> time periods until they stop being predictive.
> >> >>>>>
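The lagged-regression setup above can be sketched on synthetic data (all
shapes, numbers, and the planted effect here are invented for illustration;
the fit is a bare-bones gradient ascent rather than a library call):

```python
import numpy as np

# cube[i, t, m] = measurement m on server i at time t;
# fails[i, t]   = 1 if server i failed at time t+1 (synthetic).
rng = np.random.default_rng(0)
n_servers, n_times, n_metrics = 50, 40, 3
cube = rng.normal(size=(n_servers, n_times, n_metrics))

# Planted ground truth: a high metric 0 at (t-1) raises failure odds at t.
logits = 3.0 * cube[:, :-1, 0] - 2.0
fails = rng.random((n_servers, n_times - 1)) < 1.0 / (1.0 + np.exp(-logits))

# Flatten (server, time) pairs into rows: predictors X_{i,t-1}, label Y_{i,t}.
X = cube[:, :-1, :].reshape(-1, n_metrics)
y = fails.reshape(-1).astype(float)

# Minimal logistic regression of Y on X by gradient ascent, no regularization.
w, b = np.zeros(n_metrics), 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w += 0.1 * X.T @ (y - p) / len(y)
    b += 0.1 * float((y - p).mean())

# w[0] should come out clearly positive, recovering the planted effect,
# while w[1] and w[2] stay near zero.
```

Adding X_{i,t-2} as Mike suggests is just more columns in X, built from
`cube[:, :-2, :]` with the labels shifted accordingly.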
> >> >>>>> That would also facilitate experimenting with transformations
> >> >>>>> like the change in certain measurements at (t-1), (t-2), etc.,
> >> >>>>> or interactions between certain measurements.
> >> >>>>>
> >> >>>>> If different failure classes are important, you could similarly
> >> >>>>> apply that to multinomial logistic regression.
> >> >>>>>
> >> >>>>> If the failure rate depends heavily on time since startup, you
> >> >>>>> could apply some kind of survival modeling technique, like a Cox
> >> >>>>> Proportional Hazards model, or incorporate some prior belief
> >> >>>>> about the shape of the survival curve. That could end up being
> >> >>>>> technically similar to the logistic regression above, but with a
> >> >>>>> more exotic link function and/or offset term. (I have a good
> >> >>>>> brief chapter on the CPH model from an old actuarial exam study
> >> >>>>> guide in PDF if you want it. Survival models are actuary
> >> >>>>> staples :-).)
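A full Cox model is best left to a stats package, but the survival-curve
framing can be illustrated with a plain Kaplan-Meier estimator (a simpler
technique than CPH, shown here only to make the censoring bookkeeping
concrete; the data below is invented):

```python
def kaplan_meier(durations, observed):
    """durations: time from startup to failure (or to censoring) per server;
    observed: 1 if the failure was seen, 0 if the server was censored
    (e.g. still alive, or decommissioned before failing).
    Returns a list of (time, estimated survival probability) points."""
    surv = 1.0
    curve = []
    for t in sorted(set(durations)):
        deaths = sum(1 for d, o in zip(durations, observed) if d == t and o)
        at_risk = sum(1 for d in durations if d >= t)
        if deaths:
            # Survival drops by the fraction of at-risk servers that failed.
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
    return curve

# Five servers: failures at t=1, 2, 3; censored at t=2 and t=4.
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
```

The estimated curve's shape (constant vs. rising hazard) is what decides
between the plain logistic regression and the survival-model route above.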
> >> >>>>>
> >> >>>>> Hope that helps.
> >> >>>>>
> >> >>>>> Mike Nute
> >> >>>>>
> >> >>>>>
> >> >>>>> ------Original Message------
> >> >>>>> From: Lance Norskog
> >> >>>>> To: user
> >> >>>>> ReplyTo: [email protected]
> >> >>>>> Subject: Predictive analysis problem
> >> >>>>> Sent: Sep 9, 2011 10:45 PM
> >> >>>>>
> >> >>>>> Let's say you manage 2000 servers in a huge datacenter. You have
> >> >>>>> regularly sampled stats, with uniform methods: aka, they are all
> >> >>>>> sampled the same way across all servers across the full time
> >> >>>>> series. This data is a cube of (server X time X measurement
> >> >>>>> type), with a measurement in each cell.
> >> >>>>>
> >> >>>>> You also have a time series of system failures, a matrix of
> >> >>>>> server X failure class. What algorithm will predict which server
> >> >>>>> will fail next, and when and how?
> >> >>>>>
> >> >>>>> --
> >> >>>>> Lance Norskog
> >> >>>>> [email protected]
> >> >>>>>
> >> >>>>>
> >> >>>>
> >> >>>
> >> >>>
> >> >>>
> >> >>> --
> >> >>> Lance Norskog
> >> >>> [email protected]
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> > ksh:
> >> >
> >>
> >>
> >> --
> >> Lance Norskog
> >> [email protected]
> >>
>
