@Steven - agreed!
As mentioned, if we can reduce the "footprint of unnecessary CHECKs" (so to
speak) I'm all for it - let's document and add Jiras for that, by all means.
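
For anyone skimming the thread, here is a minimal sketch of the distinction being discussed. It is not Mesos code: `touch`, `TaskResult`, `launchOrAbort` and `launchOrFail` are hypothetical names made up for illustration, and the real CHECK_SOME / os::touch machinery differs in detail.

```cpp
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <optional>
#include <string>

// Hypothetical helper (not the Mesos os::touch): creates the file if it is
// missing and returns an error message on failure, or nothing on success.
std::optional<std::string> touch(const std::string& path) {
  std::ofstream file(path, std::ios::app);
  if (!file.is_open()) {
    return "Failed to open file '" + path + "'";
  }
  return std::nullopt;
}

// Hypothetical per-task outcome that could be reported back to the user.
struct TaskResult {
  bool ok;
  std::string message;
};

// Assertion style: a filesystem error takes down the whole process,
// along with every other task it was managing.
void launchOrAbort(const std::string& path) {
  if (auto error = touch(path)) {
    std::fprintf(stderr, "CHECK failed: %s\n", error->c_str());
    std::abort();
  }
}

// Task-failure style: the same error fails only this task and carries
// the reason back to the caller.
TaskResult launchOrFail(const std::string& path) {
  if (auto error = touch(path)) {
    return {false, *error};
  }
  return {true, ""};
}
```

With a path in a directory that does not exist, `launchOrFail` returns a failed result with a message, whereas `launchOrAbort` would terminate the process. Which of the two is appropriate for a given call site is exactly the judgment call debated below.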

@Scott - LoL: you certainly didn't; I was more worried my email would ;-)

Thanks, guys!

*Marco Massenzio*

*Distributed Systems Engineer*
http://codetrips.com

On Wed, Sep 2, 2015 at 10:59 AM, Steven Schlansker <
sschlans...@opentable.com> wrote:

> I 100% agree with your philosophy here, and I suspect it's something
> shared in the Mesos community.
>
> I just think that we can restrict the domain of the failure to a smaller
> reasonable window -- once you are in the context of "I am doing work to
> launch a specific task", there is a well defined "success / failure / here
> is an error message" path defined already.  Users expect tasks to fail and
> can see the errors.
>
> I think that a lot of these assertions are in fact more appropriate as
> task failures.  But I agree that they should be fatal to *some* part of the
> system, just not the whole agent entirely.
>
> On Sep 1, 2015, at 4:33 PM, Marco Massenzio <ma...@mesosphere.io> wrote:
>
> > That's one of those areas of discussion that is so likely to generate a
> > flame war that I'm hesitant to wade in :)
> >
> > In general, I would agree with the sentiment expressed there:
> >
> > > If the task fails, that is unfortunate, but not the end of the world.
> > > Other tasks should not be affected.
> >
> > which is, in fact, to a large extent exactly what Mesos does; the example
> > given in MESOS-2684, as it happens, is for a "disk full" failure: carrying
> > on as if nothing had happened is only likely to lead to further (and
> > worse) disappointment.
> >
> > The general philosophy back at Google (and which certainly informs the
> > design of Borg[0]) was "fail early, fail hard" so that either (a) the
> > service is restarted and hopefully the root cause cleared or (b) someone
> > (who can hopefully do something) will be alerted about it.
> >
> > I think it's ultimately a matter of scale: up to a few tens of servers,
> > you can assume there is some sort of 'log-monitor' that looks out for
> > errors and other anomalies and alerts humans that will then take a look and
> > possibly apply some corrective action - when you're up to hundreds or
> > thousands (definitely Mesos territory) that's not practical: the system
> > should either self-heal or crash-and-restart.
> >
> > All this to say that it's difficult to come up with a general
> > *automated* approach to unequivocally decide if a failure is "fatal" or
> > could just be safely "ignored" (after appropriate error logging) - in
> > general, when in doubt it's probably safer to "noisily crash & restart" and
> > rely on the overall system's HA architecture to take care of replication
> > and consistency (and on an intelligent monitoring system that only alerts
> > when some failure threshold is exceeded).
> >
> > From what I've seen so far (granted, still a novice here) it seems that
> > Mesos subscribes to this notion, assuming that Agent Nodes will come and
> > go, and usually Tasks survive (for a certain amount of time anyway) a Slave
> > restart (obviously, if the physical h/w is the ultimate cause of failure,
> > well, then all bets are off).
> >
> > Having said all that - if there are areas where we have been over-eager
> > with our CHECKs, we should definitely revisit them and make the code more
> > crash-resistant, absolutely.
> >
> > [0] http://research.google.com/pubs/pub43438.html
> >
> > Marco Massenzio
> > Distributed Systems Engineer
> > http://codetrips.com
> >
> > On Mon, Aug 31, 2015 at 12:47 PM, Steven Schlansker <sschlans...@opentable.com> wrote:
> >
> >
> > On Aug 31, 2015, at 11:54 AM, Scott Rankin <sran...@motus.com> wrote:
> > >
> > > tag=mesos-slave[12858]:  F0831 09:37:29.838184 12898 slave.cpp:3354] CHECK_SOME(os::touch(path)): Failed to open file: No such file or directory
> >
> > I reported a similar bug a while back:
> >
> > https://issues.apache.org/jira/browse/MESOS-2684
> >
> > This seems to be a class of bugs where some filesystem operations which
> > may fail for unforeseen reasons are written as assertions which crash the
> > process, rather than failing only the task and communicating back the error
> > reason.
> >
> >
> >
>
>
