Re: mesos-slave crashing with CHECK_SOME

Marco Massenzio Tue, 01 Sep 2015 16:34:39 -0700

That's one of those discussion topics so likely to start a flame war
that I'm hesitant to wade in :)

In general, I would agree with the sentiment expressed there:

> If the task fails, that is unfortunate, but not the end of the world.
> Other tasks should not be affected.

which is, in fact, to a large extent exactly what Mesos does; the example
given in MESOS-2684, as it happens, is for a "disk full failure" - carrying
on as if nothing had happened is only likely to lead to further (and
worse) disappointment.

The general philosophy back at Google (and which certainly informs the
design of Borg[0]) was "fail early, fail hard" so that either (a) the
service is restarted and hopefully the root cause cleared or (b) someone
(who can hopefully do something) will be alerted about it.

I think it's ultimately a matter of scale: up to a few tens of servers, you
can assume there is some sort of 'log-monitor' that looks out for errors
and other anomalies and alerts humans that will then take a look and
possibly apply some corrective action - when you're up to hundreds or
thousands (definitely Mesos territory) that's not practical: the system
should either self-heal or crash-and-restart.

All this to say that it's difficult to come up with a general *automated*
approach to unequivocally decide if a failure is "fatal" or could just be
safely "ignored" (after appropriate error logging) - in general, when in
doubt it's probably safer to "noisily crash & restart" and rely on the
overall system's HA architecture to take care of replication and
consistency (and on an intelligent monitoring system that only alerts when
some failure threshold is exceeded).
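
As a rough illustration of "only alert when some failure threshold is
exceeded", here is a minimal C++ sketch of a sliding-window restart counter.
It is an assumption of mine, not something Mesos or any particular
monitoring system ships: the RestartMonitor class, its parameters and the
use of plain seconds for timestamps are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical sliding-window counter (not part of Mesos): it records the
// time of each crash/restart and says whether a human should be alerted.
class RestartMonitor {
 public:
  RestartMonitor(std::size_t threshold, long windowSeconds)
      : threshold_(threshold), windowSeconds_(windowSeconds) {}

  // Records a restart at time `now` (in seconds). Returns true only when
  // the number of restarts inside the window exceeds the threshold.
  bool recordRestart(long now) {
    restarts_.push_back(now);
    // Forget restarts that have fallen out of the window.
    while (!restarts_.empty() && restarts_.front() <= now - windowSeconds_) {
      restarts_.pop_front();
    }
    return restarts_.size() > threshold_;
  }

 private:
  std::size_t threshold_;
  long windowSeconds_;
  std::deque<long> restarts_;
};
```

With RestartMonitor monitor(2, 60), the first two restarts inside a minute
stay silent and only the third raises an alert; an isolated restart much
later is silent again, which is the "crash noisily, page rarely" behaviour
described above.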

From what I've seen so far (granted, still a novice here) it seems that
Mesos subscribes to this notion, assuming that Agent Nodes will come and
go, and usually Tasks survive (for a certain amount of time anyway) a Slave
restart (obviously, if the physical h/w is the ultimate cause of failure,
well, then all bets are off).

Having said all that - if there are areas where we have been over-eager
with our CHECKs, we should definitely revisit that and make it more
crash-resistant, absolutely.
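
To make the trade-off concrete, here is a minimal C++ sketch of the two
styles under discussion. It is not Mesos code: touchFile, prepareTask,
prepareTaskOrDie and the TaskState/TaskResult types are hypothetical names
made up for illustration, with touchFile merely standing in for a
filesystem call like the os::touch(path) in the log line quoted below.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <string>

// Hypothetical result type: ok == true on success, otherwise `error`
// carries a human-readable reason.
struct TouchResult {
  bool ok;
  std::string error;
};

// Hypothetical stand-in for a filesystem call that may fail for
// unforeseen reasons (missing directory, full disk, ...).
TouchResult touchFile(const std::string& path) {
  std::ofstream file(path.c_str(), std::ios::app);
  if (!file.is_open()) {
    return TouchResult{false, "Failed to open file: " + path};
  }
  return TouchResult{true, ""};
}

enum class TaskState { RUNNING, FAILED };

struct TaskResult {
  TaskState state;
  std::string reason;
};

// "Fail only the task": the error is reported back to the caller and the
// process keeps serving other tasks.
TaskResult prepareTask(const std::string& path) {
  TouchResult result = touchFile(path);
  if (!result.ok) {
    return TaskResult{TaskState::FAILED, result.error};
  }
  return TaskResult{TaskState::RUNNING, ""};
}

// "Fail early, fail hard": any failure aborts the whole process, on the
// assumption that a supervisor restarts it and someone gets alerted.
void prepareTaskOrDie(const std::string& path) {
  TouchResult result = touchFile(path);
  if (!result.ok) {
    std::fprintf(stderr, "CHECK failed: %s\n", result.error.c_str());
    std::abort();
  }
}
```

Calling prepareTask with a path inside a directory that does not exist
yields a FAILED result with a reason attached, whereas prepareTaskOrDie on
the same path takes the whole process down. Which of the two is right
depends on whether the failure is local to the task or a symptom of a sick
host, which is exactly the judgment call that is hard to automate.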

[0] http://research.google.com/pubs/pub43438.html

*Marco Massenzio*

*Distributed Systems Engineer*
http://codetrips.com

On Mon, Aug 31, 2015 at 12:47 PM, Steven Schlansker <
[email protected]> wrote:

>
>
> On Aug 31, 2015, at 11:54 AM, Scott Rankin <[email protected]> wrote:
> >
> > tag=mesos-slave[12858]:  F0831 09:37:29.838184 12898 slave.cpp:3354]
> > CHECK_SOME(os::touch(path)): Failed to open file: No such file or directory
>
> I reported a similar bug a while back:
>
> https://issues.apache.org/jira/browse/MESOS-2684
>
> This seems to be a class of bugs where some filesystem operations which
> may fail for unforeseen reasons are written as assertions which crash the
> process, rather than failing only the task and communicating back the error
> reason.
>
>
>
