Re: mesos-slave crashing with CHECK_SOME

Scott Rankin Wed, 02 Sep 2015 05:16:30 -0700

Hi Marco,

I certainly don’t want to start a flame war, and I actually realized after I 
added my comment to MESOS-2684 that it’s not quite the same thing.

As far as I can tell, in our situation, there’s no underlying disk issue.  It 
seems like this is some sort of race condition (maybe?) with docker containers 
and executors shutting down.  I’m perfectly happy with Mesos choosing to shut 
down in the case of a failure or unexpected situation – that’s a methodology 
that we adopt ourselves.  I’m just trying to get a little more information 
about what the underlying issue is so that we can resolve it. I don’t know 
enough about Mesos internals to be able to answer that question just yet.

It’s also inconvenient because, while Mesos is well-behaved and restarts 
gracefully, as of 0.22.1, it’s not recovering the Docker executors – so a 
mesos-slave crash also brings down applications.

Thanks,
Scott

From: Marco Massenzio
Reply-To: "user@mesos.apache.org"
Date: Tuesday, September 1, 2015 at 7:33 PM
To: "user@mesos.apache.org"
Subject: Re: mesos-slave crashing with CHECK_SOME

That's one of those areas for discussions that is so likely to generate a flame 
war that I'm hesitant to wade in :)

In general, I would agree with the sentiment expressed there:

> If the task fails, that is unfortunate, but not the end of the world. Other 
> tasks should not be affected.

which is, in fact, to a large extent exactly what Mesos does. The example given 
in MESOS-2684, as it happens, is a "disk full" failure: carrying on as if 
nothing had happened is only likely to lead to further (and worse) 
disappointment.

The general philosophy back at Google (and which certainly informs the design 
of Borg[0]) was "fail early, fail hard" so that either (a) the service is 
restarted and hopefully the root cause cleared or (b) someone (who can 
hopefully do something) will be alerted about it.

I think it's ultimately a matter of scale: up to a few tens of servers, you can 
assume there is some sort of 'log-monitor' that looks out for errors and other 
anomalies and alerts humans that will then take a look and possibly apply some 
corrective action - when you're up to hundreds or thousands (definitely Mesos 
territory) that's not practical: the system should either self-heal or 
crash-and-restart.

All this to say that it's difficult to come up with a general *automated* 
approach to unequivocally decide if a failure is "fatal" or could just be 
safely "ignored" (after appropriate error logging). In general, when in doubt, 
it's probably safer to "noisily crash & restart" and rely on the overall 
system's HA architecture to take care of replication and consistency 
(along with an intelligent monitoring system that only alerts when some 
failure threshold is exceeded).
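
As a rough illustration of threshold-based alerting on top of crash-and-restart, 
here is a minimal C++ sketch. It is not part of Mesos or of any monitoring 
product; the class name RestartMonitor, its parameters, and the sliding-window 
policy are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical monitor: individual crashes are tolerated (the process is
// simply restarted), and an alert is raised only when more than `threshold`
// crashes fall inside a sliding window of `windowSeconds`.
class RestartMonitor {
 public:
  RestartMonitor(std::size_t threshold, long windowSeconds)
      : threshold_(threshold), windowSeconds_(windowSeconds) {
    assert(windowSeconds > 0);  // Misconfiguration is a programming error.
  }

  // Records a crash observed at time `now` (in seconds) and returns true if
  // a human should be alerted.
  bool recordCrash(long now) {
    crashes_.push_back(now);
    // Forget crashes that have aged out of the window.
    while (!crashes_.empty() && now - crashes_.front() >= windowSeconds_) {
      crashes_.pop_front();
    }
    return crashes_.size() > threshold_;
  }

 private:
  std::size_t threshold_;
  long windowSeconds_;
  std::deque<long> crashes_;
};
```

With a threshold of 2 and a 60-second window, two crashes in quick succession 
stay silent, a third within the window raises an alert, and a crash long after 
the window has passed is treated as an isolated event again.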

From what I've seen so far (granted, still a novice here) it seems that Mesos 
subscribes to this notion, assuming that Agent Nodes will come and go and that 
Tasks usually survive a Slave restart (for a certain amount of time, anyway). 
Obviously, if the physical h/w is the ultimate cause of failure, then all bets 
are off.

Having said all that - if there are areas where we have been over-eager with 
our CHECKs, we should definitely revisit that and make it more crash-resistant, 
absolutely.

[0] http://research.google.com/pubs/pub43438.html

Marco Massenzio
Distributed Systems Engineer
http://codetrips.com

On Mon, Aug 31, 2015 at 12:47 PM, Steven Schlansker 
<sschlans...@opentable.com> wrote:

On Aug 31, 2015, at 11:54 AM, Scott Rankin 
<sran...@motus.com> wrote:
>
> tag=mesos-slave[12858]:  F0831 09:37:29.838184 12898 slave.cpp:3354] 
> CHECK_SOME(os::touch(path)): Failed to open file: No such file or directory

I reported a similar bug a while back:

https://issues.apache.org/jira/browse/MESOS-2684

This seems to be a class of bugs where some filesystem operations which may 
fail for unforeseen reasons are written as assertions which crash the process, 
rather than failing only the task and communicating back the error reason.
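
To make the distinction concrete, here is a minimal C++ sketch contrasting the 
two styles. It does not use the Mesos or stout APIs; touchFile, touchOrAbort, 
touchOrFailTask, TaskState and TaskResult are hypothetical names invented for 
this example, and the real slave.cpp code will differ.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <string>

// Hypothetical stand-in for a "touch" helper: creates the file if it is
// missing. Returns an empty string on success, or an error description.
std::string touchFile(const std::string& path) {
  assert(!path.empty());  // An empty path is a caller bug, not an I/O failure.
  std::FILE* file = std::fopen(path.c_str(), "a");
  if (file == nullptr) {
    return std::string("Failed to open file: ") + std::strerror(errno);
  }
  std::fclose(file);
  return "";
}

// Assertion style: any filesystem error takes down the whole process,
// and with it every task the process was managing.
void touchOrAbort(const std::string& path) {
  const std::string error = touchFile(path);
  if (!error.empty()) {
    std::fprintf(stderr, "Check failed: %s\n", error.c_str());
    std::abort();
  }
}

enum class TaskState { RUNNING, FAILED };

struct TaskResult {
  TaskState state;
  std::string reason;  // Empty unless state == FAILED.
};

// Task-scoped style: the error fails only the affected task, and the reason
// is returned so it can be reported back to whoever launched the task.
TaskResult touchOrFailTask(const std::string& path) {
  const std::string error = touchFile(path);
  if (!error.empty()) {
    return TaskResult{TaskState::FAILED, error};
  }
  return TaskResult{TaskState::RUNNING, ""};
}
```

Calling touchOrFailTask with a path whose parent directory does not exist 
yields a FAILED result carrying the error text, whereas touchOrAbort on the 
same path would terminate the process.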

