Hi Marco, I certainly don’t want to start a flame war, and I actually realized after I added my comment to MESOS-2684 that it’s not quite the same thing.
As far as I can tell, in our situation, there’s no underlying disk issue. It seems like this is some sort of race condition (maybe?) with docker containers and executors shutting down. I’m perfectly happy with Mesos choosing to shut down in the case of a failure or unexpected situation – that’s a methodology that we adopt ourselves. I’m just trying to get a little more information about what the underlying issue is so that we can resolve it. I don’t know enough about Mesos internals to be able to answer that question just yet.

It’s also inconvenient because, while Mesos is well-behaved and restarts gracefully, as of 0.22.1, it’s not recovering the Docker executors – so a mesos-slave crash also brings down applications.

Thanks,
Scott

From: Marco Massenzio
Reply-To: user@mesos.apache.org
Date: Tuesday, September 1, 2015 at 7:33 PM
To: user@mesos.apache.org
Subject: Re: mesos-slave crashing with CHECK_SOME

That's one of those areas of discussion that is so likely to generate a flame war that I'm hesitant to wade in :)

In general, I would agree with the sentiment expressed there:

> If the task fails, that is unfortunate, but not the end of the world. Other
> tasks should not be affected.

which is, in fact, to a large extent exactly what Mesos does; the example given in MESOS-2684, as it happens, is for a "disk full failure" - carrying on as if nothing had happened is only likely to lead to further (and worse) disappointment.

The general philosophy back at Google (and which certainly informs the design of Borg[0]) was "fail early, fail hard" so that either (a) the service is restarted and hopefully the root cause cleared or (b) someone (who can hopefully do something) will be alerted about it.

I think it's ultimately a matter of scale: up to a few tens of servers, you can assume there is some sort of 'log-monitor' that looks out for errors and other anomalies and alerts humans that will then take a look and possibly apply some corrective action - when you're up to hundreds or thousands (definitely Mesos territory) that's not practical: the system should either self-heal or crash-and-restart.

All this to say that it's difficult to come up with a general *automated* approach to unequivocally decide if a failure is "fatal" or could just be safely "ignored" (after appropriate error logging) - in general, when in doubt it's probably safer to "noisily crash & restart" and rely on the overall system's HA architecture to take care of replication and consistency (and on an intelligent monitoring system that only alerts when some failure threshold is exceeded).

From what I've seen so far (granted, still a novice here) it seems that Mesos subscribes to this notion, assuming that Agent Nodes will come and go, and usually Tasks survive (for a certain amount of time anyway) a Slave restart (obviously, if the physical h/w is the ultimate cause of failure, well, then all bets are off).

Having said all that - if there are areas where we have been over-eager with our CHECKs, we should definitely revisit that and make it more crash-resistant, absolutely.
[0] http://research.google.com/pubs/pub43438.html

Marco Massenzio
Distributed Systems Engineer
http://codetrips.com

On Mon, Aug 31, 2015 at 12:47 PM, Steven Schlansker <sschlans...@opentable.com> wrote:

On Aug 31, 2015, at 11:54 AM, Scott Rankin <sran...@motus.com> wrote:
>
> tag=mesos-slave[12858]: F0831 09:37:29.838184 12898 slave.cpp:3354]
> CHECK_SOME(os::touch(path)): Failed to open file: No such file or directory

I reported a similar bug a while back:

https://issues.apache.org/jira/browse/MESOS-2684

This seems to be a class of bugs where some filesystem operations which may fail for unforeseen reasons are written as assertions which crash the process, rather than failing only the task and communicating back the error reason.
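For anyone following along, here is a rough sketch of the distinction discussed in this thread: treating a failed filesystem operation as an assertion that takes down the whole process, versus failing only the affected task and reporting the reason. This is not Mesos code. It is plain Python, and every name in it (`touch`, `touch_or_abort`, `touch_or_fail_task`, the result dictionary) is hypothetical and chosen only for illustration.

```python
import os
import tempfile
from typing import Optional


def touch(path: str) -> Optional[str]:
    """Create the file if it is missing. Return an error message on failure, else None."""
    try:
        with open(path, "a"):
            pass
        return None
    except OSError as e:
        return f"Failed to open file: {e.strerror}"


def touch_or_abort(path: str) -> None:
    """Assertion style (hypothetical): any failure terminates the whole process."""
    error = touch(path)
    if error is not None:
        raise SystemExit(f"CHECK failed: {error}")


def touch_or_fail_task(path: str, task_id: str) -> dict:
    """Task-scoped style (hypothetical): only this task is marked as failed."""
    error = touch(path)
    if error is not None:
        return {"task": task_id, "state": "FAILED", "reason": error}
    return {"task": task_id, "state": "OK", "reason": None}


if __name__ == "__main__":
    sandbox = tempfile.mkdtemp()
    # The parent directory does not exist, so opening the file fails.
    missing = os.path.join(sandbox, "no-such-dir", "marker")
    print(touch_or_fail_task(missing, "task-1"))
    print(touch_or_fail_task(os.path.join(sandbox, "marker"), "task-2"))
```

The first style matches the "fail early, fail hard" argument: the process stops and a supervisor restarts it. The second keeps other tasks running, but the caller has to decide whether the reported failure is actually safe to continue past, which is the hard part raised earlier in the thread.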