Hi Scott, I wonder if you can try the latest Mesos and see if you can reproduce this?
And if it does reproduce, can you write down the example task and the steps? I couldn't see a disk-full error in your slave log, so I'm not sure it's exactly the same problem as MESOS-2684.

Tim

On Wed, Sep 2, 2015 at 5:15 AM, Scott Rankin <sran...@motus.com> wrote:
> Hi Marco,
>
> I certainly don’t want to start a flame war, and I actually realized after
> I added my comment to MESOS-2684 that it’s not quite the same thing.
>
> As far as I can tell, in our situation, there’s no underlying disk issue.
> It seems like this is some sort of race condition (maybe?) with Docker
> containers and executors shutting down. I’m perfectly happy with Mesos
> choosing to shut down in the case of a failure or unexpected situation –
> that’s a methodology that we adopt ourselves. I’m just trying to get a
> little more information about what the underlying issue is so that we can
> resolve it. I don’t know enough about Mesos internals to be able to answer
> that question just yet.
>
> It’s also inconvenient because, while Mesos is well-behaved and restarts
> gracefully, as of 0.22.1 it’s not recovering the Docker executors – so a
> mesos-slave crash also brings down applications.
>
> Thanks,
> Scott
>
> From: Marco Massenzio
> Reply-To: "user@mesos.apache.org"
> Date: Tuesday, September 1, 2015 at 7:33 PM
> To: "user@mesos.apache.org"
> Subject: Re: mesos-slave crashing with CHECK_SOME
>
> That's one of those areas for discussion that is so likely to generate a
> flame war that I'm hesitant to wade in :)
>
> In general, I would agree with the sentiment expressed there:
>
> > If the task fails, that is unfortunate, but not the end of the world.
> > Other tasks should not be affected.
>
> which is, in fact, to a large extent exactly what Mesos does; the example
> given in MESOS-2684, as it happens, is for a "disk full" failure - carrying
> on as if nothing had happened is only likely to lead to further (and
> worse) disappointment.
>
> The general philosophy back at Google (which certainly informs the
> design of Borg[0]) was "fail early, fail hard", so that either (a) the
> service is restarted and hopefully the root cause cleared, or (b) someone
> (who can hopefully do something) will be alerted about it.
>
> I think it's ultimately a matter of scale: up to a few tens of servers,
> you can assume there is some sort of 'log-monitor' that looks out for
> errors and other anomalies and alerts humans, who will then take a look and
> possibly apply some corrective action - when you're up to hundreds or
> thousands (definitely Mesos territory), that's not practical: the system
> should either self-heal or crash-and-restart.
>
> All this to say that it's difficult to come up with a general *automated*
> approach to unequivocally decide whether a failure is "fatal" or could just
> be safely "ignored" (after appropriate error logging) - in general, when in
> doubt, it's probably safer to "noisily crash & restart" and rely on the
> overall system's HA architecture to take care of replication and
> consistency
> (and an intelligent monitoring system that only alerts when some failure
> threshold is exceeded).
>
> From what I've seen so far (granted, still a novice here) it seems that
> Mesos subscribes to this notion, assuming that Agent Nodes will come and
> go, and usually Tasks survive (for a certain amount of time, anyway) a
> Slave restart (obviously, if the physical h/w is the ultimate cause of
> failure, well, then all bets are off).
>
> Having said all that - if there are areas where we have been over-eager
> with our CHECKs, we should definitely revisit them and make the code more
> crash-resistant, absolutely.
>
> [0] http://research.google.com/pubs/pub43438.html
>
> *Marco Massenzio*
> *Distributed Systems Engineer http://codetrips.com <http://codetrips.com>*
>
> On Mon, Aug 31, 2015 at 12:47 PM, Steven Schlansker <
> sschlans...@opentable.com> wrote:
>>
>> On Aug 31, 2015, at 11:54 AM, Scott Rankin <sran...@motus.com> wrote:
>> >
>> > tag=mesos-slave[12858]: F0831 09:37:29.838184 12898 slave.cpp:3354]
>> > CHECK_SOME(os::touch(path)): Failed to open file: No such file or directory
>>
>> I reported a similar bug a while back:
>>
>> https://issues.apache.org/jira/browse/MESOS-2684
>>
>> This seems to be a class of bugs where some filesystem operations which
>> may fail for unforeseen reasons are written as assertions which crash the
>> process, rather than failing only the task and communicating back the
>> error reason.