If you could show the content of path in CHECK_SOME, it would be easier to debug here. According to the log in https://groups.google.com/forum/#!topic/marathon-framework/oKXhfQUcoMQ and the 0.22.1 code:
    const string& path = paths::getExecutorSentinelPath(
        metaDir, info.id(), framework->id, executor->id, executor->containerId);

    framework->id ==> 20141209-011108-1378273290-5050-23221-0001
    executor->id  ==> tools.1d52eed8-062c-11e5-90d3-f2a3161ca8ab

metaDir can be obtained from your slave work_dir, and info.id() is your slave id. Could you see the executor->containerId in the complete slave log? And if you can reproduce this problem every time, it would be very helpful if you added a trace log to the slave and recompiled it.

On Thu, Sep 3, 2015 at 12:49 AM, Tim Chen <t...@mesosphere.io> wrote:

> Hi Scott,
>
> I wonder if you can try the latest Mesos and see if you can repro this?
>
> And if it is, can you put down the example task and steps? I couldn't see
> a disk full in your slave log, so I'm not sure if it's exactly the same
> problem as MESOS-2684.
>
> Tim
>
> On Wed, Sep 2, 2015 at 5:15 AM, Scott Rankin <sran...@motus.com> wrote:
>
>> Hi Marco,
>>
>> I certainly don’t want to start a flame war, and I actually realized
>> after I added my comment to MESOS-2684 that it’s not quite the same thing.
>>
>> As far as I can tell, in our situation, there’s no underlying disk
>> issue. It seems like this is some sort of race condition (maybe?) with
>> Docker containers and executors shutting down. I’m perfectly happy with
>> Mesos choosing to shut down in the case of a failure or unexpected
>> situation – that’s a methodology we adopt ourselves. I’m just trying to
>> get a little more information about what the underlying issue is so that
>> we can resolve it. I don’t know enough about Mesos internals to be able
>> to answer that question just yet.
>>
>> It’s also inconvenient because, while Mesos is well-behaved and restarts
>> gracefully, as of 0.22.1 it’s not recovering the Docker executors – so a
>> mesos-slave crash also brings down applications.
>>
>> Thanks,
>> Scott
>>
>> From: Marco Massenzio
>> Reply-To: "user@mesos.apache.org"
>> Date: Tuesday, September 1, 2015 at 7:33 PM
>> To: "user@mesos.apache.org"
>> Subject: Re: mesos-slave crashing with CHECK_SOME
>>
>> That's one of those areas for discussion that is so likely to generate a
>> flame war that I'm hesitant to wade in :)
>>
>> In general, I would agree with the sentiment expressed there:
>>
>> > If the task fails, that is unfortunate, but not the end of the world.
>> > Other tasks should not be affected.
>>
>> which is, in fact, to a large extent exactly what Mesos does; the example
>> given in MESOS-2684, as it happens, is for a "disk full" failure;
>> carrying on as if nothing had happened is only likely to lead to further
>> (and worse) disappointment.
>>
>> The general philosophy back at Google (and one which certainly informs
>> the design of Borg [0]) was "fail early, fail hard", so that either (a)
>> the service is restarted and hopefully the root cause cleared, or (b)
>> someone (who can hopefully do something) is alerted about it.
>>
>> I think it's ultimately a matter of scale: up to a few tens of servers,
>> you can assume there is some sort of 'log monitor' that looks out for
>> errors and other anomalies and alerts humans, who will then take a look
>> and possibly apply some corrective action. When you're up to hundreds or
>> thousands of servers (definitely Mesos territory), that's not practical:
>> the system should either self-heal or crash-and-restart.
>>
>> All this to say that it's difficult to come up with a general *automated*
>> approach to unequivocally decide whether a failure is "fatal" or could
>> just be safely "ignored" (after appropriate error logging). In general,
>> when in doubt it's probably safer to "noisily crash & restart" and rely
>> on the overall system's HA architecture to take care of replication and
>> consistency (and an intelligent monitoring system that only alerts when
>> some failure threshold is exceeded).
>>
>> From what I've seen so far (granted, still a novice here), it seems that
>> Mesos subscribes to this notion, assuming that Agent Nodes will come and
>> go, and usually Tasks survive a Slave restart (for a certain amount of
>> time, anyway); obviously, if the physical h/w is the ultimate cause of
>> the failure, then all bets are off.
>>
>> Having said all that - if there are areas where we have been over-eager
>> with our CHECKs, we should definitely revisit that and make Mesos more
>> crash-resistant, absolutely.
>>
>> [0] http://research.google.com/pubs/pub43438.html
>>
>> Marco Massenzio
>> Distributed Systems Engineer
>> http://codetrips.com
>>
>> On Mon, Aug 31, 2015 at 12:47 PM, Steven Schlansker
>> <sschlans...@opentable.com> wrote:
>>
>>> On Aug 31, 2015, at 11:54 AM, Scott Rankin <sran...@motus.com> wrote:
>>> >
>>> > tag=mesos-slave[12858]: F0831 09:37:29.838184 12898 slave.cpp:3354]
>>> > CHECK_SOME(os::touch(path)): Failed to open file: No such file or
>>> > directory
>>>
>>> I reported a similar bug a while back:
>>>
>>> https://issues.apache.org/jira/browse/MESOS-2684
>>>
>>> This seems to be a class of bugs where filesystem operations that may
>>> fail for unforeseen reasons are written as assertions that crash the
>>> process, rather than failing only the task and communicating back the
>>> error reason.
>>>
>>
>
--
Best Regards,
Haosdent Huang