That's one of those areas of discussion that is so likely to generate a flame war that I'm hesitant to wade in :)
In general, I would agree with the sentiment expressed there:

> If the task fails, that is unfortunate, but not the end of the world. Other tasks should not be affected.

which is, in fact, to a large extent exactly what Mesos does. The example given in MESOS-2684, as it happens, is a "disk full" failure: carrying on as if nothing had happened is only likely to lead to further (and worse) disappointment.

The general philosophy back at Google (and which certainly informs the design of Borg[0]) was "fail early, fail hard" so that either (a) the service is restarted and hopefully the root cause cleared or (b) someone (who can hopefully do something) will be alerted about it.

I think it's ultimately a matter of scale: up to a few tens of servers, you can assume there is some sort of 'log-monitor' that looks out for errors and other anomalies and alerts humans, who will then take a look and possibly apply some corrective action. When you're up to hundreds or thousands (definitely Mesos territory) that's not practical: the system should either self-heal or crash-and-restart.

All this to say that it's difficult to come up with a general *automated* approach to unequivocally decide whether a failure is "fatal" or could just be safely "ignored" (after appropriate error logging). In general, when in doubt it's probably safer to "noisily crash & restart" and rely on the overall system's HA architecture to take care of replication and consistency (and on an intelligent monitoring system that only alerts when some failure threshold is exceeded).

From what I've seen so far (granted, still a novice here) it seems that Mesos subscribes to this notion, assuming that Agent Nodes will come and go, and usually Tasks survive (for a certain amount of time anyway) a Slave restart (obviously, if the physical h/w is the ultimate cause of failure, well, then all bets are off).
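To make the "fatal vs. safely ignorable" distinction a bit more concrete, here is a rough Python sketch. It is not Mesos code: `prepare_task`, `FatalAgentError` and the set of "node-level" errno values are all hypothetical names and choices of mine, picked only to illustrate one possible policy (fail just the task for task-scoped errors, crash noisily for things like a full disk).

```python
import errno
import os
import tempfile

# Assumption: errors that suggest the whole node is unhealthy, not just one task.
NODE_LEVEL_ERRNOS = {errno.ENOSPC, errno.EIO, errno.EROFS}


class FatalAgentError(Exception):
    """Hypothetical: raised when carrying on would likely make things worse."""


def touch(path):
    """Create the file if it is missing; raises OSError on failure."""
    with open(path, "a"):
        os.utime(path, None)


def prepare_task(task_id, path):
    """Return (state, reason) for one task; escalate node-level failures."""
    try:
        touch(path)
    except OSError as e:
        if e.errno in NODE_LEVEL_ERRNOS:
            # "Fail early, fail hard": let a supervisor restart the process.
            raise FatalAgentError(f"{task_id}: {e}") from e
        # Task-scoped problem: fail only this task and report the reason.
        return ("TASK_FAILED", os.strerror(e.errno))
    return ("TASK_RUNNING", None)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(prepare_task("t1", os.path.join(d, "sandbox")))
        # The parent directory does not exist, so only this task fails.
        print(prepare_task("t2", os.path.join(d, "missing", "sandbox")))
```

The hard part, of course, is deciding what goes into that set of "node-level" errors; the sketch just hard-codes a guess, which is exactly the judgement call that is difficult to automate in general.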
Having said all that - if there are areas where we have been over-eager with our CHECKs, we should definitely revisit that and make it more crash-resistant, absolutely.

[0] http://research.google.com/pubs/pub43438.html

*Marco Massenzio*
*Distributed Systems Engineer*
http://codetrips.com

On Mon, Aug 31, 2015 at 12:47 PM, Steven Schlansker <[email protected]> wrote:

> > On Aug 31, 2015, at 11:54 AM, Scott Rankin <[email protected]> wrote:
> >
> > tag=mesos-slave[12858]: F0831 09:37:29.838184 12898 slave.cpp:3354] CHECK_SOME(os::touch(path)): Failed to open file: No such file or directory
>
> I reported a similar bug a while back:
>
> https://issues.apache.org/jira/browse/MESOS-2684
>
> This seems to be a class of bugs where some filesystem operations which may fail for unforeseen reasons are written as assertions which crash the process, rather than failing only the task and communicating back the error reason.

