I 100% agree with your philosophy here, and I suspect it's something shared in the Mesos community.
I just think that we can restrict the domain of the failure to a smaller, reasonable window -- once you are in the context of "I am doing work to launch a specific task", there is already a well-defined "success / failure / here is an error message" path. Users expect tasks to fail and can see the errors. I think that a lot of these assertions are in fact more appropriate as task failures. But I agree that they should be fatal to *some* part of the system, just not the whole agent entirely.

On Sep 1, 2015, at 4:33 PM, Marco Massenzio <ma...@mesosphere.io> wrote:

> That's one of those areas for discussion that is so likely to generate a
> flame war that I'm hesitant to wade in :)
>
> In general, I would agree with the sentiment expressed there:
>
> > If the task fails, that is unfortunate, but not the end of the world. Other
> > tasks should not be affected.
>
> which is, in fact, to a large extent exactly what Mesos does; the example given
> in MESOS-2684, as it happens, is for a "disk full" failure - carrying on as
> if nothing had happened is only likely to lead to further (and worse)
> disappointment.
>
> The general philosophy back at Google (and which certainly informs the design
> of Borg [0]) was "fail early, fail hard", so that either (a) the service is
> restarted and hopefully the root cause cleared, or (b) someone (who can
> hopefully do something) will be alerted about it.
>
> I think it's ultimately a matter of scale: up to a few tens of servers, you
> can assume there is some sort of 'log-monitor' that looks out for errors and
> other anomalies and alerts humans, who will then take a look and possibly
> apply some corrective action - when you're up to hundreds or thousands
> (definitely Mesos territory) that's not practical: the system should either
> self-heal or crash-and-restart.
> All this to say that it's difficult to come up with a general *automated*
> approach to unequivocally decide if a failure is "fatal" or could just be
> safely "ignored" (after appropriate error logging) - in general, when in
> doubt it's probably safer to "noisily crash & restart" and rely on the
> overall system's HA architecture to take care of replication and consistency
> (and an intelligent monitoring system that only alerts when some failure
> threshold is exceeded).
>
> From what I've seen so far (granted, still a novice here) it seems that Mesos
> subscribes to this notion, assuming that Agent Nodes will come and go, and
> usually Tasks survive (for a certain amount of time, anyway) a Slave restart
> (obviously, if the physical h/w is the ultimate cause of failure, well, then
> all bets are off).
>
> Having said all that - if there are areas where we have been over-eager with
> our CHECKs, we should definitely revisit that and make them more
> crash-resistant, absolutely.
>
> [0] http://research.google.com/pubs/pub43438.html
>
> Marco Massenzio
> Distributed Systems Engineer
> http://codetrips.com
>
> On Mon, Aug 31, 2015 at 12:47 PM, Steven Schlansker
> <sschlans...@opentable.com> wrote:
>
> > On Aug 31, 2015, at 11:54 AM, Scott Rankin <sran...@motus.com> wrote:
> >
> > > tag=mesos-slave[12858]: F0831 09:37:29.838184 12898 slave.cpp:3354]
> > > CHECK_SOME(os::touch(path)): Failed to open file: No such file or directory
> >
> > I reported a similar bug a while back:
> >
> > https://issues.apache.org/jira/browse/MESOS-2684
> >
> > This seems to be a class of bugs where some filesystem operations, which may
> > fail for unforeseen reasons, are written as assertions which crash the
> > process, rather than failing only the task and communicating back the error
> > reason.
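To make the contrast concrete, here is a minimal, self-contained sketch of the pattern being proposed. This is not actual Mesos code: `TryTouch`, `touchFile`, and `launchTask` are hypothetical stand-ins for stout's `Try<T>`, `os::touch`, and the agent's task-launch path. The point is that a filesystem failure is returned to the caller as a task failure with an error message, rather than `CHECK_SOME` aborting the whole agent.

```cpp
#include <cstdio>
#include <string>

// Hypothetical Try-like result (the real Mesos code uses Try<T> from stout).
struct TryTouch {
  bool failed;
  std::string message;
};

// Stand-in for os::touch(): reports failure as a value instead of asserting,
// so the caller decides how fatal the failure is.
TryTouch touchFile(const std::string& path) {
  FILE* f = std::fopen(path.c_str(), "a");
  if (f == nullptr) {
    return {true, "Failed to open file: " + path};
  }
  std::fclose(f);
  return {false, ""};
}

// Sketch of the proposed handling: instead of
//   CHECK_SOME(os::touch(path));   // aborts the entire agent on failure
// fail only this task and communicate the error back.
std::string launchTask(const std::string& path) {
  TryTouch result = touchFile(path);
  if (result.failed) {
    return "TASK_FAILED: " + result.message;  // task fails; agent lives on
  }
  return "TASK_RUNNING";
}
```

The same "no such file or directory" condition that crashed the agent in the log above would, under this shape, surface as a task status update that users already expect to handle.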