@Steven - agreed! As mentioned, if we can reduce the "footprint of unnecessary CHECKs" (so to speak) I'm all for it - let's document and add Jiras for that, by all means.
@Scott - LoL: you certainly didn't; I was more worried my email would ;-) Thanks, guys!

*Marco Massenzio*
*Distributed Systems Engineer*
http://codetrips.com

On Wed, Sep 2, 2015 at 10:59 AM, Steven Schlansker <sschlans...@opentable.com> wrote:

> I 100% agree with your philosophy here, and I suspect it's something shared in the Mesos community.
>
> I just think that we can restrict the domain of the failure to a smaller, reasonable window -- once you are in the context of "I am doing work to launch a specific task", there is a well-defined "success / failure / here is an error message" path already. Users expect tasks to fail and can see the errors.
>
> I think that a lot of these assertions are in fact more appropriate as task failures. But I agree that they should be fatal to *some* part of the system, just not the whole agent entirely.
>
> On Sep 1, 2015, at 4:33 PM, Marco Massenzio <ma...@mesosphere.io> wrote:
> >
> > That's one of those areas for discussion that is so likely to generate a flame war that I'm hesitant to wade in :)
> >
> > In general, I would agree with the sentiment expressed there:
> >
> > > If the task fails, that is unfortunate, but not the end of the world. Other tasks should not be affected.
> >
> > which is, in fact, to a large extent exactly what Mesos does; the example given in MESOS-2684, as it happens, is for a "disk full" failure - carrying on as if nothing had happened is only likely to lead to further (and worse) disappointment.
> >
> > The general philosophy back at Google (and which certainly informs the design of Borg [0]) was "fail early, fail hard", so that either (a) the service is restarted and hopefully the root cause cleared, or (b) someone (who can hopefully do something) will be alerted about it.
> >
> > I think it's ultimately a matter of scale: up to a few tens of servers, you can assume there is some sort of 'log-monitor' that looks out for errors and other anomalies and alerts humans, who will then take a look and possibly apply some corrective action - when you're up to hundreds or thousands (definitely Mesos territory) that's not practical: the system should either self-heal or crash-and-restart.
> >
> > All this to say that it's difficult to come up with a general *automated* approach to unequivocally decide whether a failure is "fatal" or could just be safely "ignored" (after appropriate error logging) - in general, when in doubt it's probably safer to "noisily crash & restart" and rely on the overall system's HA architecture to take care of replication and consistency (and an intelligent monitoring system that only alerts when some failure threshold is exceeded).
> >
> > From what I've seen so far (granted, still a novice here) it seems that Mesos subscribes to this notion, assuming that Agent Nodes will come and go, and Tasks usually survive a Slave restart (for a certain amount of time, anyway); obviously, if the physical h/w is the ultimate cause of failure, well, then all bets are off.
> >
> > Having said all that - if there are areas where we have been over-eager with our CHECKs, we should definitely revisit that and make the code more crash-resistant, absolutely.
> >
> > [0] http://research.google.com/pubs/pub43438.html
> >
> > Marco Massenzio
> > Distributed Systems Engineer
> > http://codetrips.com
> >
> > On Mon, Aug 31, 2015 at 12:47 PM, Steven Schlansker <sschlans...@opentable.com> wrote:
> >
> > > On Aug 31, 2015, at 11:54 AM, Scott Rankin <sran...@motus.com> wrote:
> > >
> > > tag=mesos-slave[12858]: F0831 09:37:29.838184 12898 slave.cpp:3354] CHECK_SOME(os::touch(path)): Failed to open file: No such file or directory
> >
> > I reported a similar bug a while back:
> >
> > https://issues.apache.org/jira/browse/MESOS-2684
> >
> > This seems to be a class of bugs where some filesystem operations which may fail for unforeseen reasons are written as assertions which crash the process, rather than failing only the task and communicating back the error reason.