I 100% agree with your philosophy here, and I suspect it's something shared in 
the Mesos community.

I just think that we can restrict the domain of the failure to a smaller, 
reasonable window -- once you are in the context of "I am doing work to launch 
a specific task", there is already a well-defined "success / failure / here is 
an error message" path.  Users expect tasks to fail and can see the errors.

I think that a lot of these assertions are in fact more appropriate as task 
failures.  But I agree that they should be fatal to *some* part of the system, 
just not to the agent as a whole.
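
To make that concrete, here is a rough sketch of the pattern I have in mind 
(hypothetical code, not the actual slave.cpp path; it assumes the surrounding 
function returns a process::Future, as much of the launch path does):

    // Today: a failed filesystem call aborts the entire agent,
    // taking every other task on the box down with it.
    CHECK_SOME(os::touch(path));

    // Alternative: confine the failure to the task being launched
    // and surface the error through the normal task-failure path.
    Try<Nothing> touch = os::touch(path);
    if (touch.isError()) {
      return Failure(
          "Failed to touch '" + path + "': " + touch.error());
    }

The error then shows up as a failure of that one task, with a message the 
user can actually see, instead of crashing the whole agent.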

On Sep 1, 2015, at 4:33 PM, Marco Massenzio <ma...@mesosphere.io> wrote:

> That's one of those areas of discussion that is so likely to generate a 
> flame war that I'm hesitant to wade in :)
> 
> In general, I would agree with the sentiment expressed there:
> 
> > If the task fails, that is unfortunate, but not the end of the world. Other 
> > tasks should not be affected.
> 
> which is, in fact, to a large extent exactly what Mesos does; the example 
> given in MESOS-2684, as it happens, is for a "disk full" failure - carrying 
> on as if nothing had happened is only likely to lead to further (and worse) 
> disappointment.
> 
> The general philosophy back at Google (which certainly informs the design of 
> Borg [0]) was "fail early, fail hard", so that either (a) the service is 
> restarted and the root cause hopefully cleared, or (b) someone (who can 
> hopefully do something about it) is alerted.
> 
> I think it's ultimately a matter of scale: up to a few tens of servers, you 
> can assume there is some sort of 'log-monitor' that looks out for errors and 
> other anomalies and alerts humans, who will then take a look and possibly 
> apply some corrective action.  When you're up to hundreds or thousands of 
> nodes (definitely Mesos territory), that's not practical: the system should 
> either self-heal or crash-and-restart.
> 
> All this is to say that it's difficult to come up with a general *automated* 
> approach to unequivocally decide whether a failure is "fatal" or could just 
> be safely "ignored" (after appropriate error logging).  In general, when in 
> doubt it's probably safer to "noisily crash & restart" and rely on the 
> overall system's HA architecture to take care of replication and consistency 
> (along with an intelligent monitoring system that only alerts when some 
> failure threshold is exceeded).
> 
> From what I've seen so far (granted, still a novice here), it seems that 
> Mesos subscribes to this notion, assuming that Agent Nodes will come and go, 
> and Tasks usually survive a Slave restart, for a certain amount of time 
> anyway (obviously, if the physical h/w is the ultimate cause of the failure, 
> all bets are off).
> 
> Having said all that - if there are areas where we have been over-eager with 
> our CHECKs, we should definitely revisit them and make the agent more 
> crash-resistant.
> 
> [0] http://research.google.com/pubs/pub43438.html
> 
> Marco Massenzio
> Distributed Systems Engineer
> http://codetrips.com
> 
> On Mon, Aug 31, 2015 at 12:47 PM, Steven Schlansker 
> <sschlans...@opentable.com> wrote:
> 
> 
> On Aug 31, 2015, at 11:54 AM, Scott Rankin <sran...@motus.com> wrote:
> >
> > tag=mesos-slave[12858]:  F0831 09:37:29.838184 12898 slave.cpp:3354] 
> > CHECK_SOME(os::touch(path)): Failed to open file: No such file or directory
> 
> I reported a similar bug a while back:
> 
> https://issues.apache.org/jira/browse/MESOS-2684
> 
> This seems to be a class of bugs where filesystem operations that may fail 
> for unforeseen reasons are written as assertions that crash the whole 
> process, rather than failing only the affected task and communicating the 
> error reason back.