Hi Scott, I wonder if you can try the latest Mesos and see if you can reproduce this?
And if it does reproduce, can you write down the example task and the steps? I couldn't see a disk-full error in your slave log, so I'm not sure it's exactly the same problem as MESOS-2684.

Tim

On Wed, Sep 2, 2015 at 5:15 AM, Scott Rankin <sran...@motus.com> wrote:
> Hi Marco,
>
> I certainly don’t want to start a flame war, and I actually realized after
> I added my comment to MESOS-2684 that it’s not quite the same thing.
>
> As far as I can tell, in our situation, there’s no underlying disk issue.
> It seems like this is some sort of race condition (maybe?) with Docker
> containers and executors shutting down. I’m perfectly happy with Mesos
> choosing to shut down in the case of a failure or unexpected situation –
> that’s a methodology that we adopt ourselves. I’m just trying to get a
> little more information about what the underlying issue is so that we can
> resolve it. I don’t know enough about Mesos internals to be able to answer
> that question just yet.
>
> It’s also inconvenient because, while Mesos is well-behaved and restarts
> gracefully, as of 0.22.1 it’s not recovering the Docker executors – so a
> mesos-slave crash also brings down applications.
>
> Thanks,
> Scott
>
> From: Marco Massenzio
> Reply-To: "user@mesos.apache.org"
> Date: Tuesday, September 1, 2015 at 7:33 PM
> To: "user@mesos.apache.org"
> Subject: Re: mesos-slave crashing with CHECK_SOME
>
> That's one of those areas for discussion that is so likely to generate a
> flame war that I'm hesitant to wade in :)
>
> In general, I would agree with the sentiment expressed there:
>
> > If the task fails, that is unfortunate, but not the end of the world.
> > Other tasks should not be affected.
>
> which is, in fact, to a large extent exactly what Mesos does; the example
> given in MESOS-2684, as it happens, is for a "disk full" failure - carrying
> on as if nothing had happened is only likely to lead to further (and
> worse) disappointment.
>
> The general philosophy back at Google (which certainly informs the
> design of Borg[0]) was "fail early, fail hard", so that either (a) the
> service is restarted and hopefully the root cause cleared, or (b) someone
> (who can hopefully do something) will be alerted about it.
>
> I think it's ultimately a matter of scale: up to a few tens of servers,
> you can assume there is some sort of 'log-monitor' that looks out for
> errors and other anomalies and alerts humans, who will then take a look and
> possibly apply some corrective action - when you're up to hundreds or
> thousands (definitely Mesos territory), that's not practical: the system
> should either self-heal or crash-and-restart.
>
> All this to say that it's difficult to come up with a general *automated*
> approach to unequivocally decide whether a failure is "fatal" or could just
> be safely "ignored" (after appropriate error logging) - in general, when in
> doubt, it's probably safer to "noisily crash & restart" and rely on the
> overall system's HA architecture to take care of replication and
> consistency
> (and an intelligent monitoring system that only alerts when some failure
> threshold is exceeded).
>
> From what I've seen so far (granted, still a novice here) it seems that
> Mesos subscribes to this notion, assuming that Agent Nodes will come and
> go, and usually Tasks survive (for a certain amount of time, anyway) a
> Slave restart (obviously, if the physical h/w is the ultimate cause of
> failure, well, then all bets are off).
>
> Having said all that - if there are areas where we have been over-eager
> with our CHECKs, we should definitely revisit them and make the code more
> crash-resistant, absolutely.
>
> [0] http://research.google.com/pubs/pub43438.html
>
> *Marco Massenzio*
> *Distributed Systems Engineer http://codetrips.com <http://codetrips.com>*
>
> On Mon, Aug 31, 2015 at 12:47 PM, Steven Schlansker <
> sschlans...@opentable.com> wrote:
>>
>> On Aug 31, 2015, at 11:54 AM, Scott Rankin <sran...@motus.com> wrote:
>> >
>> > tag=mesos-slave[12858]: F0831 09:37:29.838184 12898 slave.cpp:3354]
>> > CHECK_SOME(os::touch(path)): Failed to open file: No such file or directory
>>
>> I reported a similar bug a while back:
>>
>> https://issues.apache.org/jira/browse/MESOS-2684
>>
>> This seems to be a class of bugs where some filesystem operations which
>> may fail for unforeseen reasons are written as assertions which crash the
>> process, rather than failing only the task and communicating back the
>> error reason.