If you could show the content of path in CHECK_SOME, it would be easier to debug here. According to the log in https://groups.google.com/forum/#!topic/marathon-framework/oKXhfQUcoMQ and the 0.22.1 code:
    const string& path = paths::getExecutorSentinelPath(
        metaDir, info.id(), framework->id, executor->id, executor->containerId);

    framework->id ==> 20141209-011108-1378273290-5050-23221-0001
    executor->id  ==> tools.1d52eed8-062c-11e5-90d3-f2a3161ca8ab

metaDir can be obtained from your slave work_dir, and info.id() is your slave id. Could you see the executor->containerId in the complete slave log? And if you can reproduce this problem every time, it would be very helpful if you added a trace log to the slave and recompiled it.

On Thu, Sep 3, 2015 at 12:49 AM, Tim Chen <t...@mesosphere.io> wrote:

> Hi Scott,
>
> I wonder if you can try the latest Mesos and see if you can repro this?
>
> And if it is, can you put down the example task and steps? I couldn't see
> a disk full in your slave log, so I'm not sure if it's exactly the same
> problem as MESOS-2684.
>
> Tim
>
> On Wed, Sep 2, 2015 at 5:15 AM, Scott Rankin <sran...@motus.com> wrote:
>
>> Hi Marco,
>>
>> I certainly don’t want to start a flame war, and I actually realized
>> after I added my comment to MESOS-2684 that it’s not quite the same thing.
>>
>> As far as I can tell, in our situation, there’s no underlying disk
>> issue. It seems like this is some sort of race condition (maybe?) with
>> Docker containers and executors shutting down. I’m perfectly happy with
>> Mesos choosing to shut down in the case of a failure or unexpected
>> situation – that’s a methodology we adopt ourselves. I’m just trying to
>> get a little more information about what the underlying issue is so that
>> we can resolve it. I don’t know enough about Mesos internals to be able
>> to answer that question just yet.
>>
>> It’s also inconvenient because, while Mesos is well-behaved and restarts
>> gracefully, as of 0.22.1 it’s not recovering the Docker executors – so a
>> mesos-slave crash also brings down applications.
>>
>> Thanks,
>> Scott
>>
>> From: Marco Massenzio
>> Reply-To: "user@mesos.apache.org"
>> Date: Tuesday, September 1, 2015 at 7:33 PM
>> To: "user@mesos.apache.org"
>> Subject: Re: mesos-slave crashing with CHECK_SOME
>>
>> That's one of those areas for discussion that is so likely to generate a
>> flame war that I'm hesitant to wade in :)
>>
>> In general, I would agree with the sentiment expressed there:
>>
>> > If the task fails, that is unfortunate, but not the end of the world.
>> > Other tasks should not be affected.
>>
>> which is, in fact, to a large extent exactly what Mesos does; the example
>> given in MESOS-2684, as it happens, is for a "disk full" failure;
>> carrying on as if nothing had happened is only likely to lead to further
>> (and worse) disappointment.
>>
>> The general philosophy back at Google (and one which certainly informs
>> the design of Borg [0]) was "fail early, fail hard", so that either (a)
>> the service is restarted and hopefully the root cause cleared, or (b)
>> someone (who can hopefully do something) is alerted about it.
>>
>> I think it's ultimately a matter of scale: up to a few tens of servers,
>> you can assume there is some sort of 'log monitor' that looks out for
>> errors and other anomalies and alerts humans, who will then take a look
>> and possibly apply some corrective action. When you're up to hundreds or
>> thousands of servers (definitely Mesos territory), that's not practical:
>> the system should either self-heal or crash-and-restart.
>>
>> All this to say that it's difficult to come up with a general *automated*
>> approach to unequivocally decide whether a failure is "fatal" or could
>> just be safely "ignored" (after appropriate error logging). In general,
>> when in doubt it's probably safer to "noisily crash & restart" and rely
>> on the overall system's HA architecture to take care of replication and
>> consistency (and an intelligent monitoring system that only alerts when
>> some failure threshold is exceeded).
>>
>> From what I've seen so far (granted, still a novice here), it seems that
>> Mesos subscribes to this notion, assuming that Agent Nodes will come and
>> go, and usually Tasks survive a Slave restart (for a certain amount of
>> time, anyway); obviously, if the physical h/w is the ultimate cause of
>> the failure, then all bets are off.
>>
>> Having said all that - if there are areas where we have been over-eager
>> with our CHECKs, we should definitely revisit that and make Mesos more
>> crash-resistant, absolutely.
>>
>> [0] http://research.google.com/pubs/pub43438.html
>>
>> Marco Massenzio
>> Distributed Systems Engineer
>> http://codetrips.com
>>
>> On Mon, Aug 31, 2015 at 12:47 PM, Steven Schlansker
>> <sschlans...@opentable.com> wrote:
>>
>>> On Aug 31, 2015, at 11:54 AM, Scott Rankin <sran...@motus.com> wrote:
>>> >
>>> > tag=mesos-slave[12858]: F0831 09:37:29.838184 12898 slave.cpp:3354]
>>> > CHECK_SOME(os::touch(path)): Failed to open file: No such file or
>>> > directory
>>>
>>> I reported a similar bug a while back:
>>>
>>> https://issues.apache.org/jira/browse/MESOS-2684
>>>
>>> This seems to be a class of bugs where filesystem operations that may
>>> fail for unforeseen reasons are written as assertions that crash the
>>> process, rather than failing only the task and communicating back the
>>> error reason.
>>>
>>
>
--
Best Regards,
Haosdent Huang