Re: mesos-slave crashing with CHECK_SOME

2015-09-02 Thread Steven Schlansker
I 100% agree with your philosophy here, and I suspect it's something shared in 
the Mesos community.

I just think that we can restrict the domain of the failure to a smaller, 
reasonable window -- once you are in the context of "I am doing work to launch 
a specific task", there is already a well-defined "success / failure / here is 
an error message" path.  Users expect tasks to fail and can see the errors.

I think that a lot of these assertions are in fact more appropriate as task 
failures.  But I agree that they should be fatal to *some* part of the system, 
just not the entire agent.
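
To make that concrete, here is a minimal, self-contained C++ sketch of the 
alternative. It is illustrative only, not Mesos source: Result, touchFile and 
launchTask are made-up stand-ins for Try<Nothing>, os::touch and the 
task-launch path. The filesystem error is returned to the caller and surfaced 
as a task failure while the process keeps running.

#include <fstream>
#include <iostream>
#include <string>

// Stand-in for a Try<Nothing>-style result: either success or an error message.
struct Result {
  bool ok;
  std::string error;
};

// Stand-in for os::touch(): create an empty file, reporting failure as a value
// instead of asserting on it.
Result touchFile(const std::string& path) {
  std::ofstream file(path);
  if (!file) {
    return {false, "Failed to open file: " + path};
  }
  return {true, ""};
}

// Task launch: on failure, return the reason to the caller so it can be
// surfaced as a per-task failure (e.g. TASK_FAILED), not a process abort.
Result launchTask(const std::string& sentinelPath) {
  Result touched = touchFile(sentinelPath);
  if (!touched.ok) {
    return {false, "Task launch failed: " + touched.error};
  }
  // ... continue launching the task ...
  return {true, ""};
}

int main() {
  // A path whose parent directory does not exist, to exercise the failure path.
  Result result = launchTask("/nonexistent-dir/sentinel");
  if (!result.ok) {
    // The error is reported for this task only; other tasks are unaffected.
    std::cerr << result.error << std::endl;
  }
  return 0;
}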

Re: mesos-slave crashing with CHECK_SOME

2015-09-02 Thread haosdent
If we could see the content of path in that CHECK_SOME message, it would be
easier to debug. According to the log in
https://groups.google.com/forum/#!topic/marathon-framework/oKXhfQUcoMQ and the
0.22.1 code:

const string& path = paths::getExecutorSentinelPath(
    metaDir, info.id(), framework->id, executor->id, executor->containerId);

framework->id ==> 20141209-011108-1378273290-5050-23221-0001
executor->id ==> tools.1d52eed8-062c-11e5-90d3-f2a3161ca8ab

metaDir can be derived from your slave work_dir and info.id() is your slave
id; can you find the executor->containerId in the complete slave log? And if
you can reproduce this problem every time, it would be very helpful to add a
trace log to the slave and recompile it.
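
For example, a temporary trace patch might look like the following. This is a
sketch only, not a tested change: it assumes the surrounding slave.cpp code
matches the 0.22.1 snippet quoted above, that glog-style LOG(INFO) is
available, and that the ID types are streamable (as they appear to be in the
agent logs).

// Hypothetical trace, inserted just before the CHECK_SOME(os::touch(path))
// at slave.cpp:3354, right after `path` is computed by the snippet above:
LOG(INFO) << "Touching executor sentinel path '" << path << "'"
          << " for framework " << framework->id
          << ", executor " << executor->id
          << ", container " << executor->containerId;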


Re: mesos-slave crashing with CHECK_SOME

2015-09-02 Thread Tim Chen
Hi Scott,

I wonder if you can try the latest Mesos and see if you can repro this?

And if it is reproducible, can you write down the example task and the steps?
I couldn't see a disk-full error in your slave log, so I'm not sure it's
exactly the same problem as MESOS-2684.

Tim


Re: mesos-slave crashing with CHECK_SOME

2015-09-02 Thread Marco Massenzio
@Steven - agreed!
As mentioned, if we can reduce the "footprint of unnecessary CHECKs" (so to
speak) I'm all for it - let's document and add Jiras for that, by all means.

@Scott - LoL: you certainly didn't; I was more worried my email would ;-)

Thanks, guys!

Marco Massenzio
Distributed Systems Engineer
http://codetrips.com



Re: mesos-slave crashing with CHECK_SOME

2015-09-02 Thread Scott Rankin
Hi Marco,

I certainly don’t want to start a flame war, and I actually realized after I 
added my comment to MESOS-2684 that it’s not quite the same thing.

As far as I can tell, in our situation, there’s no underlying disk issue.  It 
seems like this is some sort of race condition (maybe?) with Docker containers 
and executors shutting down.  I’m perfectly happy with Mesos choosing to shut 
down in the case of a failure or unexpected situation – that’s a methodology 
that we adopt ourselves.  I’m just trying to get a little more information 
about what the underlying issue is so that we can resolve it. I don’t know 
enough about Mesos internals to be able to answer that question just yet.

It’s also inconvenient because, while Mesos is well-behaved and restarts 
gracefully, as of 0.22.1, it’s not recovering the Docker executors – so a 
mesos-slave crash also brings down applications.

Thanks,
Scott


Re: mesos-slave crashing with CHECK_SOME

2015-09-01 Thread Marco Massenzio
That's one of those areas for discussion that is so likely to generate a
flame war that I'm hesitant to wade in :)

In general, I would agree with the sentiment expressed there:

> If the task fails, that is unfortunate, but not the end of the world.
Other tasks should not be affected.

which is, in fact, to a large extent exactly what Mesos does; the example
given in MESOS-2684, as it happens, is of a "disk full" failure, and carrying
on as if nothing had happened is only likely to lead to further (and worse)
disappointment.

The general philosophy back at Google (which certainly informs the design of
Borg[0]) was "fail early, fail hard", so that either (a) the service is
restarted and the root cause hopefully cleared, or (b) someone (who can
hopefully do something) is alerted about it.

I think it's ultimately a matter of scale: with up to a few tens of servers,
you can assume there is some sort of 'log monitor' that looks out for errors
and other anomalies and alerts humans, who will then take a look and possibly
apply some corrective action. When you're up to hundreds or thousands of
servers (definitely Mesos territory) that's not practical: the system should
either self-heal or crash-and-restart.

All this is to say that it's difficult to come up with a general *automated*
approach to unequivocally decide whether a failure is "fatal" or could just be
safely "ignored" (after appropriate error logging). In general, when in doubt
it's probably safer to "noisily crash & restart" and rely on the overall
system's HA architecture to take care of replication and consistency.
(and an intelligent monitoring system that only alerts when some failure
threshold is exceeded).

From what I've seen so far (granted, still a novice here) it seems that
Mesos subscribes to this notion, assuming that Agent Nodes will come and
go, and Tasks usually survive a Slave restart (for a certain amount of time,
anyway); obviously, if the physical h/w is the ultimate cause of the failure,
all bets are off.

Having said all that - if there are areas where we have been over-eager
with our CHECKs, we should definitely revisit that and make it more
crash-resistant, absolutely.

[0] http://research.google.com/pubs/pub43438.html

Marco Massenzio
Distributed Systems Engineer
http://codetrips.com



mesos-slave crashing with CHECK_SOME

2015-08-31 Thread Scott Rankin
Hi all,

We are running Mesos 0.22.1 on CentOS 6 and are hitting some frequent 
mesos-slave crashes when we try to upgrade our Marathon applications.  The 
crash happens when Marathon deploys a new version of an application and stops a 
running task.  The error in the Mesos logs is:

tag=mesos-slave[12858]:  F0831 09:37:29.838184 12898 slave.cpp:3354] 
CHECK_SOME(os::touch(path)): Failed to open file: No such file or directory
tag=mesos-slave[12858]:  *** Check failure stack trace: ***
tag=mesos-slave[12858]:  @   0x36a46765cd  (unknown)
tag=mesos-slave[12858]:  @   0x36a467a5e7  (unknown)
tag=mesos-slave[12858]:  @   0x36a4678469  (unknown)
tag=mesos-slave[12858]:  @   0x36a467876d  (unknown)
tag=mesos-slave[12858]:  @   0x36a3fc5696  (unknown)
tag=mesos-slave[12858]:  @   0x36a421855a  (unknown)
tag=mesos-slave[12858]:  @   0x36a421c0a9  (unknown)
tag=mesos-slave[12858]:  @   0x36a42510ff  (unknown)
tag=mesos-slave[12858]:  @   0x36a4618b83  (unknown)
tag=mesos-slave[12858]:  @   0x36a461978c  (unknown)
tag=mesos-slave[12858]:  @   0x3699407a51  (unknown)
tag=mesos-slave[12858]:  @   0x36990e89ad  (unknown)
tag=init:  mesos-slave main process (12858) killed by ABRT signal

It appears in the log immediately after the Docker container stops.  The 
mesos-slave process respawns, but in doing so kills all of the running Docker 
containers on that slave.  It then appears that the mesos-slave process 
terminates a second time, then comes up successfully.  The logs from this 
process are below.

This has been reported by at least one other Marathon user here:  
https://groups.google.com/forum/#!topic/marathon-framework/oKXhfQUcoMQ

Any advice on how to go about troubleshooting this would be most appreciated!

Thanks,
Scott



tag=mesos-slave[17756]:  W0831 09:37:42.474733 17783 slave.cpp:2568] Could not 
find the executor for status update TASK_FINISHED (UUID: 
8583e68d-99f0-4a89-a0fd-af5012a1b35d) for task 
app_pingfederate-console.37953216-4ffe-11e5-bd36-005056a00679 of framework 
20141209-011108-1378273290-5050-23221-0001
tag=mesos-slave[17756]:  W0831 09:37:42.861536 17781 slave.cpp:2557] Ignoring 
status update TASK_FINISHED (UUID: 7251ad5f-7850-471f-9976-b7162e183d0e) for 
task app_legacy.74d76339-4c08-11e5-bd36-005056a00679 of framework 
20141209-011108-1378273290-5050-23221-0001 for terminating framework 
20141209-011108-1378273290-5050-23221-0001
tag=mesos-slave[17756]:  W0831 09:37:42.962225 17779 slave.cpp:2557] Ignoring 
status update TASK_FINISHED (UUID: b6c60f4b-3e7d-46f9-ad54-630f5be1241f) for 
task app_pingfederate-engine.aa4f77a1-46ce-11e5-bd36-005056a00679 of framework 
20141209-011108-1378273290-5050-23221-0001 for terminating framework 
20141209-011108-1378273290-5050-23221-0001
tag=mesos-slave[17756]:  W0831 09:37:43.363952 17780 slave.cpp:2557] Ignoring 
status update TASK_FAILED (UUID: 0d44ee67-f9e3-48d7-b4e1-39d66babcd42) for task 
marathon-hipache-bridge.1461d8c2-411a-11e5-bd36-005056a00679 of framework 
20141209-011108-1378273290-5050-23221-0001 for terminating framework 
20141209-011108-1378273290-5050-23221-0001
tag=mesos-slave[17756]:  W0831 09:37:46.479511 17781 slave.cpp:2557] Ignoring 
status update TASK_FINISHED (UUID: f0cf57e3-3cbd-43f2-bbb2-55ad442a8abc) for 
task service_userservice.b4d14d32-45b9-11e5-bd36-005056a00679 of framework 
20141209-011108-1378273290-5050-23221-0001 for terminating framework 
20141209-011108-1378273290-5050-23221-0001
tag=mesos-slave[17756]:  W0831 09:37:52.476265 17779 
status_update_manager.cpp:472] Resending status update TASK_FINISHED (UUID: 
8583e68d-99f0-4a89-a0fd-af5012a1b35d) for task 
app_pingfederate-console.37953216-4ffe-11e5-bd36-005056a00679 of framework 
20141209-011108-1378273290-5050-23221-0001
tag=mesos-slave[17756]:  W0831 09:37:52.476434 17779 slave.cpp:2731] Dropping 
status update TASK_FINISHED (UUID: 8583e68d-99f0-4a89-a0fd-af5012a1b35d) for 
task app_pingfederate-console.37953216-4ffe-11e5-bd36-005056a00679 of framework 
20141209-011108-1378273290-5050-23221-0001 sent by status update manager 
because the slave is in TERMINATING state
tag=mesos-slave[17756]:  W0831 09:37:54.727569 17782 slave.cpp:2557] Ignoring 
status update TASK_FAILED (UUID: c5e4092e-75cd-44c8-9ee5-efc53f304df3) for task 
service_tripbatchservice.6228c9d7-4a99-11e5-bd36-005056a00679 of framework 
20141209-011108-1378273290-5050-23221-0001 for terminating framework 
20141209-011108-1378273290-5050-23221-0001
tag=mesos-slave[17756]:  W0831 09:37:54.814648 17782 slave.cpp:2557] Ignoring 
status update TASK_FAILED (UUID: a681b752-9522-4acf-8c9f-c6530999d096) for task 
service_mapservice.18904037-411a-11e5-bd36-005056a00679 of framework 
20141209-011108-1378273290-5050-23221-0001 for terminating framework 
20141209-011108-1378273290-5050-23221-0001
tag=mesos-slave[17756]:  E0831 09:37:57.225787 17783 slave.cpp:3112] Container 
'f3da678a-e566-4179-b66a-084e055d32e4' for 

Re: mesos-slave crashing with CHECK_SOME

2015-08-31 Thread Steven Schlansker


On Aug 31, 2015, at 11:54 AM, Scott Rankin  wrote:
> 
> tag=mesos-slave[12858]:  F0831 09:37:29.838184 12898 slave.cpp:3354] 
> CHECK_SOME(os::touch(path)): Failed to open file: No such file or directory 

I reported a similar bug a while back:

https://issues.apache.org/jira/browse/MESOS-2684

This seems to be a class of bugs where filesystem operations that may fail 
for unforeseen reasons are written as assertions that crash the process, 
rather than failing only the task and communicating the error reason back.
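
For comparison, here is a minimal, self-contained sketch of that failure mode.
It is not Mesos or stout source: TrySketch and CHECK_OK_SKETCH are made-up
stand-ins for Try<Nothing> and CHECK_SOME. An assertion on the filesystem
result turns one missing path into an abort of the whole process, taking every
task on the agent down with it.

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>

// Stand-in for Try<Nothing>: either success or an error message.
struct TrySketch {
  bool ok;
  std::string error;
};

// Assert-style check: if the result carries an error, log it and abort the
// whole process -- every task on this agent goes down with it.
#define CHECK_OK_SKETCH(expr)                                        \
  do {                                                               \
    TrySketch _t = (expr);                                           \
    if (!_t.ok) {                                                    \
      std::cerr << "CHECK failed: " #expr ": " << _t.error << "\n";  \
      std::abort();                                                  \
    }                                                                \
  } while (false)

// Stand-in for os::touch(): create an empty file, reporting failure as a value.
TrySketch touchFile(const std::string& path) {
  std::ofstream file(path);
  if (!file) {
    return {false, "Failed to open file: " + path};
  }
  return {true, ""};
}

int main() {
  // If the directory was already cleaned up (e.g. by a racing executor
  // teardown), this single missing path aborts the entire process instead of
  // failing one task.
  CHECK_OK_SKETCH(touchFile("/nonexistent-dir/sentinel"));
  return 0;
}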