[jira] [Commented] (MESOS-7566) Master crash due to failed check in DRFSorter::remove

2017-05-31 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16032442#comment-16032442
 ] 

Yan Xu commented on MESOS-7566:
---

Certain scenarios do seem problematic to me, e.g.,

- The agent's {{UpdateSlaveMessage}} reduces the oversubscribed resources.
- {{Master::updateSlave}} upon receiving the update would first call 
{{HierarchicalAllocatorProcess::updateSlave}}, followed by 
{{allocator->recoverResources}}.
- {{HierarchicalAllocatorProcess::updateSlave}} would update 
{{roleSorter.total_}} to reduce the total, so the total could go below the 
allocation.
- In the subsequent {{allocator->recoverResources}} call the attempt to remove 
outstanding allocation may fail to reduce it to below the total because some 
allocation may not be in outstanding offers. It could be in offered resources 
pending between {{Master::accept}} and {{Master::_accept}}. So the end result 
could still be {{total < allocation}}.
- Then, when {{Master::_accept}} is executed, it calls 
{{allocator->updateAllocation}}, in which the {{total < allocation}} condition 
could trigger the crash.
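
The sequence above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the Mesos allocator: the class {{ToySorter}}, its fields, and the numbers (30, 4, 26, 25) are hypothetical and chosen for illustration. It shows how the total can drop below the allocation when part of the allocation is pending in an accept and cannot be recovered from outstanding offers.

```python
# Toy model of the scenario described above. NOT Mesos code: ToySorter and
# all numbers below are hypothetical and only illustrate the ordering problem.

class ToySorter:
    def __init__(self, total):
        self.total = total      # resources the sorter believes exist
        self.allocation = 0     # resources currently counted as allocated

    def allocate(self, amount):
        self.allocation += amount

    def unallocate(self, amount):
        self.allocation -= amount

    def invariant_holds(self):
        # The condition the failed check relies on in this toy model.
        return self.allocation <= self.total


sorter = ToySorter(total=30)

outstanding_offer = 4   # still sitting in an outstanding offer
pending_accept = 26     # accepted by a framework, accept not yet processed
sorter.allocate(outstanding_offer + pending_accept)

# Step 1: the agent reduces its oversubscribed resources; total is updated first.
sorter.total = 25

# Step 2: recovery can only take back what is still in outstanding offers.
sorter.unallocate(outstanding_offer)

# Step 3: the pending accept is processed while total < allocation.
print(sorter.total, sorter.allocation, sorter.invariant_holds())
```

In this sketch the invariant is already broken before the pending accept is processed, which is the window the comment points at.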

The root issue indeed looks to be MESOS-4553.

/cc [~bmahler] [~mcypark]

> Master crash due to failed check in DRFSorter::remove
> -
>
> Key: MESOS-7566
> URL: https://issues.apache.org/jira/browse/MESOS-7566
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.1, 1.1.2
>Reporter: Zhitao Li
>Priority: Critical
>
> A check in [sorter.cpp#L355 in 1.1.2 | 
> https://github.com/apache/mesos/blob/1.1.2/src/master/allocator/sorter/drf/sorter.cpp#L355]
>  is triggered occasionally in our cluster and crashes the master leader.
> I manually modified that check to print out the related variables, and the 
> following is a master log.
> https://gist.github.com/zhitaoli/0662d9fe1f6d57de344951c05b536bad#file-gistfile1-txt
> From the log, it seems like the check was using a stale revocable CPU value 
> of {{26}} while the new value had been updated to 25, so the check failed.
> So far, the two verified occurrences of this bug were both observed near an 
> {{UNRESERVE}} operation (see lines above in the log).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7597) libprocess build is broken

2017-05-31 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-7597:
-
Shepherd: Joseph Wu

Fix: https://reviews.apache.org/r/59691/

> libprocess build is broken
> --
>
> Key: MESOS-7597
> URL: https://issues.apache.org/jira/browse/MESOS-7597
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
> Environment: Windows
>Reporter: Andrew Schwartzmeyer
>Assignee: Andrew Schwartzmeyer
>Priority: Critical
>
> Commit 8fbbebfb6 broke the build:
> C:\Users\andschwa\src\mesos\3rdparty\libprocess\src\process.cpp(2877): error 
> C2882: 'flags': illegal use of namespace identifier in expression
> This is probably due to the use of pre-compiled headers on Windows (`flags` 
> as an identifier has been problematic).





[jira] [Assigned] (MESOS-7597) libprocess build is broken

2017-05-31 Thread Andrew Schwartzmeyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Schwartzmeyer reassigned MESOS-7597:
---

Assignee: Andrew Schwartzmeyer

> libprocess build is broken
> --
>
> Key: MESOS-7597
> URL: https://issues.apache.org/jira/browse/MESOS-7597
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
> Environment: Windows
>Reporter: Andrew Schwartzmeyer
>Assignee: Andrew Schwartzmeyer
>Priority: Critical
>
> Commit 8fbbebfb6 broke the build:
> C:\Users\andschwa\src\mesos\3rdparty\libprocess\src\process.cpp(2877): error 
> C2882: 'flags': illegal use of namespace identifier in expression
> This is probably due to the use of pre-compiled headers on Windows (`flags` 
> as an identifier has been problematic).





[jira] [Created] (MESOS-7597) libprocess build is broken

2017-05-31 Thread Andrew Schwartzmeyer (JIRA)
Andrew Schwartzmeyer created MESOS-7597:
---

 Summary: libprocess build is broken
 Key: MESOS-7597
 URL: https://issues.apache.org/jira/browse/MESOS-7597
 Project: Mesos
  Issue Type: Bug
  Components: libprocess
 Environment: Windows
Reporter: Andrew Schwartzmeyer
Priority: Critical


Commit 8fbbebfb6 broke the build:

C:\Users\andschwa\src\mesos\3rdparty\libprocess\src\process.cpp(2877): error 
C2882: 'flags': illegal use of namespace identifier in expression

This is probably due to the use of pre-compiled headers on Windows (`flags` as 
an identifier has been problematic).





[jira] [Commented] (MESOS-7292) Introduce a "sensitive mode" in Mesos which prevents leaks of sensitive data.

2017-05-31 Thread Subodh Pachghare (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16031759#comment-16031759
 ] 

Subodh Pachghare commented on MESOS-7292:
-

Just a few suggestions - 

1. Implement encryption logic for env variables, something like Travis CI.
2. Vault integrations into Mesos.
3. Obfuscation in logs of details for variables with a SECRET_* prefix.

Thanks,
Subodh Pachghare
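
A minimal sketch of suggestion 3, assuming the environment is available as a plain dictionary before it is logged. The helper {{redact_env}}, the placeholder string, and the sample variables are hypothetical and not part of Mesos.

```python
# Hypothetical helper illustrating prefix-based redaction before logging.
# Nothing here is a Mesos API; names and values are made up for the example.

REDACTED = "<redacted>"


def redact_env(env, prefix="SECRET_"):
    """Return a copy of env in which values of prefixed keys are hidden."""
    return {
        key: (REDACTED if key.startswith(prefix) else value)
        for key, value in env.items()
    }


env = {"PATH": "/usr/bin", "SECRET_DB_PASSWORD": "example-password"}
safe = redact_env(env)

# Only the redacted copy is logged; the original stays intact for the task.
print(safe)
```

The design choice here is to redact at the logging boundary, so the task still receives the real values.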

> Introduce a "sensitive mode" in Mesos which prevents leaks of sensitive data.
> -
>
> Key: MESOS-7292
> URL: https://issues.apache.org/jira/browse/MESOS-7292
> Project: Mesos
>  Issue Type: Improvement
>  Components: security
>Reporter: Alexander Rukletsov
>  Labels: mesosphere, security
>
> Consider the following scenario. A user passes some sensitive data in an 
> environment variable to a task. These data may be logged by Mesos components, 
> e.g., the executor as part of a {{mesos-containerizer}} invocation. While this 
> is useful for debugging, it might be an issue in some production environments.
> One solution is to have a global "sensitive mode" that turns off 
> logging of such sensitive data.





[jira] [Comment Edited] (MESOS-3453) Add patch file to silence deprecation warnings when we compile protobufs on Windows

2017-05-31 Thread Andrew Schwartzmeyer (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16031739#comment-16031739
 ] 

Andrew Schwartzmeyer edited comment on MESOS-3453 at 5/31/17 7:05 PM:
--

Resolved with 
https://github.com/apache/mesos/commit/5e86f8f4d8b15de1e409bbe17b0c71ee7ed62035

There is no patch; the patch is that we use the newer version of Protobuf.


was (Author: andschwa):
Resolved with 
https://github.com/apache/mesos/commit/5e86f8f4d8b15de1e409bbe17b0c71ee7ed62035

There is not patch; the patch is that we use the newer version of Protobuf.

> Add patch file to silence deprecation warnings when we compile protobufs on 
> Windows
> ---
>
> Key: MESOS-3453
> URL: https://issues.apache.org/jira/browse/MESOS-3453
> Project: Mesos
>  Issue Type: Task
>  Components: cmake
>Reporter: Alex Clemmer
>Assignee: Andrew Schwartzmeyer
>  Labels: cmake, mesosphere, microsoft, windows-mvp
>
> Right now when you compile Protobuf v2.5.0, it gives you deprecation warnings 
> because stdext was removed. You can silence these, but it will require either 
> submitting a PR to the project or adding a patch file to be applied to the 
> repo when you untar it.





[jira] [Updated] (MESOS-6975) Prevent pre-1.0 agents from registering with 1.3+ master.

2017-05-31 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-6975:
---
Target Version/s: 1.4.0  (was: 1.3.1, 1.4.0)

> Prevent pre-1.0 agents from registering with 1.3+ master.
> -
>
> Key: MESOS-6975
> URL: https://issues.apache.org/jira/browse/MESOS-6975
> Project: Mesos
>  Issue Type: Epic
>  Components: master
>Reporter: Neil Conway
>Assignee: Neil Conway
> Fix For: 1.4.0
>
>
> https://www.mail-archive.com/dev@mesos.apache.org/msg37194.html





[jira] [Commented] (MESOS-7596) Multiple registration attempts might result in agent shutdown.

2017-05-31 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16031607#comment-16031607
 ] 

Neil Conway commented on MESOS-7596:


Related: https://reviews.apache.org/r/59685

> Multiple registration attempts might result in agent shutdown.
> --
>
> Key: MESOS-7596
> URL: https://issues.apache.org/jira/browse/MESOS-7596
> Project: Mesos
>  Issue Type: Bug
>Reporter: Neil Conway
>  Labels: mesosphere
>
> This sequence of events is possible:
> # Agent sends register message M1 to master.
> # Agent register timer expires, sends register message M2 to master.
> # Master sees M1 and adds agent with ID A1.
> # Agent gets SlaveRegisteredMessage with ID A1.
> # The master <-> agent socket breaks; the master marks the agent as 
> disconnected.
> # Master sees M2; since the agent is currently disconnected, the master 
> removes A1 and adds the agent with ID A2.
> # Agent gets SlaveRegisteredMessage with ID A2. Since this is unexpected, the 
> agent exits ("Registered but got wrong id").
> Shutting down the agent is unfortunate, although arguably not catastrophic.





[jira] [Updated] (MESOS-7596) Multiple registration attempts might result in agent shutdown.

2017-05-31 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-7596:
---
Summary: Multiple registration attempts might result in agent shutdown.  
(was: Multiple registration attempts might result in agent shutdown)

> Multiple registration attempts might result in agent shutdown.
> --
>
> Key: MESOS-7596
> URL: https://issues.apache.org/jira/browse/MESOS-7596
> Project: Mesos
>  Issue Type: Bug
>Reporter: Neil Conway
>  Labels: mesosphere
>
> This sequence of events is possible:
> # Agent sends register message M1 to master.
> # Agent register timer expires, sends register message M2 to master.
> # Master sees M1 and adds agent with ID A1.
> # Agent gets SlaveRegisteredMessage with ID A1.
> # The master <-> agent socket breaks; the master marks the agent as 
> disconnected.
> # Master sees M2; since the agent is currently disconnected, the master 
> removes A1 and adds the agent with ID A2.
> # Agent gets SlaveRegisteredMessage with ID A2. Since this is unexpected, the 
> agent exits ("Registered but got wrong id").
> Shutting down the agent is unfortunate, although arguably not catastrophic.





[jira] [Created] (MESOS-7596) Multiple registration attempts might result in agent shutdown

2017-05-31 Thread Neil Conway (JIRA)
Neil Conway created MESOS-7596:
--

 Summary: Multiple registration attempts might result in agent 
shutdown
 Key: MESOS-7596
 URL: https://issues.apache.org/jira/browse/MESOS-7596
 Project: Mesos
  Issue Type: Bug
Reporter: Neil Conway


This sequence of events is possible:

# Agent sends register message M1 to master.
# Agent register timer expires, sends register message M2 to master.
# Master sees M1 and adds agent with ID A1.
# Agent gets SlaveRegisteredMessage with ID A1.
# The master <-> agent socket breaks; the master marks the agent as 
disconnected.
# Master sees M2; since the agent is currently disconnected, the master removes 
A1 and adds the agent with ID A2.
# Agent gets SlaveRegisteredMessage with ID A2. Since this is unexpected, the 
agent exits ("Registered but got wrong id").

Shutting down the agent is unfortunate, although arguably not catastrophic.
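
The sequence above can be replayed with a small simulation. It is a sketch under stated assumptions, not the Mesos implementation: {{ToyMaster}} and {{ToyAgent}} are hypothetical, the model tracks a single agent, and the rule that a register message from a connected agent keeps its existing ID is an assumption made for the toy model.

```python
# Toy replay of the seven steps above. NOT Mesos code: ToyMaster, ToyAgent
# and their methods are hypothetical and model a single agent only.

class ToyMaster:
    def __init__(self):
        self.counter = 0
        self.current_id = None
        self.connected = False

    def handle_register(self):
        # Assumption: a register message from a connected agent keeps its ID.
        if self.current_id is not None and self.connected:
            return self.current_id
        # Unknown or disconnected agent: drop the old entry, assign a new ID.
        self.counter += 1
        self.current_id = f"A{self.counter}"
        self.connected = True
        return self.current_id

    def mark_disconnected(self):
        self.connected = False


class ToyAgent:
    def __init__(self):
        self.agent_id = None
        self.alive = True

    def on_registered(self, agent_id):
        if self.agent_id is None:
            self.agent_id = agent_id
        elif self.agent_id != agent_id:
            self.alive = False  # models "Registered but got wrong id"


master, agent = ToyMaster(), ToyAgent()

# Steps 1-2: M1 and M2 are both in flight before the master handles either.
first_id = master.handle_register()   # step 3: master sees M1
agent.on_registered(first_id)         # step 4
master.mark_disconnected()            # step 5: socket breaks
second_id = master.handle_register()  # step 6: master sees M2
agent.on_registered(second_id)        # step 7

print(first_id, second_id, agent.alive)
```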





[jira] [Commented] (MESOS-7587) Launching tasks with the Mesos Containerizer after a long time without launching new tasks fails

2017-05-31 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16031590#comment-16031590
 ] 

Anand Mazumdar commented on MESOS-7587:
---

[~jieyu] [~gilbert] Can you folks help take a look?

> Launching tasks with the Mesos Containerizer after a long time without 
> launching new tasks fails
> 
>
> Key: MESOS-7587
> URL: https://issues.apache.org/jira/browse/MESOS-7587
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.2.0
>Reporter: Alexander Rojas
>Priority: Critical
>  Labels: mesos-containerizer, mesosphere
>
> After a cluster has been running without launching new tasks for an extended 
> period of time (~1 week), launching a new task using the Mesos 
> Containerizer fails with the error:
> [{{Failed to execute command: No such file or 
> directory}}|https://github.com/apache/mesos/blob/8245981b889ec3725cc0be4150b15d1fe9d64b86/src/slave/containerizer/mesos/launch.cpp#L778]
> The task is launched from Marathon with the app definition:
> {code}
> {
> "container": {
> "type": "MESOS",
> "docker": {
> "forcePullImage": true,
> "image": "private.repository.local/updated:fixed",
> "privileged": false
> }
> },
> "cpus": 0.1,
> "id": "/20150530/mesos9",
> "instances": 1,
> "minimumHealthCapacity": 1,
> "acceptedResourceRoles": ["*"],
> "constraints": [["hostname", "UNIQUE"]],
> "mem": 128
> }
> {code}
> and {{Dockerfile}}
> {code}
> FROM private.repository.local/centos:stable
> MAINTAINER Sebastian Gerlach "s...@boreus.de"
> CMD python -m SimpleHTTPServer 80
> {code}
> The obtained stdout is:
> {noformat}
> Executing pre-exec command 
> '{"arguments":["mesos-containerizer","mount","--help=false","--operation=make-rslave","--path=\/"],"shell":false,"value":"\/usr\/libexec\/mesos\/mesos-containerizer"}'
> Executing pre-exec command 
> '{"arguments":["mount","-n","--rbind","\/var\/lib\/mesos\/slaves\/562f1892-41c6-4c55-955e-662c63ab3ddf-S8\/frameworks\/562f1892-41c6-4c55-955e-662c63ab3ddf-\/executors\/20150530_mesos13.5f8c3db1-4517-11e7-8d95-024276be29e0\/runs\/da1cb066-1354-42a6-82bc-eee688d61b94","\/var\/lib\/mesos\/provisioner\/containers\/da1cb066-1354-42a6-82bc-eee688d61b94\/backends\/overlay\/rootfses\/f74816b3-c115-4a5e-a51c-fbd502efd1fe\/mnt\/mesos\/sandbox"],"shell":false,"value":"mount"}'
> Received SUBSCRIBED event
> Subscribed executor on bp-mesos8.private.local
> Received LAUNCH event
> Starting task 20150530_mesos13.5f8c3db1-4517-11e7-8d95-024276be29e0
> /usr/libexec/mesos/mesos-containerizer launch --help="false" 
> --launch_info="{"command":{"arguments":["\/bin\/sh","-c","python -m 
> SimpleHTTPServer 
> 

[jira] [Updated] (MESOS-7587) Launching tasks with the Mesos Containerizer after a long time without launching new tasks fails

2017-05-31 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7587:
--
Priority: Critical  (was: Major)

> Launching tasks with the Mesos Containerizer after a long time without 
> launching new tasks fails
> 
>
> Key: MESOS-7587
> URL: https://issues.apache.org/jira/browse/MESOS-7587
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.2.0
>Reporter: Alexander Rojas
>Priority: Critical
>  Labels: mesos-containerizer, mesosphere
>
> After a cluster has been running without launching new tasks for an extended 
> period of time (~1 week), launching a new task using the Mesos 
> Containerizer fails with the error:
> [{{Failed to execute command: No such file or 
> directory}}|https://github.com/apache/mesos/blob/8245981b889ec3725cc0be4150b15d1fe9d64b86/src/slave/containerizer/mesos/launch.cpp#L778]
> The task is launched from Marathon with the app definition:
> {code}
> {
> "container": {
> "type": "MESOS",
> "docker": {
> "forcePullImage": true,
> "image": "private.repository.local/updated:fixed",
> "privileged": false
> }
> },
> "cpus": 0.1,
> "id": "/20150530/mesos9",
> "instances": 1,
> "minimumHealthCapacity": 1,
> "acceptedResourceRoles": ["*"],
> "constraints": [["hostname", "UNIQUE"]],
> "mem": 128
> }
> {code}
> and {{Dockerfile}}
> {code}
> FROM private.repository.local/centos:stable
> MAINTAINER Sebastian Gerlach "s...@boreus.de"
> CMD python -m SimpleHTTPServer 80
> {code}
> The obtained stdout is:
> {noformat}
> Executing pre-exec command 
> '{"arguments":["mesos-containerizer","mount","--help=false","--operation=make-rslave","--path=\/"],"shell":false,"value":"\/usr\/libexec\/mesos\/mesos-containerizer"}'
> Executing pre-exec command 
> '{"arguments":["mount","-n","--rbind","\/var\/lib\/mesos\/slaves\/562f1892-41c6-4c55-955e-662c63ab3ddf-S8\/frameworks\/562f1892-41c6-4c55-955e-662c63ab3ddf-\/executors\/20150530_mesos13.5f8c3db1-4517-11e7-8d95-024276be29e0\/runs\/da1cb066-1354-42a6-82bc-eee688d61b94","\/var\/lib\/mesos\/provisioner\/containers\/da1cb066-1354-42a6-82bc-eee688d61b94\/backends\/overlay\/rootfses\/f74816b3-c115-4a5e-a51c-fbd502efd1fe\/mnt\/mesos\/sandbox"],"shell":false,"value":"mount"}'
> Received SUBSCRIBED event
> Subscribed executor on bp-mesos8.private.local
> Received LAUNCH event
> Starting task 20150530_mesos13.5f8c3db1-4517-11e7-8d95-024276be29e0
> /usr/libexec/mesos/mesos-containerizer launch --help="false" 
> --launch_info="{"command":{"arguments":["\/bin\/sh","-c","python -m 
> SimpleHTTPServer 
> 

[jira] [Assigned] (MESOS-7595) Implement local resource provider registration

2017-05-31 Thread Jan Schlicht (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht reassigned MESOS-7595:
---

Assignee: Jan Schlicht

> Implement local resource provider registration
> --
>
> Key: MESOS-7595
> URL: https://issues.apache.org/jira/browse/MESOS-7595
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>  Labels: mesosphere
>
> A {{resource_provider::Call::SUBSCRIBE}} call of a resource provider should 
> add that one to the list of registered resource providers in the master.





[jira] [Created] (MESOS-7595) Implement local resource provider registration

2017-05-31 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-7595:
---

 Summary: Implement local resource provider registration
 Key: MESOS-7595
 URL: https://issues.apache.org/jira/browse/MESOS-7595
 Project: Mesos
  Issue Type: Task
  Components: master
Reporter: Jan Schlicht


A {{resource_provider::Call::SUBSCRIBE}} call of a resource provider should add 
that one to the list of registered resource providers in the master.
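
As a rough illustration of the intended behavior, the sketch below keeps a registry of subscribed resource providers. It is not the Mesos master or its protobuf API: the class {{ToyProviderRegistry}}, the dictionary-shaped call, and the {{SUBSCRIBED}} reply are hypothetical stand-ins.

```python
# Hypothetical registry sketch; the message shapes are plain dicts made up
# for this example and do not reflect the real resource_provider::Call type.
import uuid


class ToyProviderRegistry:
    def __init__(self):
        self.registered = {}  # provider id -> provider info

    def handle_call(self, call):
        if call["type"] == "SUBSCRIBE":
            provider_id = str(uuid.uuid4())
            self.registered[provider_id] = call["info"]
            return {"type": "SUBSCRIBED", "provider_id": provider_id}
        raise ValueError(f"unsupported call type: {call['type']}")


registry = ToyProviderRegistry()
reply = registry.handle_call(
    {"type": "SUBSCRIBE", "info": {"name": "example-provider"}}
)
print(reply["type"], len(registry.registered))
```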





[jira] [Updated] (MESOS-7595) Implement local resource provider registration

2017-05-31 Thread Jan Schlicht (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht updated MESOS-7595:

Shepherd: Jie Yu

> Implement local resource provider registration
> --
>
> Key: MESOS-7595
> URL: https://issues.apache.org/jira/browse/MESOS-7595
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>  Labels: mesosphere
>
> A {{resource_provider::Call::SUBSCRIBE}} call of a resource provider should 
> add that one to the list of registered resource providers in the master.





[jira] [Created] (MESOS-7592) Add handling of local resource providers to the master

2017-05-31 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-7592:
---

 Summary: Add handling of local resource providers to the master
 Key: MESOS-7592
 URL: https://issues.apache.org/jira/browse/MESOS-7592
 Project: Mesos
  Issue Type: Task
  Components: master
Reporter: Jan Schlicht
Assignee: Jan Schlicht


To support local resource providers, the master has to keep track of the 
registered ones and their allocated resources/outstanding offers. This is 
similar to how it's already done for agents, hence the existing functionality 
could be abstracted and reused for local resource providers.





[jira] [Commented] (MESOS-7587) Launching tasks with the Mesos Containerizer after a long time without launching new tasks fails

2017-05-31 Thread Sebastian Gerlach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16030827#comment-16030827
 ] 

Sebastian Gerlach commented on MESOS-7587:
--

Environment:
- CentOS Linux release 7.3.1611 (Core)
- mesos.x86_64 1.2.0-2.0.6
- mesosphere-el-repo.noarch 7-3
- marathon 1.4.3
- calico v1.1.3
- 3 masters
- 6 slaves

slave configuration:
{code}
# attributes
hostname:bp-mesos7;manufacturer:VMware, Inc.
# containerizers
mesos,docker
# image_providers
docker
# isolation
docker/runtime,filesystem/linux,cgroups/cpu,cgroups/mem,cgroups/devices,cgroups/net_cls,disk/du
# network_cni_config_dir
/etc/calico/mesos
# network_cni_plugins_dir
/usr/share/calico
# resources
file:///etc/mesos-resources/resources.json
# work_dir
/var/lib/mesos
{code}

master configuration:
{code}
# ip
10.XXX.XXX.XXX
# quorum
2
# work_dir
/var/lib/mesos
{code}

> Launching tasks with the Mesos Containerizer after a long time without 
> launching new tasks fails
> 
>
> Key: MESOS-7587
> URL: https://issues.apache.org/jira/browse/MESOS-7587
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.2.0
>Reporter: Alexander Rojas
>  Labels: mesos-containerizer, mesosphere
>
> After a cluster has been running without launching new tasks for an extended 
> period of time (~1 week), launching a new task using the Mesos 
> Containerizer fails with the error:
> [{{Failed to execute command: No such file or 
> directory}}|https://github.com/apache/mesos/blob/8245981b889ec3725cc0be4150b15d1fe9d64b86/src/slave/containerizer/mesos/launch.cpp#L778]
> The task is launched from Marathon with the app definition:
> {code}
> {
> "container": {
> "type": "MESOS",
> "docker": {
> "forcePullImage": true,
> "image": "private.repository.local/updated:fixed",
> "privileged": false
> }
> },
> "cpus": 0.1,
> "id": "/20150530/mesos9",
> "instances": 1,
> "minimumHealthCapacity": 1,
> "acceptedResourceRoles": ["*"],
> "constraints": [["hostname", "UNIQUE"]],
> "mem": 128
> }
> {code}
> and {{Dockerfile}}
> {code}
> FROM private.repository.local/centos:stable
> MAINTAINER Sebastian Gerlach "s...@boreus.de"
> CMD python -m SimpleHTTPServer 80
> {code}
> The obtained stdout is:
> {noformat}
> Executing pre-exec command 
> '{"arguments":["mesos-containerizer","mount","--help=false","--operation=make-rslave","--path=\/"],"shell":false,"value":"\/usr\/libexec\/mesos\/mesos-containerizer"}'
> Executing pre-exec command 
> '{"arguments":["mount","-n","--rbind","\/var\/lib\/mesos\/slaves\/562f1892-41c6-4c55-955e-662c63ab3ddf-S8\/frameworks\/562f1892-41c6-4c55-955e-662c63ab3ddf-\/executors\/20150530_mesos13.5f8c3db1-4517-11e7-8d95-024276be29e0\/runs\/da1cb066-1354-42a6-82bc-eee688d61b94","\/var\/lib\/mesos\/provisioner\/containers\/da1cb066-1354-42a6-82bc-eee688d61b94\/backends\/overlay\/rootfses\/f74816b3-c115-4a5e-a51c-fbd502efd1fe\/mnt\/mesos\/sandbox"],"shell":false,"value":"mount"}'
> Received SUBSCRIBED event
> Subscribed executor on bp-mesos8.private.local
> Received LAUNCH event
> Starting task 20150530_mesos13.5f8c3db1-4517-11e7-8d95-024276be29e0
> /usr/libexec/mesos/mesos-containerizer launch --help="false" 
> --launch_info="{"command":{"arguments":["\/bin\/sh","-c","python -m 
> SimpleHTTPServer 
>