[jira] [Created] (MESOS-7894) Mesos Executor UI - Disk:Used Field isn't populated with Docker Container Runtime

2017-08-16 Thread Justin Lee (JIRA)
Justin Lee created MESOS-7894:
-

 Summary: Mesos Executor UI - Disk:Used Field isn't populated with 
Docker Container Runtime
 Key: MESOS-7894
 URL: https://issues.apache.org/jira/browse/MESOS-7894
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.2.2
 Environment: DC/OS 1.9.2 (CentOS 7.3, Docker 1.13.1, Mesos 1.2.2, 
Marathon 1.4.5)
Reporter: Justin Lee
Priority: Minor


If you use the Docker container runtime, the 'Disk' 'Used' field never gets 
populated in the Mesos UI (on the executor/task page).

Steps to Reproduce:
in DC/OS 1.9.2, deploy two apps:
{code:javascript}
{
  "id": "/dummy-disk-docker",
  "cmd": "dd if=/dev/zero of=$MESOS_SANDBOX/testfile bs=128M count=1; tail -f 
/dev/null",
  "instances": 1,
  "cpus": 0.1,
  "mem": 256,
  "disk": 150,
  "container": {
"type": "DOCKER",
"docker": {
  "image": "alpine"
}
  }
}
{code}

{code:javascript}
{
  "id": "/dummy-disk-ucr",
  "cmd": "dd if=/dev/zero of=$MESOS_SANDBOX/testfile bs=128M count=1; tail -f 
/dev/null",
  "instances": 1,
  "cpus": 0.1,
  "mem": 256,
  "disk": 150,
  "container": {
"type": "MESOS"
"docker": {
  "image": "alpine"
}
  }
}
{code}

Wait for them to deploy.

Then, navigate to the Mesos UI and go to the executor/task page for the two 
tasks.

On the UCR task, eventually the "Used Disk" field should populate with 128 MB 
(the size of the dummy file).
The same field on the Docker task will never get populated.

Both containers are writing to the same location on the agent filesystem 
(/var/lib/mesos/slave/slaves//frameworks//executors//runs/latest), but only 
one reports the data through the UI.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7744) Mesos Agent Sends TASK_KILL status update to Master, and still launches task

2017-08-16 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16128608#comment-16128608
 ] 

Alexander Rukletsov commented on MESOS-7744:


And dropping 1.1.3 target.

> Mesos Agent Sends TASK_KILL status update to Master, and still launches task
> 
>
> Key: MESOS-7744
> URL: https://issues.apache.org/jira/browse/MESOS-7744
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.1
>Reporter: Sargun Dhillon
>Assignee: Benjamin Mahler
>Priority: Critical
>  Labels: reliability
>
> We sometimes launch jobs, and cancel them in ~7 seconds, if we don't get a 
> TASK_STARTING back from the agent. Under certain conditions it can result in 
> Mesos losing track of the task. The chunk of the logs which is interesting is 
> here:
> {code}
> Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:26.951799  5171 slave.cpp:1495] Got assigned 
> task Titus-7590548-worker-0-4476 for framework TitusFramework
> Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:26.952251  5171 slave.cpp:1614] Launching task 
> Titus-7590548-worker-0-4476 for framework TitusFramework
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:37.484611  5171 slave.cpp:1853] Queuing task 
> ‘Titus-7590548-worker-0-4476’ for executor ‘docker-executor’ of framework 
> TitusFramework at executor(1)@100.66.11.10:17707
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:37.487876  5171 slave.cpp:2035] Asked to kill 
> task Titus-7590548-worker-0-4476 of framework TitusFramework
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:37.488994  5171 slave.cpp:3211] Handling 
> status update TASK_KILLED (UUID: 898215d6-a244-4dbe-bc9c-878a22d36ea4) for 
> task Titus-7590548-worker-0-4476 of framework TitusFramework from @0.0.0.0:0
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:37.490603  5171 slave.cpp:2005] Sending queued 
> task ‘Titus-7590548-worker-0-4476’ to executor ‘docker-executor’ of framework 
> TitusFramework at executor(1)@100.66.11.10:17707{
> {code}
> In our executor, we see that the launch message arrives after the master has 
> already gotten the kill update. We then send non-terminal state updates to 
> the agent, and yet it doesn't forward these to our framework. We're using a 
> custom executor which is based on the older mesos-go bindings. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7744) Mesos Agent Sends TASK_KILL status update to Master, and still launches task

2017-08-16 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-7744:
---
Target Version/s: 1.2.3, 1.3.2, 1.5.0, 1.4.1  (was: 1.1.3, 1.2.3, 1.3.2, 
1.5.0)

> Mesos Agent Sends TASK_KILL status update to Master, and still launches task
> 
>
> Key: MESOS-7744
> URL: https://issues.apache.org/jira/browse/MESOS-7744
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.1
>Reporter: Sargun Dhillon
>Assignee: Benjamin Mahler
>Priority: Critical
>  Labels: reliability
>
> We sometimes launch jobs, and cancel them in ~7 seconds, if we don't get a 
> TASK_STARTING back from the agent. Under certain conditions it can result in 
> Mesos losing track of the task. The chunk of the logs which is interesting is 
> here:
> {code}
> Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:26.951799  5171 slave.cpp:1495] Got assigned 
> task Titus-7590548-worker-0-4476 for framework TitusFramework
> Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:26.952251  5171 slave.cpp:1614] Launching task 
> Titus-7590548-worker-0-4476 for framework TitusFramework
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:37.484611  5171 slave.cpp:1853] Queuing task 
> ‘Titus-7590548-worker-0-4476’ for executor ‘docker-executor’ of framework 
> TitusFramework at executor(1)@100.66.11.10:17707
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:37.487876  5171 slave.cpp:2035] Asked to kill 
> task Titus-7590548-worker-0-4476 of framework TitusFramework
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:37.488994  5171 slave.cpp:3211] Handling 
> status update TASK_KILLED (UUID: 898215d6-a244-4dbe-bc9c-878a22d36ea4) for 
> task Titus-7590548-worker-0-4476 of framework TitusFramework from @0.0.0.0:0
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:37.490603  5171 slave.cpp:2005] Sending queued 
> task ‘Titus-7590548-worker-0-4476’ to executor ‘docker-executor’ of framework 
> TitusFramework at executor(1)@100.66.11.10:17707{
> {code}
> In our executor, we see that the launch message arrives after the master has 
> already gotten the kill update. We then send non-terminal state updates to 
> the agent, and yet it doesn't forward these to our framework. We're using a 
> custom executor which is based on the older mesos-go bindings. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7894) Mesos Executor UI - Disk:Used Field isn't populated with Docker Container Runtime

2017-08-16 Thread Justin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Lee updated MESOS-7894:
--
Attachment: UCR on left, Docker on right.png

> Mesos Executor UI - Disk:Used Field isn't populated with Docker Container 
> Runtime
> -
>
> Key: MESOS-7894
> URL: https://issues.apache.org/jira/browse/MESOS-7894
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.2
> Environment: DC/OS 1.9.2 (CentOS 7.3, Docker 1.13.1, Mesos 1.2.2, 
> Marathon 1.4.5)
>Reporter: Justin Lee
>Priority: Minor
> Attachments: UCR on left, Docker on right.png
>
>
> If you use the Docker container runtime, the 'Disk' 'Used' field never gets 
> populated in the Mesos UI (on the executor/task page).
> Steps to Reproduce:
> in DC/OS 1.9.2, deploy two apps:
> {code:javascript}
> {
>   "id": "/dummy-disk-docker",
>   "cmd": "dd if=/dev/zero of=$MESOS_SANDBOX/testfile bs=128M count=1; tail -f 
> /dev/null",
>   "instances": 1,
>   "cpus": 0.1,
>   "mem": 256,
>   "disk": 150,
>   "container": {
> "type": "DOCKER",
> "docker": {
>   "image": "alpine"
> }
>   }
> }
> {code}
> {code:javascript}
> {
>   "id": "/dummy-disk-ucr",
>   "cmd": "dd if=/dev/zero of=$MESOS_SANDBOX/testfile bs=128M count=1; tail -f 
> /dev/null",
>   "instances": 1,
>   "cpus": 0.1,
>   "mem": 256,
>   "disk": 150,
>   "container": {
> "type": "MESOS"
> "docker": {
>   "image": "alpine"
> }
>   }
> }
> {code}
> Wait for them to deploy.
> Then, navigate to the Mesos UI and go to the executor/task page for the two 
> tasks.
> On the UCR task, eventually the "Used Disk" field should populate with 128 MB 
> (the size of the dummy file).
> The same field on the Docker task will never get populated.
> Both containers are writing to the same location on the agent filesystem 
> (/var/lib/mesos/slave/slaves//frameworks//executors//runs/latest), but only 
> one reports the data through the UI.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-7896) Use std::error_code for reporting platform-dependent errors

2017-08-16 Thread Ilya Pronin (JIRA)
Ilya Pronin created MESOS-7896:
--

 Summary: Use std::error_code for reporting platform-dependent 
errors
 Key: MESOS-7896
 URL: https://issues.apache.org/jira/browse/MESOS-7896
 Project: Mesos
  Issue Type: Improvement
Reporter: Ilya Pronin
Priority: Minor


It may be useful to return an error code from various functions to be able to 
distinguish different kinds of errors, e.g. to be able to ignore {{ENOENT}} 
from {{unlink()}}. This can be achieved by returning {{Try}}, but this is not 
portable.

Since C++11, the STL has {{std::error_code}}, which hides a platform-dependent 
error code behind a portable error condition. We can use it for error 
reporting.
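
For illustration, a minimal sketch of the idea; the {{remove_file}} helper is 
made up for the example and is not an existing stout/Mesos API:

{code}
#include <cerrno>
#include <cstdio>
#include <iostream>
#include <system_error>

// Hypothetical helper: wraps the platform-dependent errno in a
// std::error_code so callers can compare against portable std::errc values.
std::error_code remove_file(const char* path)
{
  if (std::remove(path) != 0) {
    return std::error_code(errno, std::generic_category());
  }
  return std::error_code(); // Success.
}

int main()
{
  std::error_code ec = remove_file("/tmp/does-not-exist");

  // Portable check: matches ENOENT without mentioning the raw value.
  if (ec && ec != std::errc::no_such_file_or_directory) {
    std::cerr << "remove failed: " << ec.message() << std::endl;
    return 1;
  }

  return 0; // ENOENT (or success) is silently ignored.
}
{code}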



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-7791) subprocess' childMain using ABORT when encountering user errors

2017-08-16 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-7791:


Assignee: Andrei Budnik

> subprocess' childMain using ABORT when encountering user errors
> ---
>
> Key: MESOS-7791
> URL: https://issues.apache.org/jira/browse/MESOS-7791
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 1.4.0
>Reporter: Benjamin Bannier
>Assignee: Andrei Budnik
>  Labels: mesosphere, tech-debt
>
> In {{process/posix/subprocess.hpp}}'s {{childMain}} we exit with {{ABORT}} 
> when there was a user error,
> {noformat}
> ABORT: 
> (/pkg/src/mesos/3rdparty/libprocess/include/process/posix/subprocess.hpp:195):
>  Failed to os::execvpe on path '/SOME/PATH': Argument list too long
> {noformat}
> Here we abort instead of simply {{_exit}}'ing and letting the user know that 
> we couldn't deal with the given arguments.
> Abort can potentially dump core, and since this abort happens before the 
> {{execvpe}}, the process image can potentially be large (e.g., >300 MB), which 
> could quickly fill up a lot of disk space.
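
For illustration, a minimal sketch of the suggested behavior; {{childMainSketch}} 
is made up for the example and is not the actual libprocess code (a real child 
would also restrict itself to async-signal-safe calls):

{code}
#ifndef _GNU_SOURCE
#define _GNU_SOURCE // execvpe() is a GNU extension.
#endif

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

// Child-side error handling sketch: report the user error on stderr
// (which ends up in the sandbox logs) and _exit() without dumping core.
void childMainSketch(char** argv, char** envp)
{
  ::execvpe(argv[0], argv, envp);

  // Only reached if execvpe() failed, e.g. E2BIG ("Argument list too long").
  ::dprintf(STDERR_FILENO,
            "Failed to execvpe on path '%s': %s\n",
            argv[0],
            strerror(errno));

  ::_exit(EXIT_FAILURE); // No core dump, no large process image on disk.
}
{code}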



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7895) ZK session timeout is unconfigurable in agent and scheduler drivers

2017-08-16 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16128827#comment-16128827
 ] 

Ilya Pronin commented on MESOS-7895:


[~vinodkone] can you shepherd this please?

Review requests for agents:
https://reviews.apache.org/r/61689/
https://reviews.apache.org/r/61690/

> ZK session timeout is unconfigurable in agent and scheduler drivers
> ---
>
> Key: MESOS-7895
> URL: https://issues.apache.org/jira/browse/MESOS-7895
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 1.3.0
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>Priority: Minor
>
> {{ZooKeeperMasterDetector}} in agents and scheduler drivers uses the default 
> ZK session timeout (10 secs). This timeout may have to be increased to cope 
> with long ZK upgrades or ZK GC pauses (with local ZK sessions these can cause 
> lots of {{TASK_LOST}}, because sessions expire on disconnection after 
> {{session_timeout * 2 / 3}}).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7894) Mesos Executor UI - Disk:Used Field isn't populated with Docker Container Runtime

2017-08-16 Thread Justin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Lee updated MESOS-7894:
--
Attachment: UCR on left, Docker on right..png

> Mesos Executor UI - Disk:Used Field isn't populated with Docker Container 
> Runtime
> -
>
> Key: MESOS-7894
> URL: https://issues.apache.org/jira/browse/MESOS-7894
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.2
> Environment: DC/OS 1.9.2 (CentOS 7.3, Docker 1.13.1, Mesos 1.2.2, 
> Marathon 1.4.5)
>Reporter: Justin Lee
>Priority: Minor
> Attachments: UCR on left, Docker on right.png
>
>
> If you use the Docker container runtime, the 'Disk' 'Used' field never gets 
> populated in the Mesos UI (on the executor/task page).
> Steps to Reproduce:
> in DC/OS 1.9.2, deploy two apps:
> {code:javascript}
> {
>   "id": "/dummy-disk-docker",
>   "cmd": "dd if=/dev/zero of=$MESOS_SANDBOX/testfile bs=128M count=1; tail -f 
> /dev/null",
>   "instances": 1,
>   "cpus": 0.1,
>   "mem": 256,
>   "disk": 150,
>   "container": {
> "type": "DOCKER",
> "docker": {
>   "image": "alpine"
> }
>   }
> }
> {code}
> {code:javascript}
> {
>   "id": "/dummy-disk-ucr",
>   "cmd": "dd if=/dev/zero of=$MESOS_SANDBOX/testfile bs=128M count=1; tail -f 
> /dev/null",
>   "instances": 1,
>   "cpus": 0.1,
>   "mem": 256,
>   "disk": 150,
>   "container": {
> "type": "MESOS"
> "docker": {
>   "image": "alpine"
> }
>   }
> }
> {code}
> Wait for them to deploy.
> Then, navigate to the Mesos UI and go to the executor/task page for the two 
> tasks.
> On the UCR task, eventually the "Used Disk" field should populate with 128 MB 
> (the size of the dummy file).
> The same field on the Docker task will never get populated.
> Both containers are writing to the same location on the agent filesystem 
> (/var/lib/mesos/slave/slaves//frameworks//executors//runs/latest), but only 
> one reports the data through the UI.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7894) Mesos Executor UI - Disk:Used Field isn't populated with Docker Container Runtime

2017-08-16 Thread Justin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Lee updated MESOS-7894:
--
Description: 
If you use the Docker container runtime, the 'Disk' 'Used' field never gets 
populated in the Mesos UI (on the executor/task page).

Steps to Reproduce:
in DC/OS 1.9.2, deploy two apps:
{code:javascript}
{
  "id": "/dummy-disk-docker",
  "cmd": "dd if=/dev/zero of=$MESOS_SANDBOX/testfile bs=128M count=1; tail -f 
/dev/null",
  "instances": 1,
  "cpus": 0.1,
  "mem": 256,
  "disk": 150,
  "container": {
"type": "DOCKER",
"docker": {
  "image": "alpine"
}
  }
}
{code}

{code:javascript}
{
  "id": "/dummy-disk-ucr",
  "cmd": "dd if=/dev/zero of=$MESOS_SANDBOX/testfile bs=128M count=1; tail -f 
/dev/null",
  "instances": 1,
  "cpus": 0.1,
  "mem": 256,
  "disk": 150,
  "container": {
"type": "MESOS"
"docker": {
  "image": "alpine"
}
  }
}
{code}

Wait for them to deploy.

Then, navigate to the Mesos UI and go to the executor/task page for the two 
tasks.

On the UCR task, eventually the "Used Disk" field should populate with 128 MB 
(the size of the dummy file).
The same field on the Docker task will never get populated.

Both containers are writing to the same location on the agent filesystem 
({{/var/lib/mesos/slave/slaves//frameworks//executors//runs/latest}}),
 but only one reports the data through the UI.


  was:
If you use the Docker container runtime, the 'Disk' 'Used' field never gets 
populated in the Mesos UI (on the executor/task page).

Steps to Reproduce:
in DC/OS 1.9.2, deploy two apps:
{code:javascript}
{
  "id": "/dummy-disk-docker",
  "cmd": "dd if=/dev/zero of=$MESOS_SANDBOX/testfile bs=128M count=1; tail -f 
/dev/null",
  "instances": 1,
  "cpus": 0.1,
  "mem": 256,
  "disk": 150,
  "container": {
"type": "DOCKER",
"docker": {
  "image": "alpine"
}
  }
}
{code}

{code:javascript}
{
  "id": "/dummy-disk-ucr",
  "cmd": "dd if=/dev/zero of=$MESOS_SANDBOX/testfile bs=128M count=1; tail -f 
/dev/null",
  "instances": 1,
  "cpus": 0.1,
  "mem": 256,
  "disk": 150,
  "container": {
"type": "MESOS"
"docker": {
  "image": "alpine"
}
  }
}
{code}

Wait for them the deploy.  

Then, navigate to the mesos UI, and go to the executor/task page for the two 
tasks.

On the UCR task, eventually the "Used Disk" field should populate with 128 MB 
(the size of the dummy file).
The same field on the Docker task will never get populated.

Both containers are writing to the same location on the agent filesystem 
(/var/lib/mesos/slave/slaves//frameworks//executors//runs/latest,
 but only one reports the data through the UI.



> Mesos Executor UI - Disk:Used Field isn't populated with Docker Container 
> Runtime
> -
>
> Key: MESOS-7894
> URL: https://issues.apache.org/jira/browse/MESOS-7894
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.2
> Environment: DC/OS 1.9.2 (CentOS 7.3, Docker 1.13.1, Mesos 1.2.2, 
> Marathon 1.4.5)
>Reporter: Justin Lee
>Priority: Minor
> Attachments: UCR on left, Docker on right.png
>
>
> If you use the Docker container runtime, the 'Disk' 'Used' field never gets 
> populated in the Mesos UI (on the executor/task page).
> Steps to Reproduce:
> in DC/OS 1.9.2, deploy two apps:
> {code:javascript}
> {
>   "id": "/dummy-disk-docker",
>   "cmd": "dd if=/dev/zero of=$MESOS_SANDBOX/testfile bs=128M count=1; tail -f 
> /dev/null",
>   "instances": 1,
>   "cpus": 0.1,
>   "mem": 256,
>   "disk": 150,
>   "container": {
> "type": "DOCKER",
> "docker": {
>   "image": "alpine"
> }
>   }
> }
> {code}
> {code:javascript}
> {
>   "id": "/dummy-disk-ucr",
>   "cmd": "dd if=/dev/zero of=$MESOS_SANDBOX/testfile bs=128M count=1; tail -f 
> /dev/null",
>   "instances": 1,
>   "cpus": 0.1,
>   "mem": 256,
>   "disk": 150,
>   "container": {
> "type": "MESOS"
> "docker": {
>   "image": "alpine"
> }
>   }
> }
> {code}
> Wait for them to deploy.
> Then, navigate to the Mesos UI and go to the executor/task page for the two 
> tasks.
> On the UCR task, eventually the "Used Disk" field should populate with 128 MB 
> (the size of the dummy file).
> The same field on the Docker task will never get populated.
> Both containers are writing to the same location on the agent filesystem 
> ({{/var/lib/mesos/slave/slaves//frameworks//executors//runs/latest}}),
>  but only one reports the data through the UI.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7894) Mesos Executor UI - Disk:Used Field isn't populated with Docker Container Runtime

2017-08-16 Thread Justin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Lee updated MESOS-7894:
--
Attachment: (was: UCR on left, Docker on right..png)

> Mesos Executor UI - Disk:Used Field isn't populated with Docker Container 
> Runtime
> -
>
> Key: MESOS-7894
> URL: https://issues.apache.org/jira/browse/MESOS-7894
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.2
> Environment: DC/OS 1.9.2 (CentOS 7.3, Docker 1.13.1, Mesos 1.2.2, 
> Marathon 1.4.5)
>Reporter: Justin Lee
>Priority: Minor
> Attachments: UCR on left, Docker on right.png
>
>
> If you use the Docker container runtime, the 'Disk' 'Used' field never gets 
> populated in the Mesos UI (on the executor/task page).
> Steps to Reproduce:
> in DC/OS 1.9.2, deploy two apps:
> {code:javascript}
> {
>   "id": "/dummy-disk-docker",
>   "cmd": "dd if=/dev/zero of=$MESOS_SANDBOX/testfile bs=128M count=1; tail -f 
> /dev/null",
>   "instances": 1,
>   "cpus": 0.1,
>   "mem": 256,
>   "disk": 150,
>   "container": {
> "type": "DOCKER",
> "docker": {
>   "image": "alpine"
> }
>   }
> }
> {code}
> {code:javascript}
> {
>   "id": "/dummy-disk-ucr",
>   "cmd": "dd if=/dev/zero of=$MESOS_SANDBOX/testfile bs=128M count=1; tail -f 
> /dev/null",
>   "instances": 1,
>   "cpus": 0.1,
>   "mem": 256,
>   "disk": 150,
>   "container": {
> "type": "MESOS"
> "docker": {
>   "image": "alpine"
> }
>   }
> }
> {code}
> Wait for them to deploy.
> Then, navigate to the Mesos UI and go to the executor/task page for the two 
> tasks.
> On the UCR task, eventually the "Used Disk" field should populate with 128 MB 
> (the size of the dummy file).
> The same field on the Docker task will never get populated.
> Both containers are writing to the same location on the agent filesystem 
> (/var/lib/mesos/slave/slaves//frameworks//executors//runs/latest), but only 
> one reports the data through the UI.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-7895) ZK session timeout is unconfigurable in agent and scheduler drivers

2017-08-16 Thread Ilya Pronin (JIRA)
Ilya Pronin created MESOS-7895:
--

 Summary: ZK session timeout is unconfigurable in agent and 
scheduler drivers
 Key: MESOS-7895
 URL: https://issues.apache.org/jira/browse/MESOS-7895
 Project: Mesos
  Issue Type: Improvement
Affects Versions: 1.3.0
Reporter: Ilya Pronin
Assignee: Ilya Pronin
Priority: Minor


{{ZooKeeperMasterDetector}} in agents and scheduler drivers uses the default ZK 
session timeout (10 secs). This timeout may have to be increased to cope with 
long ZK upgrades or ZK GC pauses (with local ZK sessions these can cause lots 
of {{TASK_LOST}}, because sessions expire on disconnection after 
{{session_timeout * 2 / 3}}).
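
For example, with the default 10-second timeout and local ZK sessions, a 
disconnected session is treated as expired after roughly 10 * 2 / 3 ≈ 6.7 
seconds, a window that a long ZK upgrade or GC pause can easily exceed.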



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7894) Mesos UI - Disk:Used Field isn't populated with Docker Container Runtime.

2017-08-16 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16128868#comment-16128868
 ] 

Joseph Wu commented on MESOS-7894:
--

This is due to how the disk resource field is populated on the agent's resource 
statistics endpoint.

The MesosContainerizer (UCR) will fetch usage metrics from each isolator (which 
includes a {{disk/du}} isolator reporting disk usage) and then merge these 
results together, giving you the 128 MB you expect.

The DockerContainerizer just reports some cgroups stats, which do not include 
disk info.

> Mesos UI - Disk:Used Field isn't populated with Docker Container Runtime.
> -
>
> Key: MESOS-7894
> URL: https://issues.apache.org/jira/browse/MESOS-7894
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.2
> Environment: DC/OS 1.9.2 (CentOS 7.3, Docker 1.13.1, Mesos 1.2.2, 
> Marathon 1.4.5)
>Reporter: Justin Lee
>Priority: Minor
> Attachments: UCR on left, Docker on right.png
>
>
> If you use the Docker container runtime, the 'Disk' 'Used' field never gets 
> populated in the Mesos UI (on the executor/task page).
> Steps to Reproduce:
> in DC/OS 1.9.2, deploy two apps:
> {code:javascript}
> {
>   "id": "/dummy-disk-docker",
>   "cmd": "dd if=/dev/zero of=$MESOS_SANDBOX/testfile bs=128M count=1; tail -f 
> /dev/null",
>   "instances": 1,
>   "cpus": 0.1,
>   "mem": 256,
>   "disk": 150,
>   "container": {
> "type": "DOCKER",
> "docker": {
>   "image": "alpine"
> }
>   }
> }
> {code}
> {code:javascript}
> {
>   "id": "/dummy-disk-ucr",
>   "cmd": "dd if=/dev/zero of=$MESOS_SANDBOX/testfile bs=128M count=1; tail -f 
> /dev/null",
>   "instances": 1,
>   "cpus": 0.1,
>   "mem": 256,
>   "disk": 150,
>   "container": {
> "type": "MESOS"
> "docker": {
>   "image": "alpine"
> }
>   }
> }
> {code}
> Wait for them to deploy.
> Then, navigate to the Mesos UI and go to the executor/task page for the two 
> tasks.
> On the UCR task, eventually the "Used Disk" field should populate with 128 MB 
> (the size of the dummy file).
> The same field on the Docker task will never get populated.
> Both containers are writing to the same location on the agent filesystem 
> ({{/var/lib/mesos/slave/slaves//frameworks//executors//runs/latest}}),
>  but only one reports the data through the UI.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7895) ZK session timeout is unconfigurable in agent and scheduler drivers

2017-08-16 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16128997#comment-16128997
 ] 

Vinod Kone commented on MESOS-7895:
---

Sorry, I do not have time :( Maybe [~xujyan] can?

> ZK session timeout is unconfigurable in agent and scheduler drivers
> ---
>
> Key: MESOS-7895
> URL: https://issues.apache.org/jira/browse/MESOS-7895
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 1.3.0
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>Priority: Minor
>
> {{ZooKeeperMasterDetector}} in agents and scheduler drivers uses the default 
> ZK session timeout (10 secs). This timeout may have to be increased to cope 
> with long ZK upgrades or ZK GC pauses (with local ZK sessions these can cause 
> lots of {{TASK_LOST}}, because sessions expire on disconnection after 
> {{session_timeout * 2 / 3}}).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7714) Fix agent downgrade for reservation refinement

2017-08-16 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129728#comment-16129728
 ] 

Adam B commented on MESOS-7714:
---

Downgrading from blocker because [~mcypark] says "I think we’ll have to ship 
without it".
Please retarget to 1.4.1 and/or 1.5.0 so [~karya] and [~anandmazumdar] can cut 
1.4.0-rc1.

> Fix agent downgrade for reservation refinement
> --
>
> Key: MESOS-7714
> URL: https://issues.apache.org/jira/browse/MESOS-7714
> Project: Mesos
>  Issue Type: Bug
>Reporter: Michael Park
>Assignee: Michael Park
>Priority: Critical
>
> The agent code only partially supports downgrading of an agent correctly.
> The checkpointed resources are done correctly, but the resources within
> the {{SlaveInfo}} message as well as tasks and executors also need to be 
> downgraded
> correctly and converted back on recovery.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7100) Missing AGENT_REMOVED event in event stream

2017-08-16 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-7100:
--
Target Version/s:   (was: 1.4.0)

> Missing AGENT_REMOVED event in event stream
> ---
>
> Key: MESOS-7100
> URL: https://issues.apache.org/jira/browse/MESOS-7100
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Affects Versions: 1.1.0
>Reporter: Haralds Ulmanis
>Priority: Minor
>  Labels: mesosphere
>
> I'm playing with the event stream via the HTTP endpoints.
> I get all events (SUBSCRIBED, TASK_ADDED, TASK_UPDATED, AGENT_ADDED) 
> except AGENT_REMOVED.
> What I do:
> Stop the agent or terminate the server (if in the cloud).
> What I expect: 
> Once it disappears from the agent list (in the Mesos UI), to get an 
> AGENT_REMOVED event.
> I'm not sure about the internals; maybe that is not the right event and agents 
> are only removed after some period if they do not come back up. But in general, 
> some event indicating that an agent went offline and is not available would be 
> good.
> If AGENT_REMOVED and AGENT_ADDED are one-time events, then something 
> like AGENT_CONNECTED/RECONNECTED and AGENT_LEAVE events would be great.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7707) Compile with newer Boost versions

2017-08-16 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-7707:
--
Fix Version/s: (was: 1.4.0)

> Compile with newer Boost versions
> -
>
> Key: MESOS-7707
> URL: https://issues.apache.org/jira/browse/MESOS-7707
> Project: Mesos
>  Issue Type: Bug
>  Components: c++ api
>Affects Versions: 1.4.0
>Reporter: David Carlier
>Priority: Minor
>
> Hello, this is my first look at the Mesos codebase. I tried to compile with the 
> Boost 1.6.4 shipped with the system, hence
> https://reviews.apache.org/r/60356/
> Hope it's useful.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7707) Compile with newer Boost versions

2017-08-16 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-7707:
--
Affects Version/s: 1.4.0

> Compile with newer Boost versions
> -
>
> Key: MESOS-7707
> URL: https://issues.apache.org/jira/browse/MESOS-7707
> Project: Mesos
>  Issue Type: Bug
>  Components: c++ api
>Affects Versions: 1.4.0
>Reporter: David Carlier
>Priority: Minor
>
> Hello, this is my first look at the Mesos codebase. I tried to compile with the 
> Boost 1.6.4 shipped with the system, hence
> https://reviews.apache.org/r/60356/
> Hope it's useful.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7707) Compile with newer Boost versions

2017-08-16 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-7707:
--
Target Version/s:   (was: 1.4.0)

> Compile with newer Boost versions
> -
>
> Key: MESOS-7707
> URL: https://issues.apache.org/jira/browse/MESOS-7707
> Project: Mesos
>  Issue Type: Bug
>  Components: c++ api
>Affects Versions: 1.4.0
>Reporter: David Carlier
>Priority: Minor
>
> Hello, this is my first look at the Mesos codebase. I tried to compile with the 
> Boost 1.6.4 shipped with the system, hence
> https://reviews.apache.org/r/60356/
> Hope it's useful.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7714) Fix agent downgrade for reservation refinement

2017-08-16 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-7714:
--
Priority: Critical  (was: Blocker)

> Fix agent downgrade for reservation refinement
> --
>
> Key: MESOS-7714
> URL: https://issues.apache.org/jira/browse/MESOS-7714
> Project: Mesos
>  Issue Type: Bug
>Reporter: Michael Park
>Assignee: Michael Park
>Priority: Critical
>
> The agent code only partially supports downgrading of an agent correctly.
> The checkpointed resources are done correctly, but the resources within
> the {{SlaveInfo}} message as well as tasks and executors also need to be 
> downgraded
> correctly and converted back on recovery.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-6641) Remove deprecated hooks from our module API.

2017-08-16 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129735#comment-16129735
 ] 

Adam B commented on MESOS-6641:
---

Removing target version until we have "Accepted" this ticket, and 
scoped/prioritized/assigned it for a release.

> Remove deprecated hooks from our module API.
> 
>
> Key: MESOS-6641
> URL: https://issues.apache.org/jira/browse/MESOS-6641
> Project: Mesos
>  Issue Type: Improvement
>  Components: modules
>Reporter: Till Toenshoff
>Priority: Minor
>  Labels: deprecation, hooks, tech-debt
>
> By now we have at least one deprecated hook in our modules API which is 
> {{slavePreLaunchDockerHook}}. 
> There is a new one coming in now which is deprecating 
> {{slavePreLaunchDockerEnvironmentDecorator}}.
> We need to actually remove those deprecations while making the community 
> aware - this ticket is meant for tracking this.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7100) Missing AGENT_REMOVED event in event stream

2017-08-16 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129732#comment-16129732
 ] 

Adam B commented on MESOS-7100:
---

Removing target version until we have "Accepted" this ticket, and 
scoped/prioritized/assigned it for a release.

> Missing AGENT_REMOVED event in event stream
> ---
>
> Key: MESOS-7100
> URL: https://issues.apache.org/jira/browse/MESOS-7100
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Affects Versions: 1.1.0
>Reporter: Haralds Ulmanis
>Priority: Minor
>  Labels: mesosphere
>
> I'm playing with the event stream via the HTTP endpoints.
> I get all events (SUBSCRIBED, TASK_ADDED, TASK_UPDATED, AGENT_ADDED) 
> except AGENT_REMOVED.
> What I do:
> Stop the agent or terminate the server (if in the cloud).
> What I expect: 
> Once it disappears from the agent list (in the Mesos UI), to get an 
> AGENT_REMOVED event.
> I'm not sure about the internals; maybe that is not the right event and agents 
> are only removed after some period if they do not come back up. But in general, 
> some event indicating that an agent went offline and is not available would be 
> good.
> If AGENT_REMOVED and AGENT_ADDED are one-time events, then something 
> like AGENT_CONNECTED/RECONNECTED and AGENT_LEAVE events would be great.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7233) Use varint comparator in replica log

2017-08-16 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129734#comment-16129734
 ] 

Adam B commented on MESOS-7233:
---

Removing target version until we have "Accepted" this ticket, and 
scoped/prioritized it for a release.
It appears there is concern about the approach taken in the 5-month-old review.

> Use varint comparator in replica log
> 
>
> Key: MESOS-7233
> URL: https://issues.apache.org/jira/browse/MESOS-7233
> Project: Mesos
>  Issue Type: Improvement
>  Components: replicated log
>Reporter: Tomasz Janiszewski
>Assignee: Tomasz Janiszewski
>Priority: Minor
>
> The bug discussed at
> https://groups.google.com/forum/#\!topic/leveldb/F-rDkWiQm6c
> has been fixed. After upgrading leveldb to 1.19, we could replace the
> default byte-wise comparator with a varint comparator.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-6641) Remove deprecated hooks from our module API.

2017-08-16 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-6641:
--
Target Version/s:   (was: 1.4.0)

> Remove deprecated hooks from our module API.
> 
>
> Key: MESOS-6641
> URL: https://issues.apache.org/jira/browse/MESOS-6641
> Project: Mesos
>  Issue Type: Improvement
>  Components: modules
>Reporter: Till Toenshoff
>Priority: Minor
>  Labels: deprecation, hooks, tech-debt
>
> By now we have at least one deprecated hook in our modules API which is 
> {{slavePreLaunchDockerHook}}. 
> There is a new one coming in now which is deprecating 
> {{slavePreLaunchDockerEnvironmentDecorator}}.
> We need to actually remove those deprecations while making the community 
> aware - this ticket is meant for tracking this.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7233) Use varint comparator in replica log

2017-08-16 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-7233:
--
Target Version/s:   (was: 1.4.0)

> Use varint comparator in replica log
> 
>
> Key: MESOS-7233
> URL: https://issues.apache.org/jira/browse/MESOS-7233
> Project: Mesos
>  Issue Type: Improvement
>  Components: replicated log
>Reporter: Tomasz Janiszewski
>Assignee: Tomasz Janiszewski
>Priority: Minor
>
> The bug discussed at
> https://groups.google.com/forum/#\!topic/leveldb/F-rDkWiQm6c
> has been fixed. After upgrading leveldb to 1.19, we could replace the
> default byte-wise comparator with a varint comparator.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7007) filesystem/shared and --default_container_info broken since 1.1

2017-08-16 Thread R.B. Boyer (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129526#comment-16129526
 ] 

R.B. Boyer commented on MESOS-7007:
---

Also, from my parsing of the master branch today 
(`git:3587ca14c81c61949239698539df3e67774c9071`), the affected code has moved to 
`src/slave/slave.cpp` and still has the same underlying bug:

{code}
  // Add the default container info to the executor info.
  // TODO(jieyu): Rename the flag to be default_mesos_container_info.
  if (!executorInfo_.has_container() &&
      flags.default_container_info.isSome()) {
    executorInfo_.mutable_container()->CopyFrom(
        flags.default_container_info.get());
  }

  // Bundle all the container launch fields together.
  ContainerConfig containerConfig;
  containerConfig.mutable_executor_info()->CopyFrom(executorInfo_);
  containerConfig.mutable_command_info()->CopyFrom(executorInfo_.command());
  containerConfig.mutable_resources()->CopyFrom(executorInfo_.resources());
  containerConfig.set_directory(executor->directory);

  if (executor->user.isSome()) {
    containerConfig.set_user(executor->user.get());
  }

  if (executor->isCommandExecutor()) {
    if (taskInfo.isSome()) {
      containerConfig.mutable_task_info()->CopyFrom(taskInfo.get());

      if (taskInfo.get().has_container()) {
        containerConfig.mutable_container_info()
          ->CopyFrom(taskInfo.get().container());
      }
    }
  } else {
    if (executorInfo_.has_container()) {
      containerConfig.mutable_container_info()
        ->CopyFrom(executorInfo_.container());
    }
  }
{code}

> filesystem/shared and --default_container_info broken since 1.1
> ---
>
> Key: MESOS-7007
> URL: https://issues.apache.org/jira/browse/MESOS-7007
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Pierre Cheynier
>Assignee: Chun-Hung Hsiao
>  Labels: storage
>
> I face this issue, which prevents me from upgrading to 1.1.0 (and the change 
> was consequently introduced in this version):
> I'm using default_container_info to mount a /tmp volume in the container's 
> mount namespace from its current sandbox, meaning that each container have a 
> dedicated /tmp, thanks to the {{filesystem/shared}} isolator.
> I noticed through our automation pipeline that integration tests were failing, 
> and found that this is because the contents of /tmp (the one from the host!) 
> are trashed each time a container is created.
> Here is my setup: 
> * 
> {{--isolation='cgroups/cpu,cgroups/mem,namespaces/pid,*disk/du,filesystem/shared,filesystem/linux*,docker/runtime'}}
> * 
> {{--default_container_info='\{"type":"MESOS","volumes":\[\{"host_path":"tmp","container_path":"/tmp","mode":"RW"\}\]\}'}}
> I discovered this issue in the early days of 1.1 (end of November; I spoke with 
> someone on Slack), but unfortunately had no time to dig into the symptoms 
> further.
> I found nothing interesting even using GLOGv=3.
> Maybe it's a bad usage of isolators that triggers this issue? If that's the 
> case, then at least a documentation update should be done.
> Let me know if more information is needed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7705) Reconsider restricting the resource format for frameworks.

2017-08-16 Thread Michael Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-7705:

Priority: Major  (was: Blocker)

> Reconsider restricting the resource format for frameworks.
> --
>
> Key: MESOS-7705
> URL: https://issues.apache.org/jira/browse/MESOS-7705
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Michael Park
>Assignee: Michael Park
>
> We output the "endpoint" format through the endpoints
> for backward compatibility with external tooling. A framework should be
> able to use the result of an endpoint and pass it back to Mesos,
> since the result was produced by Mesos. This is especially applicable
> to the V1 API. We also allow the "pre-reservation-refinement" format
> because existing "resources files" are written in that format, and
> they should still be usable without modification.
> This is probably too flexible, however, since a framework without
> a RESERVATION_REFINEMENT capability could make refined reservations
> using the "post-reservation-refinement" format, although they wouldn't be
> offered such resources. It still seems undesirable if anyone were to
> run into it, and we should consider adding sensible restrictions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7705) Reconsider restricting the resource format for frameworks.

2017-08-16 Thread Michael Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-7705:

Target Version/s: 1.5.0, 1.4.1  (was: 1.4.0)

> Reconsider restricting the resource format for frameworks.
> --
>
> Key: MESOS-7705
> URL: https://issues.apache.org/jira/browse/MESOS-7705
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Michael Park
>Assignee: Michael Park
>Priority: Blocker
>
> We output the "endpoint" format through the endpoints
> for backward compatibility with external tooling. A framework should be
> able to use the result of an endpoint and pass it back to Mesos,
> since the result was produced by Mesos. This is especially applicable
> to the V1 API. We also allow the "pre-reservation-refinement" format
> because existing "resources files" are written in that format, and
> they should still be usable without modification.
> This is probably too flexible, however, since a framework without
> a RESERVATION_REFINEMENT capability could make refined reservations
> using the "post-reservation-refinement" format, although they wouldn't be
> offered such resources. It still seems undesirable if anyone were to
> run into it, and we should consider adding sensible restrictions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7865) Agent may process a kill task and still launch the task.

2017-08-16 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7865:
---
Priority: Critical  (was: Blocker)

> Agent may process a kill task and still launch the task.
> 
>
> Key: MESOS-7865
> URL: https://issues.apache.org/jira/browse/MESOS-7865
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Critical
>
> Based on the investigation of MESOS-7744, the agent has a race in which 
> "queued" tasks can still be launched after the agent has processed a kill 
> task for them. This race was introduced when {{Slave::statusUpdate}} was made 
> asynchronous:
> (1) {{Slave::__run}} completes, task is now within {{Executor::queuedTasks}}
> (2) {{Slave::killTask}} locates the executor based on the task ID residing in 
> queuedTasks, calls {{Slave::statusUpdate()}} with {{TASK_KILLED}}
> (3) {{Slave::___run}} assumes that killed tasks have been removed from 
> {{Executor::queuedTasks}}, but this now occurs asynchronously in 
> {{Slave::_statusUpdate}}. So, the executor still sees the queued task and 
> delivers it and adds the task to {{Executor::launchedTasks}}.
> (4) {{Slave::_statusUpdate}} runs, removes the task from 
> {{Executor::launchedTasks}} and adds it to {{Executor::terminatedTasks}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7783) Framework might not receive status update when a just launched task is killed immediately

2017-08-16 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7783:
---
Priority: Critical  (was: Blocker)

> Framework might not receive status update when a just launched task is killed 
> immediately
> -
>
> Key: MESOS-7783
> URL: https://issues.apache.org/jira/browse/MESOS-7783
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.2.0
>Reporter: Benjamin Bannier
>Assignee: Benjamin Mahler
>Priority: Critical
>  Labels: reliability
> Attachments: GroupDeployIntegrationTest.log.zip, logs
>
>
> Our Marathon team is seeing issues in their integration test suite where 
> Marathon gets stuck in an infinite loop trying to kill a just-launched task. 
> In their test a task is launched and then immediately killed; the framework 
> does not, e.g., wait for any task status update.
> In this case the launch and kill messages arrive at the agent in the correct 
> order, but neither the launch nor the kill path in the agent reaches the point 
> where a status update is sent to the framework. Since the framework has seen 
> no status update for the task, it re-triggers the kill, causing an infinite loop.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7863) Agent may drop pending kill task status updates.

2017-08-16 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7863:
---
Priority: Critical  (was: Blocker)

> Agent may drop pending kill task status updates.
> 
>
> Key: MESOS-7863
> URL: https://issues.apache.org/jira/browse/MESOS-7863
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Critical
>
> Currently there is an assumption that when a pending task is killed, the 
> framework will still be stored in the agent. However, this assumption can be 
> violated in two cases:
> # Another pending task was killed and we removed the framework in 
> 'Slave::run' thinking it was idle, because pending tasks were empty (we 
> remove from pending tasks when processing the kill). (MESOS-7783 is an 
> example instance of this).
> # The last executor terminated without tasks to send terminal updates for, or 
> the last terminated executor received its last acknowledgement. At this 
> point, we remove the framework thinking there were no pending tasks if the 
> task was killed (removed from pending).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7744) Mesos Agent Sends TASK_KILL status update to Master, and still launches task

2017-08-16 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7744:
---
Priority: Critical  (was: Blocker)

> Mesos Agent Sends TASK_KILL status update to Master, and still launches task
> 
>
> Key: MESOS-7744
> URL: https://issues.apache.org/jira/browse/MESOS-7744
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.1
>Reporter: Sargun Dhillon
>Assignee: Benjamin Mahler
>Priority: Critical
>  Labels: reliability
>
> We sometimes launch jobs, and cancel them in ~7 seconds, if we don't get a 
> TASK_STARTING back from the agent. Under certain conditions it can result in 
> Mesos losing track of the task. The chunk of the logs which is interesting is 
> here:
> {code}
> Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:26.951799  5171 slave.cpp:1495] Got assigned 
> task Titus-7590548-worker-0-4476 for framework TitusFramework
> Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:26.952251  5171 slave.cpp:1614] Launching task 
> Titus-7590548-worker-0-4476 for framework TitusFramework
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:37.484611  5171 slave.cpp:1853] Queuing task 
> ‘Titus-7590548-worker-0-4476’ for executor ‘docker-executor’ of framework 
> TitusFramework at executor(1)@100.66.11.10:17707
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:37.487876  5171 slave.cpp:2035] Asked to kill 
> task Titus-7590548-worker-0-4476 of framework TitusFramework
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:37.488994  5171 slave.cpp:3211] Handling 
> status update TASK_KILLED (UUID: 898215d6-a244-4dbe-bc9c-878a22d36ea4) for 
> task Titus-7590548-worker-0-4476 of framework TitusFramework from @0.0.0.0:0
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:37.490603  5171 slave.cpp:2005] Sending queued 
> task ‘Titus-7590548-worker-0-4476’ to executor ‘docker-executor’ of framework 
> TitusFramework at executor(1)@100.66.11.10:17707{
> {code}
> In our executor, we see that the launch message arrives after the master has 
> already gotten the kill update. We then send non-terminal state updates to 
> the agent, and yet it doesn't forward these to our framework. We're using a 
> custom executor which is based on the older mesos-go bindings. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-7897) Implement linearizable concurrent queue to use as event queue in libprocess

2017-08-16 Thread Dario Rexin (JIRA)
Dario Rexin created MESOS-7897:
--

 Summary: Implement linearizable concurrent queue to use as event 
queue in libprocess
 Key: MESOS-7897
 URL: https://issues.apache.org/jira/browse/MESOS-7897
 Project: Mesos
  Issue Type: Improvement
  Components: libprocess
Reporter: Dario Rexin
Assignee: Dario Rexin


We currently use the moodycamel ConcurrentQueue as the backing queue for the 
lock-free mailboxes in libprocess. Even though this queue is extremely fast, 
it's not strictly linearizable, which is why there is some additional code to 
re-establish linearizability. This makes the code harder to maintain and also 
eliminates most of the performance benefits of this queue. A simple concurrent 
linked queue implementation should yield better results and simplify the code 
of the EventQueue class.
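
For illustration, a minimal sketch of such a queue, modeled on the classic 
two-lock linked queue of Michael and Scott (separate head and tail locks plus a 
sentinel node); the {{TwoLockQueue}} name and interface are made up for the 
example and are not the proposed EventQueue code:

{code}
#include <atomic>
#include <mutex>
#include <optional>
#include <utility>

// Two-lock concurrent linked queue (Michael & Scott, 1996). Enqueue and
// dequeue can run in parallel and each takes effect atomically while the
// corresponding lock is held, which makes the queue linearizable.
template <typename T>
class TwoLockQueue
{
  struct Node
  {
    std::optional<T> value;
    std::atomic<Node*> next{nullptr};
  };

public:
  TwoLockQueue() : head(new Node()), tail(head) {} // 'head' is a sentinel.

  ~TwoLockQueue()
  {
    while (Node* node = head) {
      head = node->next.load();
      delete node;
    }
  }

  void enqueue(T t)
  {
    Node* node = new Node();
    node->value = std::move(t);

    std::lock_guard<std::mutex> lock(tailLock);
    tail->next.store(node, std::memory_order_release);
    tail = node;
  }

  std::optional<T> dequeue()
  {
    std::lock_guard<std::mutex> lock(headLock);
    Node* sentinel = head;
    Node* first = sentinel->next.load(std::memory_order_acquire);

    if (first == nullptr) {
      return std::nullopt; // Queue is empty.
    }

    std::optional<T> result = std::move(first->value);
    head = first;      // 'first' becomes the new sentinel.
    delete sentinel;   // Safe: enqueue never dereferences old sentinels.
    return result;
  }

private:
  Node* head;          // Protected by headLock.
  Node* tail;          // Protected by tailLock.
  std::mutex headLock;
  std::mutex tailLock;
};
{code}

With this design, enqueuers only ever contend with other enqueuers and 
dequeuers with other dequeuers, so no extra bookkeeping is needed to 
re-establish ordering.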



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-7898) Implement control message type that is handled with higher priority

2017-08-16 Thread Dario Rexin (JIRA)
Dario Rexin created MESOS-7898:
--

 Summary: Implement control message type that is handled with 
higher priority
 Key: MESOS-7898
 URL: https://issues.apache.org/jira/browse/MESOS-7898
 Project: Mesos
  Issue Type: Improvement
Reporter: Dario Rexin
Assignee: Dario Rexin


Until recently, libprocess was able to enqueue messages at the front of the 
queue so that they would be processed before other messages. The recent 
changes adding concurrent queues removed this capability. One way to implement 
this feature while keeping the mailbox lock-free would be to use a second queue 
in the mailbox that is only used for control messages. When dequeuing a message, 
this queue would be checked first, before the regular queue is considered. A 
similar mailbox can be found in Akka: 
http://doc.akka.io/docs/akka/2.5.3/scala/mailboxes.html 
(UnboundedControlAwareMailbox).

Since not all processes need this capability and there will be an impact on 
performance, we could consider making this an optional per-process 
configuration.
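
For illustration, a minimal sketch of the two-queue idea; it uses a single mutex 
for brevity rather than the lock-free queues the ticket has in mind, and 
{{Event}} is just a stand-in type, not the libprocess one:

{code}
#include <deque>
#include <mutex>
#include <optional>
#include <string>
#include <utility>

// Placeholder event type; a 'control' flag marks high-priority messages.
struct Event
{
  std::string payload;
  bool control = false;
};

class ControlAwareMailbox
{
public:
  void enqueue(Event event)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    if (event.control) {
      control_.push_back(std::move(event));
    } else {
      regular_.push_back(std::move(event));
    }
  }

  // Control messages are always dequeued before regular ones.
  std::optional<Event> dequeue()
  {
    std::lock_guard<std::mutex> lock(mutex_);
    std::deque<Event>& queue = control_.empty() ? regular_ : control_;
    if (queue.empty()) {
      return std::nullopt;
    }
    Event event = std::move(queue.front());
    queue.pop_front();
    return event;
  }

private:
  std::mutex mutex_;
  std::deque<Event> control_; // Checked first on dequeue.
  std::deque<Event> regular_;
};
{code}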



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-7744) Mesos Agent Sends TASK_KILL status update to Master, and still launches task

2017-08-16 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129236#comment-16129236
 ] 

Benjamin Mahler edited comment on MESOS-7744 at 8/16/17 6:41 PM:
-

[~karya] [~alexr] any reason to drop the targets instead of bumping them? I 
will re-target for 1.4.1 and 1.1.4 (or is there no 1.1.4 planned?)

I realized I added 1.1.3 because the version is considered unreleased in JIRA. 
[~alexr] can you close that version so others don't target it?


was (Author: bmahler):
[~karya] [~alexr] any reason to drop the targets instead of bumping them? I 
will re-target for 1.4.1 and 1.1.4 (or is there no 1.1.4 planned?)

> Mesos Agent Sends TASK_KILL status update to Master, and still launches task
> 
>
> Key: MESOS-7744
> URL: https://issues.apache.org/jira/browse/MESOS-7744
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.1
>Reporter: Sargun Dhillon
>Assignee: Benjamin Mahler
>Priority: Critical
>  Labels: reliability
>
> We sometimes launch jobs, and cancel them in ~7 seconds, if we don't get a 
> TASK_STARTING back from the agent. Under certain conditions it can result in 
> Mesos losing track of the task. The chunk of the logs which is interesting is 
> here:
> {code}
> Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:26.951799  5171 slave.cpp:1495] Got assigned 
> task Titus-7590548-worker-0-4476 for framework TitusFramework
> Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:26.952251  5171 slave.cpp:1614] Launching task 
> Titus-7590548-worker-0-4476 for framework TitusFramework
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:37.484611  5171 slave.cpp:1853] Queuing task 
> ‘Titus-7590548-worker-0-4476’ for executor ‘docker-executor’ of framework 
> TitusFramework at executor(1)@100.66.11.10:17707
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:37.487876  5171 slave.cpp:2035] Asked to kill 
> task Titus-7590548-worker-0-4476 of framework TitusFramework
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:37.488994  5171 slave.cpp:3211] Handling 
> status update TASK_KILLED (UUID: 898215d6-a244-4dbe-bc9c-878a22d36ea4) for 
> task Titus-7590548-worker-0-4476 of framework TitusFramework from @0.0.0.0:0
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:37.490603  5171 slave.cpp:2005] Sending queued 
> task ‘Titus-7590548-worker-0-4476’ to executor ‘docker-executor’ of framework 
> TitusFramework at executor(1)@100.66.11.10:17707{
> {code}
> In our executor, we see that the launch message arrives after the master has 
> already gotten the kill update. We then send non-terminal state updates to 
> the agent, and yet it doesn't forward these to our framework. We're using a 
> custom executor which is based on the older mesos-go bindings. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7744) Mesos Agent Sends TASK_KILL status update to Master, and still launches task

2017-08-16 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7744:
---
Target Version/s: 1.1.3, 1.2.3, 1.3.2, 1.4.0, 1.5.0  (was: 1.2.3, 1.3.2, 
1.4.0, 1.5.0)

> Mesos Agent Sends TASK_KILL status update to Master, and still launches task
> 
>
> Key: MESOS-7744
> URL: https://issues.apache.org/jira/browse/MESOS-7744
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.1
>Reporter: Sargun Dhillon
>Assignee: Benjamin Mahler
>Priority: Blocker
>  Labels: reliability
>
> We sometimes launch jobs, and cancel them in ~7 seconds, if we don't get a 
> TASK_STARTING back from the agent. Under certain conditions it can result in 
> Mesos losing track of the task. The chunk of the logs which is interesting is 
> here:
> {code}
> Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:26.951799  5171 slave.cpp:1495] Got assigned 
> task Titus-7590548-worker-0-4476 for framework TitusFramework
> Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:26.952251  5171 slave.cpp:1614] Launching task 
> Titus-7590548-worker-0-4476 for framework TitusFramework
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:37.484611  5171 slave.cpp:1853] Queuing task 
> ‘Titus-7590548-worker-0-4476’ for executor ‘docker-executor’ of framework 
> TitusFramework at executor(1)@100.66.11.10:17707
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:37.487876  5171 slave.cpp:2035] Asked to kill 
> task Titus-7590548-worker-0-4476 of framework TitusFramework
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:37.488994  5171 slave.cpp:3211] Handling 
> status update TASK_KILLED (UUID: 898215d6-a244-4dbe-bc9c-878a22d36ea4) for 
> task Titus-7590548-worker-0-4476 of framework TitusFramework from @0.0.0.0:0
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:37.490603  5171 slave.cpp:2005] Sending queued 
> task ‘Titus-7590548-worker-0-4476’ to executor ‘docker-executor’ of framework 
> TitusFramework at executor(1)@100.66.11.10:17707
> {code}
> In our executor, we see that the launch message arrives after the master has 
> already gotten the kill update. We then send non-terminal state updates to 
> the agent, and yet it doesn't forward these to our framework. We're using a 
> custom executor which is based on the older mesos-go bindings. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7783) Framework might not receive status update when a just launched task is killed immediately

2017-08-16 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7783:
---
Target Version/s: 1.1.3, 1.2.3, 1.3.2, 1.4.0  (was: 1.2.3, 1.3.2, 1.4.0)

> Framework might not receive status update when a just launched task is killed 
> immediately
> -
>
> Key: MESOS-7783
> URL: https://issues.apache.org/jira/browse/MESOS-7783
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.2.0
>Reporter: Benjamin Bannier
>Assignee: Benjamin Mahler
>Priority: Blocker
>  Labels: reliability
> Attachments: GroupDeployIntegrationTest.log.zip, logs
>
>
> Our Marathon team is seeing issues in their integration test suite where 
> Marathon gets stuck in an infinite loop trying to kill a just-launched task. 
> In their test a task is launched and then immediately killed -- the framework 
> does not, for example, wait for any task status update.
> In this case the launch and kill messages arrive at the agent in the correct 
> order, but neither the launch path nor the kill path in the agent reaches the 
> point where a status update is sent to the framework. Since the framework has 
> seen no status update for the task, it re-triggers the kill, causing an 
> infinite loop.
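
For illustration, a minimal C++ sketch of the retry behaviour described above; 
all types are simplified stand-ins (this is not Marathon's or Mesos' actual 
code). The agent never produces a status update, so the framework keeps 
re-triggering the kill.

{code}
#include <iostream>
#include <optional>
#include <string>

// Stand-in for the buggy agent: neither the launch path nor the kill path
// ever emits a status update for the task.
struct Agent {
  std::optional<std::string> kill(const std::string& taskId) {
    (void)taskId;
    return std::nullopt;  // No TASK_KILLED (or any other) update is sent.
  }
};

int main() {
  Agent agent;
  const std::string taskId = "task-0";

  // The framework retries the kill until it sees any status update.
  // Bounded here only so the sketch terminates; the real loop never ends.
  for (int attempt = 1; attempt <= 3; ++attempt) {
    if (auto update = agent.kill(taskId)) {
      std::cout << "got update: " << *update << '\n';
      return 0;
    }
    std::cout << "no update for " << taskId
              << ", re-triggering kill (attempt " << attempt << ")\n";
  }
  std::cout << "still no update -- this is the infinite loop\n";
}
{code}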



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7744) Mesos Agent Sends TASK_KILL status update to Master, and still launches task

2017-08-16 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7744:
---
Target Version/s: 1.2.3, 1.3.2, 1.4.0, 1.5.0  (was: 1.2.3, 1.3.2, 1.5.0, 
1.4.1)

> Mesos Agent Sends TASK_KILL status update to Master, and still launches task
> 
>
> Key: MESOS-7744
> URL: https://issues.apache.org/jira/browse/MESOS-7744
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.1
>Reporter: Sargun Dhillon
>Assignee: Benjamin Mahler
>Priority: Blocker
>  Labels: reliability
>
> We sometimes launch jobs, and cancel them in ~7 seconds, if we don't get a 
> TASK_STARTING back from the agent. Under certain conditions it can result in 
> Mesos losing track of the task. The chunk of the logs which is interesting is 
> here:
> {code}
> Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:26.951799  5171 slave.cpp:1495] Got assigned 
> task Titus-7590548-worker-0-4476 for framework TitusFramework
> Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:26.952251  5171 slave.cpp:1614] Launching task 
> Titus-7590548-worker-0-4476 for framework TitusFramework
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:37.484611  5171 slave.cpp:1853] Queuing task 
> ‘Titus-7590548-worker-0-4476’ for executor ‘docker-executor’ of framework 
> TitusFramework at executor(1)@100.66.11.10:17707
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:37.487876  5171 slave.cpp:2035] Asked to kill 
> task Titus-7590548-worker-0-4476 of framework TitusFramework
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:37.488994  5171 slave.cpp:3211] Handling 
> status update TASK_KILLED (UUID: 898215d6-a244-4dbe-bc9c-878a22d36ea4) for 
> task Titus-7590548-worker-0-4476 of framework TitusFramework from @0.0.0.0:0
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:37.490603  5171 slave.cpp:2005] Sending queued 
> task ‘Titus-7590548-worker-0-4476’ to executor ‘docker-executor’ of framework 
> TitusFramework at executor(1)@100.66.11.10:17707
> {code}
> In our executor, we see that the launch message arrives after the master has 
> already gotten the kill update. We then send non-terminal state updates to 
> the agent, and yet it doesn't forward these to our framework. We're using a 
> custom executor which is based on the older mesos-go bindings. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7744) Mesos Agent Sends TASK_KILL status update to Master, and still launches task

2017-08-16 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7744:
---
Target Version/s: 1.1.3, 1.2.3, 1.3.2, 1.4.0  (was: 1.1.3, 1.2.3, 1.3.2, 
1.4.0, 1.5.0)

> Mesos Agent Sends TASK_KILL status update to Master, and still launches task
> 
>
> Key: MESOS-7744
> URL: https://issues.apache.org/jira/browse/MESOS-7744
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.1
>Reporter: Sargun Dhillon
>Assignee: Benjamin Mahler
>Priority: Blocker
>  Labels: reliability
>
> We sometimes launch jobs, and cancel them in ~7 seconds, if we don't get a 
> TASK_STARTING back from the agent. Under certain conditions it can result in 
> Mesos losing track of the task. The chunk of the logs which is interesting is 
> here:
> {code}
> Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:26.951799  5171 slave.cpp:1495] Got assigned 
> task Titus-7590548-worker-0-4476 for framework TitusFramework
> Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:26.952251  5171 slave.cpp:1614] Launching task 
> Titus-7590548-worker-0-4476 for framework TitusFramework
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:37.484611  5171 slave.cpp:1853] Queuing task 
> ‘Titus-7590548-worker-0-4476’ for executor ‘docker-executor’ of framework 
> TitusFramework at executor(1)@100.66.11.10:17707
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:37.487876  5171 slave.cpp:2035] Asked to kill 
> task Titus-7590548-worker-0-4476 of framework TitusFramework
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:37.488994  5171 slave.cpp:3211] Handling 
> status update TASK_KILLED (UUID: 898215d6-a244-4dbe-bc9c-878a22d36ea4) for 
> task Titus-7590548-worker-0-4476 of framework TitusFramework from @0.0.0.0:0
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:37.490603  5171 slave.cpp:2005] Sending queued 
> task ‘Titus-7590548-worker-0-4476’ to executor ‘docker-executor’ of framework 
> TitusFramework at executor(1)@100.66.11.10:17707
> {code}
> In our executor, we see that the launch message arrives after the master has 
> already gotten the kill update. We then send non-terminal state updates to 
> the agent, and yet it doesn't forward these to our framework. We're using a 
> custom executor which is based on the older mesos-go bindings. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7863) Agent may drop pending kill task status updates.

2017-08-16 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7863:
---
Target Version/s: 1.1.3, 1.2.3, 1.3.2, 1.4.0  (was: 1.2.3, 1.3.2, 1.4.0)

> Agent may drop pending kill task status updates.
> 
>
> Key: MESOS-7863
> URL: https://issues.apache.org/jira/browse/MESOS-7863
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Blocker
>
> Currently there is an assumption that when a pending task is killed, the 
> framework will still be stored in the agent. However, this assumption can be 
> violated in two cases:
> # Another pending task was killed and we removed the framework in 
> 'Slave::run' thinking it was idle, because pending tasks were empty (we 
> remove from pending tasks when processing the kill). (MESOS-7783 is an 
> example instance of this).
> # The last executor terminated without tasks to send terminal updates for, or 
> the last terminated executor received its last acknowledgement. At this 
> point, we remove the framework thinking there were no pending tasks if the 
> task was killed (removed from pending).
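
A minimal sketch of the guard this implies, using simplified stand-in types 
rather than the actual agent classes: a framework may only be removed once it 
has no pending tasks, no executors, and no status updates still in flight for 
killed pending tasks.

{code}
#include <cassert>
#include <map>
#include <set>
#include <string>

// Stand-in for the agent's per-framework bookkeeping (not the real class).
struct Framework {
  std::set<std::string> pendingTasks;  // Tasks queued before an executor exists.
  std::set<std::string> executors;     // Live or unacknowledged executors.
  int inflightUpdates = 0;             // Status updates not yet acknowledged.

  bool idle() const {
    return pendingTasks.empty() && executors.empty() && inflightUpdates == 0;
  }
};

int main() {
  std::map<std::string, Framework> frameworks;
  frameworks["fw-1"].pendingTasks = {"task-a"};

  // Kill the pending task: it leaves pendingTasks, but its TASK_KILLED
  // update is still in flight, so the framework must not be removed yet.
  frameworks["fw-1"].pendingTasks.erase("task-a");
  frameworks["fw-1"].inflightUpdates++;

  if (frameworks.at("fw-1").idle()) {
    frameworks.erase("fw-1");
  }
  assert(frameworks.count("fw-1") == 1);  // Still tracked; update not dropped.

  // Once the update is acknowledged the framework really is idle.
  frameworks.at("fw-1").inflightUpdates--;
  if (frameworks.at("fw-1").idle()) {
    frameworks.erase("fw-1");
  }
  assert(frameworks.count("fw-1") == 0);
}
{code}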



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7865) Agent may process a kill task and still launch the task.

2017-08-16 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7865:
---
Target Version/s: 1.1.3, 1.2.3, 1.3.2, 1.4.0

> Agent may process a kill task and still launch the task.
> 
>
> Key: MESOS-7865
> URL: https://issues.apache.org/jira/browse/MESOS-7865
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Blocker
>
> Based on the investigation of MESOS-7744, the agent has a race in which 
> "queued" tasks can still be launched after the agent has processed a kill 
> task for them. This race was introduced when {{Slave::statusUpdate}} was made 
> asynchronous:
> (1) {{Slave::__run}} completes, task is now within {{Executor::queuedTasks}}
> (2) {{Slave::killTask}} locates the executor based on the task ID residing in 
> queuedTasks, calls {{Slave::statusUpdate()}} with {{TASK_KILLED}}
> (3) {{Slave::___run}} assumes that killed tasks have been removed from 
> {{Executor::queuedTasks}}, but this now occurs asynchronously in 
> {{Slave::_statusUpdate}}. So, the executor still sees the queued task and 
> delivers it and adds the task to {{Executor::launchedTasks}}.
> (4) {{Slave::_statusUpdate}} runs, removes the task from 
> {{Executor::launchedTasks}} and adds it to {{Executor::terminatedTasks}}.
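
For illustration, a self-contained C++ sketch of the ordering in steps (1)-(4) 
above, with simplified stand-in types rather than the real Slave/Executor 
classes; the deferred removal from queuedTasks is what lets the delivery step 
still see the task.

{code}
#include <cassert>
#include <functional>
#include <queue>
#include <set>
#include <string>

// Stand-in for the agent's per-executor bookkeeping (not the real struct).
struct Executor {
  std::set<std::string> queuedTasks;
  std::set<std::string> launchedTasks;
  std::set<std::string> terminatedTasks;
};

int main() {
  Executor executor;
  std::queue<std::function<void()>> dispatchQueue;  // Models async dispatch.

  // (1) __run completed: the task now sits in queuedTasks.
  executor.queuedTasks.insert("task-0");

  // (2) killTask: the TASK_KILLED handling is asynchronous, so the removal
  //     from queuedTasks is only *queued* here, not performed.
  dispatchQueue.push([&executor] {
    executor.queuedTasks.erase("task-0");
    executor.launchedTasks.erase("task-0");
    executor.terminatedTasks.insert("task-0");
  });

  // (3) ___run runs before the deferred _statusUpdate: it still sees the
  //     task in queuedTasks and delivers it to the executor.
  bool deliveredToExecutor = false;
  if (executor.queuedTasks.count("task-0")) {
    executor.queuedTasks.erase("task-0");
    executor.launchedTasks.insert("task-0");
    deliveredToExecutor = true;
  }

  // (4) The deferred _statusUpdate finally runs and marks the task terminated.
  dispatchQueue.front()();
  dispatchQueue.pop();

  // The task was delivered and is actually running, yet the bookkeeping says
  // it terminated -- the agent has lost track of it.
  assert(deliveredToExecutor);
  assert(executor.terminatedTasks.count("task-0") == 1);
  assert(executor.launchedTasks.count("task-0") == 0);
}
{code}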



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7744) Mesos Agent Sends TASK_KILL status update to Master, and still launches task

2017-08-16 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7744:
---
Priority: Blocker  (was: Critical)

> Mesos Agent Sends TASK_KILL status update to Master, and still launches task
> 
>
> Key: MESOS-7744
> URL: https://issues.apache.org/jira/browse/MESOS-7744
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.1
>Reporter: Sargun Dhillon
>Assignee: Benjamin Mahler
>Priority: Blocker
>  Labels: reliability
>
> We sometimes launch jobs, and cancel them in ~7 seconds, if we don't get a 
> TASK_STARTING back from the agent. Under certain conditions it can result in 
> Mesos losing track of the task. The chunk of the logs which is interesting is 
> here:
> {code}
> Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:26.951799  5171 slave.cpp:1495] Got assigned 
> task Titus-7590548-worker-0-4476 for framework TitusFramework
> Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:26.952251  5171 slave.cpp:1614] Launching task 
> Titus-7590548-worker-0-4476 for framework TitusFramework
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:37.484611  5171 slave.cpp:1853] Queuing task 
> ‘Titus-7590548-worker-0-4476’ for executor ‘docker-executor’ of framework 
> TitusFramework at executor(1)@100.66.11.10:17707
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:37.487876  5171 slave.cpp:2035] Asked to kill 
> task Titus-7590548-worker-0-4476 of framework TitusFramework
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:37.488994  5171 slave.cpp:3211] Handling 
> status update TASK_KILLED (UUID: 898215d6-a244-4dbe-bc9c-878a22d36ea4) for 
> task Titus-7590548-worker-0-4476 of framework TitusFramework from @0.0.0.0:0
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:37.490603  5171 slave.cpp:2005] Sending queued 
> task ‘Titus-7590548-worker-0-4476’ to executor ‘docker-executor’ of framework 
> TitusFramework at executor(1)@100.66.11.10:17707{
> {code}
> In our executor, we see that the launch message arrives after the master has 
> already gotten the kill update. We then send non-terminal state updates to 
> the agent, and yet it doesn't forward these to our framework. We're using a 
> custom executor which is based on the older mesos-go bindings. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7865) Agent may process a kill task and still launch the task.

2017-08-16 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7865:
---
Priority: Blocker  (was: Critical)

> Agent may process a kill task and still launch the task.
> 
>
> Key: MESOS-7865
> URL: https://issues.apache.org/jira/browse/MESOS-7865
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Blocker
>
> Based on the investigation of MESOS-7744, the agent has a race in which 
> "queued" tasks can still be launched after the agent has processed a kill 
> task for them. This race was introduced when {{Slave::statusUpdate}} was made 
> asynchronous:
> (1) {{Slave::__run}} completes, task is now within {{Executor::queuedTasks}}
> (2) {{Slave::killTask}} locates the executor based on the task ID residing in 
> queuedTasks, calls {{Slave::statusUpdate()}} with {{TASK_KILLED}}
> (3) {{Slave::___run}} assumes that killed tasks have been removed from 
> {{Executor::queuedTasks}}, but this now occurs asynchronously in 
> {{Slave::_statusUpdate}}. So, the executor still sees the queued task and 
> delivers it and adds the task to {{Executor::launchedTasks}}.
> (4) {{Slave::_statusUpdate}} runs, removes the task from 
> {{Executor::launchedTasks}} and adds it to {{Executor::terminatedTasks}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7863) Agent may drop pending kill task status updates.

2017-08-16 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7863:
---
Priority: Blocker  (was: Critical)

> Agent may drop pending kill task status updates.
> 
>
> Key: MESOS-7863
> URL: https://issues.apache.org/jira/browse/MESOS-7863
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Blocker
>
> Currently there is an assumption that when a pending task is killed, the 
> framework will still be stored in the agent. However, this assumption can be 
> violated in two cases:
> # Another pending task was killed and we removed the framework in 
> 'Slave::run' thinking it was idle, because pending tasks were empty (we 
> remove from pending tasks when processing the kill). (MESOS-7783 is an 
> example instance of this).
> # The last executor terminated without tasks to send terminal updates for, or 
> the last terminated executor received its last acknowledgement. At this 
> point, we remove the framework thinking there were no pending tasks if the 
> task was killed (removed from pending).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7783) Framework might not receive status update when a just launched task is killed immediately

2017-08-16 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7783:
---
Priority: Blocker  (was: Critical)

> Framework might not receive status update when a just launched task is killed 
> immediately
> -
>
> Key: MESOS-7783
> URL: https://issues.apache.org/jira/browse/MESOS-7783
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.2.0
>Reporter: Benjamin Bannier
>Assignee: Benjamin Mahler
>Priority: Blocker
>  Labels: reliability
> Attachments: GroupDeployIntegrationTest.log.zip, logs
>
>
> Our Marathon team is seeing issues in their integration test suite where 
> Marathon gets stuck in an infinite loop trying to kill a just-launched task. 
> In their test a task is launched and then immediately killed -- the framework 
> does not, for example, wait for any task status update.
> In this case the launch and kill messages arrive at the agent in the correct 
> order, but neither the launch path nor the kill path in the agent reaches the 
> point where a status update is sent to the framework. Since the framework has 
> seen no status update for the task, it re-triggers the kill, causing an 
> infinite loop.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7744) Mesos Agent Sends TASK_KILL status update to Master, and still launches task

2017-08-16 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129236#comment-16129236
 ] 

Benjamin Mahler commented on MESOS-7744:


[~karya] [~alexr] any reason to drop the targets instead of bumping them? I 
will re-target for 1.4.1 and 1.1.4 (or is there no 1.1.4 planned?)

> Mesos Agent Sends TASK_KILL status update to Master, and still launches task
> 
>
> Key: MESOS-7744
> URL: https://issues.apache.org/jira/browse/MESOS-7744
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.1
>Reporter: Sargun Dhillon
>Assignee: Benjamin Mahler
>Priority: Critical
>  Labels: reliability
>
> We sometimes launch jobs, and cancel them in ~7 seconds, if we don't get a 
> TASK_STARTING back from the agent. Under certain conditions it can result in 
> Mesos losing track of the task. The chunk of the logs which is interesting is 
> here:
> {code}
> Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:26.951799  5171 slave.cpp:1495] Got assigned 
> task Titus-7590548-worker-0-4476 for framework TitusFramework
> Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:26.952251  5171 slave.cpp:1614] Launching task 
> Titus-7590548-worker-0-4476 for framework TitusFramework
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:37.484611  5171 slave.cpp:1853] Queuing task 
> ‘Titus-7590548-worker-0-4476’ for executor ‘docker-executor’ of framework 
> TitusFramework at executor(1)@100.66.11.10:17707
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:37.487876  5171 slave.cpp:2035] Asked to kill 
> task Titus-7590548-worker-0-4476 of framework TitusFramework
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:37.488994  5171 slave.cpp:3211] Handling 
> status update TASK_KILLED (UUID: 898215d6-a244-4dbe-bc9c-878a22d36ea4) for 
> task Titus-7590548-worker-0-4476 of framework TitusFramework from @0.0.0.0:0
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:37.490603  5171 slave.cpp:2005] Sending queued 
> task ‘Titus-7590548-worker-0-4476’ to executor ‘docker-executor’ of framework 
> TitusFramework at executor(1)@100.66.11.10:17707
> {code}
> In our executor, we see that the launch message arrives after the master has 
> already gotten the kill update. We then send non-terminal state updates to 
> the agent, and yet it doesn't forward these to our framework. We're using a 
> custom executor which is based on the older mesos-go bindings. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7744) Mesos Agent Sends TASK_KILL status update to Master, and still launches task

2017-08-16 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129261#comment-16129261
 ] 

Benjamin Mahler commented on MESOS-7744:


Ok, synced with [~alexr], looks like this will miss the 1.1.3 train and there 
is no 1.1.4 planned.

> Mesos Agent Sends TASK_KILL status update to Master, and still launches task
> 
>
> Key: MESOS-7744
> URL: https://issues.apache.org/jira/browse/MESOS-7744
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.1
>Reporter: Sargun Dhillon
>Assignee: Benjamin Mahler
>Priority: Critical
>  Labels: reliability
>
> We sometimes launch jobs, and cancel them in ~7 seconds, if we don't get a 
> TASK_STARTING back from the agent. Under certain conditions it can result in 
> Mesos losing track of the task. The chunk of the logs which is interesting is 
> here:
> {code}
> Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:26.951799  5171 slave.cpp:1495] Got assigned 
> task Titus-7590548-worker-0-4476 for framework TitusFramework
> Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:26.952251  5171 slave.cpp:1614] Launching task 
> Titus-7590548-worker-0-4476 for framework TitusFramework
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:37.484611  5171 slave.cpp:1853] Queuing task 
> ‘Titus-7590548-worker-0-4476’ for executor ‘docker-executor’ of framework 
> TitusFramework at executor(1)@100.66.11.10:17707
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:37.487876  5171 slave.cpp:2035] Asked to kill 
> task Titus-7590548-worker-0-4476 of framework TitusFramework
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:37.488994  5171 slave.cpp:3211] Handling 
> status update TASK_KILLED (UUID: 898215d6-a244-4dbe-bc9c-878a22d36ea4) for 
> task Titus-7590548-worker-0-4476 of framework TitusFramework from @0.0.0.0:0
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c 
> mesos-slave[4290]: I0629 23:22:37.490603  5171 slave.cpp:2005] Sending queued 
> task ‘Titus-7590548-worker-0-4476’ to executor ‘docker-executor’ of framework 
> TitusFramework at executor(1)@100.66.11.10:17707
> {code}
> In our executor, we see that the launch message arrives after the master has 
> already gotten the kill update. We then send non-terminal state updates to 
> the agent, and yet it doesn't forward these to our framework. We're using a 
> custom executor which is based on the older mesos-go bindings. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7513) Expose the container sandbox path to users via the API.

2017-08-16 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129288#comment-16129288
 ] 

Benjamin Mahler commented on MESOS-7513:


[~zhitao] that's an option; after discussing with [~anandmazumdar], I updated 
the ticket to be just about simplifying the path and exposing it consistently 
in the master and agent APIs.

> Expose the container sandbox path to users via the API.
> ---
>
> Key: MESOS-7513
> URL: https://issues.apache.org/jira/browse/MESOS-7513
> Project: Mesos
>  Issue Type: Task
>Reporter: Anand Mazumdar
>  Labels: mesosphere
>
> Currently, only the agent exposes the executor sandbox via a {{directory}} 
> field in the executor JSON for the v0 API. The master's v0 API and all of the 
> v1 API do not expose the executor sandbox at all.
> As a result, users reverse engineer the logic for generating the path and use 
> it in their scripts. To add to the difficulty, the path currently includes 
> the agent's work directory which is only obtainable from the agent endpoints 
> (i.e. {{//frameworks//executors/}}) rather 
> than exposing a virtual path (i.e. {{/frameworks//executors/}}), 
> like we did for {{/slave/log}} and {{/master/log}}.
> We should expose the executor sandbox directory to users consistently in both 
> the master and agent v0/v1 APIs, as well as simplify the path format so that 
> users don't know about the agent's work directory.
> This also needs to work for nested containers.
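
For context, a hedged sketch of the path reconstruction that scripts end up 
doing today; the directory layout and IDs below are illustrative assumptions, 
not a documented interface -- which is exactly what this ticket wants to fix.

{code}
#include <iostream>
#include <string>

// Joins the components the way scripts typically reverse engineer the
// executor sandbox location from the agent's work directory.
std::string sandboxPath(
    const std::string& workDir,
    const std::string& agentId,
    const std::string& frameworkId,
    const std::string& executorId,
    const std::string& containerId) {
  return workDir + "/slaves/" + agentId +
         "/frameworks/" + frameworkId +
         "/executors/" + executorId +
         "/runs/" + containerId;
}

int main() {
  // All IDs here are made up for illustration.
  std::cout << sandboxPath("/var/lib/mesos/slave",
                           "agent-id", "framework-id", "executor-id", "latest")
            << '\n';
}
{code}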



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7513) Expose the container sandbox path to users via the API.

2017-08-16 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7513:
---
Description: 
Currently, only the agent exposes the executor sandbox via a {{directory}} 
field in the executor JSON for the v0 API. The master's v0 API and all of the 
v1 API do not expose the executor sandbox at all.

As a result, users reverse engineer the logic for generating the path and use 
it in their scripts. To add to the difficulty, the path currently includes the 
agent's work directory which is only obtainable from the agent endpoints (i.e. 
{{//frameworks//executors/}}) rather than 
exposing a virtual path (i.e. {{/frameworks//executors/}}), like we 
did for {{/slave/log}} and {{/master/log}}.

We should expose the executor sandbox directory to users consistently in both 
the master and agent v0/v1 APIs, as well as simplify the path format so that 
users don't know about the agent's work directory.

This also needs to work for nested containers.

  was:
Currently, there is no public API for getting the path to the sandbox of a 
running container. This leads to folks reverse engineering the Mesos logic for 
generating the path and then using it in their scripts. This is already done by 
the Mesos Web UI and the DC/OS CLI. This is prone to errors if the Mesos path 
logic changes in the upcoming versions.

We should introduce new calls on the v1 Agent API, 
{{GET_CONTAINER_SANDBOX_PATH}}/{{GET_EXECUTOR_SANDBOX_PATH}}, to get the path 
to a running container (which can be nested) and another call to get the path 
to the executor sandbox.

Summary: Expose the container sandbox path to users via the API.  (was: 
Consider introducing an API call to get the sandbox of a running container.)

> Expose the container sandbox path to users via the API.
> ---
>
> Key: MESOS-7513
> URL: https://issues.apache.org/jira/browse/MESOS-7513
> Project: Mesos
>  Issue Type: Task
>Reporter: Anand Mazumdar
>  Labels: mesosphere
>
> Currently, only the agent exposes the executor sandbox via a {{directory}} 
> field in the executor JSON for the v0 API. The master's v0 API and all of the 
> v1 API do not expose the executor sandbox at all.
> As a result, users reverse engineer the logic for generating the path and use 
> it in their scripts. To add to the difficulty, the path currently includes 
> the agent's work directory which is only obtainable from the agent endpoints 
> (i.e. {{//frameworks//executors/}}) rather 
> than exposing a virtual path (i.e. {{/frameworks//executors/}}), 
> like we did for {{/slave/log}} and {{/master/log}}.
> We should expose the executor sandbox directory to users consistently in both 
> the master and agent v0/v1 APIs, as well as simplify the path format so that 
> users don't know about the agent's work directory.
> This also needs to work for nested containers.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7865) Agent may process a kill task and still launch the task.

2017-08-16 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7865:
---
Priority: Critical  (was: Major)

> Agent may process a kill task and still launch the task.
> 
>
> Key: MESOS-7865
> URL: https://issues.apache.org/jira/browse/MESOS-7865
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Critical
>
> Based on the investigation of MESOS-7744, the agent has a race in which 
> "queued" tasks can still be launched after the agent has processed a kill 
> task for them. This race was introduced when {{Slave::statusUpdate}} was made 
> asynchronous:
> (1) {{Slave::__run}} completes, task is now within {{Executor::queuedTasks}}
> (2) {{Slave::killTask}} locates the executor based on the task ID residing in 
> queuedTasks, calls {{Slave::statusUpdate()}} with {{TASK_KILLED}}
> (3) {{Slave::___run}} assumes that killed tasks have been removed from 
> {{Executor::queuedTasks}}, but this now occurs asynchronously in 
> {{Slave::_statusUpdate}}. So, the executor still sees the queued task and 
> delivers it and adds the task to {{Executor::launchedTasks}}.
> (4) {{Slave::_statusUpdate}} runs, removes the task from 
> {{Executor::launchedTasks}} and adds it to {{Executor::terminatedTasks}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7861) Include check output in the DefaultExecutor log

2017-08-16 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-7861:
-
Shepherd: Greg Mann

> Include check output in the DefaultExecutor log
> ---
>
> Key: MESOS-7861
> URL: https://issues.apache.org/jira/browse/MESOS-7861
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Affects Versions: 1.3.0
>Reporter: Michael Browning
>Assignee: Gastón Kleiman
>Priority: Minor
>  Labels: check, default-executor, health-check, mesosphere
>
> With the default executor, health and readiness checks are run in their own 
> nested containers, whose sandboxes are cleaned up right before performing the 
> next check. This makes access to stdout/stderr of previous runs of the check 
> command effectively impossible.
> Although the exit code of the command being run is reported in a task status, 
> it is often necessary to see the command's actual output when debugging a 
> framework issue, so the ability to access this output via the executor logs 
> would be helpful.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7315) Design doc for resource provider and storage integration.

2017-08-16 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-7315:
--
Labels: storage  (was: )

> Design doc for resource provider and storage integration.
> -
>
> Key: MESOS-7315
> URL: https://issues.apache.org/jira/browse/MESOS-7315
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Jie Yu
>  Labels: storage
> Fix For: 1.4.0
>
>
> https://docs.google.com/document/d/125YWqg_5BB5OY9a6M7LZcby5RSqBwo2PZzpVLuxYXh4/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7336) Add resource provider API protobuf.

2017-08-16 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-7336:
--
Labels: storage  (was: )

> Add resource provider API protobuf.
> ---
>
> Key: MESOS-7336
> URL: https://issues.apache.org/jira/browse/MESOS-7336
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Jie Yu
>  Labels: storage
> Fix For: 1.3.0
>
>
> The resource provider API will be Event/Call based, similar to the scheduler 
> or executor API. Resource providers will use this API to interact with the 
> master, sending Calls to the master and receiving Events from the master.
> The same API will be used for both local resource providers and external 
> resource providers.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-7861) Include check output in the DefaultExecutor log

2017-08-16 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/MESOS-7861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gastón Kleiman reassigned MESOS-7861:
-

Assignee: Gastón Kleiman

> Include check output in the DefaultExecutor log
> ---
>
> Key: MESOS-7861
> URL: https://issues.apache.org/jira/browse/MESOS-7861
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Affects Versions: 1.3.0
>Reporter: Michael Browning
>Assignee: Gastón Kleiman
>Priority: Minor
>  Labels: check, default-executor, health-check, mesosphere
>
> With the default executor, health and readiness checks are run in their own 
> nested containers, whose sandboxes are cleaned up right before performing the 
> next check. This makes access to stdout/stderr of previous runs of the check 
> command effectively impossible.
> Although the exit code of the command being run is reported in a task status, 
> it is often necessary to see the command's actual output when debugging a 
> framework issue, so the ability to access this output via the executor logs 
> would be helpful.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7304) Fetcher should not depend on SlaveID.

2017-08-16 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-7304:
--
Labels: mesosphere storage  (was: mesosphere)

> Fetcher should not depend on SlaveID.
> -
>
> Key: MESOS-7304
> URL: https://issues.apache.org/jira/browse/MESOS-7304
> Project: Mesos
>  Issue Type: Task
>  Components: containerization, fetcher
>Reporter: Jie Yu
>Assignee: Joseph Wu
>  Labels: mesosphere, storage
> Fix For: 1.4.0
>
>
> Currently, various Fetcher interfaces depend on SlaveID, which is an 
> unnecessary coupling. For instance:
> {code}
> Try<Nothing> Fetcher::recover(const SlaveID& slaveId, const Flags& flags);
> Future<Nothing> Fetcher::fetch(
>     const ContainerID& containerId,
>     const CommandInfo& commandInfo,
>     const string& sandboxDirectory,
>     const Option<string>& user,
>     const SlaveID& slaveId,
>     const Flags& flags);
> {code}
> Looks like the only reason we need a SlaveID is because we need to calculate 
> the fetcher cache directory based on that. We should calculate the fetcher 
> cache directory in the caller and pass that directory to Fetcher.
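
A hedged sketch of that suggestion, with simplified stand-in types rather than 
the real Mesos Fetcher API: the caller keeps the SlaveID, derives the cache 
directory from it, and the fetcher interface only ever sees plain paths.

{code}
#include <iostream>
#include <string>

struct SlaveID { std::string value; };  // Stand-in for the protobuf type.

// Caller-side helper: the only place that still needs the SlaveID.
std::string fetcherCacheDirectory(
    const std::string& fetcherCacheDir, const SlaveID& slaveId) {
  return fetcherCacheDir + "/" + slaveId.value;
}

// The fetcher interface no longer mentions SlaveID at all.
struct Fetcher {
  void recover(const std::string& cacheDirectory) {
    std::cout << "recovering cache in " << cacheDirectory << '\n';
  }

  void fetch(const std::string& sandboxDirectory,
             const std::string& cacheDirectory) {
    std::cout << "fetching into " << sandboxDirectory
              << " using cache " << cacheDirectory << '\n';
  }
};

int main() {
  SlaveID slaveId{"agent-123"};  // Made-up ID for illustration.
  const std::string cacheDir =
      fetcherCacheDirectory("/tmp/mesos/fetch", slaveId);

  Fetcher fetcher;
  fetcher.recover(cacheDir);
  fetcher.fetch("/tmp/sandbox", cacheDir);
}
{code}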



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7560) Add 'type' and 'name' to ResourceProviderInfo.

2017-08-16 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-7560:
--
Labels: storage  (was: )

> Add 'type' and 'name' to ResourceProviderInfo.
> --
>
> Key: MESOS-7560
> URL: https://issues.apache.org/jira/browse/MESOS-7560
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Jie Yu
>  Labels: storage
> Fix For: 1.4.0
>
>
> The 'type' field will be used to load the corresponding implementation 
> (either internal or via a module). To avoid conflicts, the naming should 
> follow the Java package naming scheme (e.g., 
> org.apache.mesos.resource_provider.local.storage).
> Since there could be multiple instances of the same resource provider type, 
> it's important to also add a 'name' field to distinguish between instances of 
> the same type.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7593) Update offer handling in the master to consider local resource providers

2017-08-16 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-7593:
--
Labels: mesosphere storage  (was: mesosphere)

> Update offer handling in the master to consider local resource providers
> 
>
> Key: MESOS-7593
> URL: https://issues.apache.org/jira/browse/MESOS-7593
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>  Labels: mesosphere, storage
> Fix For: 1.4.0
>
>
> Adding local resource providers to the allocator will result in offers being 
> created for them. This needs to be reflected in the master, i.e. the 
> resources need to be added to the offers that will be sent to frameworks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7347) Prototype resource offer operation handling in the master

2017-08-16 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-7347:
--
Labels: mesosphere storage  (was: mesosphere)

> Prototype resource offer operation handling in the master
> -
>
> Key: MESOS-7347
> URL: https://issues.apache.org/jira/browse/MESOS-7347
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>  Labels: mesosphere, storage
>
> Prototype the following workflow in the master, in accordance with the 
> resource provider design:
> * Handle accept calls including resource provider related offer operations 
> ({{CREATE_VOLUME}}, ...)
> * Implement internal bookkeeping of the disk resources these operations will 
> be applied on
> * Implement resource bookkeeping for resource providers in the master
> * Send resource provider operations to resource providers



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7449) Refactor containerizers to not depend on TaskInfo or ExecutorInfo

2017-08-16 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-7449:
--
Labels: mesosphere storage  (was: mesosphere)

> Refactor containerizers to not depend on TaskInfo or ExecutorInfo
> -
>
> Key: MESOS-7449
> URL: https://issues.apache.org/jira/browse/MESOS-7449
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: mesosphere, storage
> Fix For: 1.4.0
>
>
> The Containerizer interfaces should be refactored so that they do not depend 
> on {{TaskInfo}} or {{ExecutorInfo}}, as a standalone container will have 
> neither.
> Currently, the {{launch}} interface depends on those fields.  Instead, we 
> should consistently use {{ContainerInfo}} and {{CommandInfo}} in 
> Containerizer and isolators.
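
As a rough sketch of the shape this argues for (stand-in structs, not the real 
Mesos protobufs or Containerizer interface), a launch request built only from 
ContainerInfo and CommandInfo works equally well for a standalone container:

{code}
#include <iostream>
#include <optional>
#include <string>

struct CommandInfo { std::string value; };    // What to run.
struct ContainerInfo { std::string image; };  // How to containerize it.

// A launch request usable for tasks, executors, and standalone containers
// alike: no TaskInfo or ExecutorInfo required.
struct ContainerLaunchConfig {
  CommandInfo command;
  std::optional<ContainerInfo> container;
  std::string sandboxDirectory;
};

struct Containerizer {
  bool launch(const std::string& containerId,
              const ContainerLaunchConfig& config) {
    std::cout << "launching " << containerId << ": " << config.command.value
              << " in " << config.sandboxDirectory << '\n';
    return true;
  }
};

int main() {
  Containerizer containerizer;
  containerizer.launch(
      "standalone-1",
      {CommandInfo{"sleep 1000"}, ContainerInfo{"alpine"}, "/tmp/sandbox"});
}
{code}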



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7591) Update master to use resource provider IDs instead of agent ID in allocator calls.

2017-08-16 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-7591:
--
Labels: mesosphere storage  (was: mesosphere)

> Update master to use resource provider IDs instead of agent ID in allocator 
> calls.
> --
>
> Key: MESOS-7591
> URL: https://issues.apache.org/jira/browse/MESOS-7591
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>  Labels: mesosphere, storage
> Fix For: 1.4.0
>
>
> MESOS-7388 updates the allocator interface to use {{ResourceProviderID}} 
> instead of {{SlaveID}}. The usage of the allocator in the master has to be 
> updated accordingly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7571) Add `--resource_provider_config_dir` flag to the agent.

2017-08-16 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-7571:
--
Labels: storage  (was: )

> Add `--resource_provider_config_dir` flag to the agent.
> ---
>
> Key: MESOS-7571
> URL: https://issues.apache.org/jira/browse/MESOS-7571
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Jie Yu
>  Labels: storage
> Fix For: 1.4.0
>
>
> Add an agent flag `--resource_provider_config_dir` to allow operators to 
> specify the list of local resource providers to register with the agent.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7007) filesystem/shared and --default_container_info broken since 1.1

2017-08-16 Thread R.B. Boyer (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129496#comment-16129496
 ] 

R.B. Boyer commented on MESOS-7007:
---

This bug still affects 1.2.2.  A variation of one of the above patches works on 
this lineage I think:

{code}
diff --git a/mesos-1.2.2/src/slave/containerizer/mesos/containerizer.cpp 
b/mesos-1.2.2/src/slave/containerizer/mesos/containerizer.cpp
index 777ffc7..2057166 100644
--- a/mesos-1.2.2/src/slave/containerizer/mesos/containerizer.cpp
+++ b/mesos-1.2.2/src/slave/containerizer/mesos/containerizer.cpp
@@ -1037,6 +1037,7 @@ 
src/slave/containerizer/mesos/isolators/network/cni/cni.cpp: executo
   containerConfig.mutable_executor_info()->CopyFrom(executorInfo);
   containerConfig.mutable_command_info()->CopyFrom(executorInfo.command());
   containerConfig.mutable_resources()->CopyFrom(executorInfo.resources());
+  containerConfig.mutable_container_info()->CopyFrom(executorInfo.container());
   containerConfig.set_directory(directory);
 
   if (user.isSome()) {
{code}


> filesystem/shared and --default_container_info broken since 1.1
> ---
>
> Key: MESOS-7007
> URL: https://issues.apache.org/jira/browse/MESOS-7007
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Pierre Cheynier
>Assignee: Chun-Hung Hsiao
>  Labels: storage
>
> I'm facing this issue, which prevents me from upgrading to 1.1.0 (the 
> breaking change was introduced in that version):
> I'm using default_container_info to mount a /tmp volume in the container's 
> mount namespace from its current sandbox, meaning that each container has a 
> dedicated /tmp, thanks to the {{filesystem/shared}} isolator.
> I noticed through our automation pipeline that integration tests were failing 
> and found that this is because the contents of /tmp (the one from the host!) 
> are trashed each time a container is created.
> Here is my setup: 
> * 
> {{--isolation='cgroups/cpu,cgroups/mem,namespaces/pid,*disk/du,filesystem/shared,filesystem/linux*,docker/runtime'}}
> * 
> {{--default_container_info='\{"type":"MESOS","volumes":\[\{"host_path":"tmp","container_path":"/tmp","mode":"RW"\}\]\}'}}
> I discovered this issue in the early days of 1.1 (end of Nov, spoke with 
> someone on Slack), but had unfortunately no time to dig into the symptoms a 
> bit more.
> I found nothing interesting even using GLOGv=3.
> Maybe it's a bad usage of isolators that triggers this issue? If that's the 
> case, then at least a documentation update should be done.
> Let me know if more information is needed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-7007) filesystem/shared and --default_container_info broken since 1.1

2017-08-16 Thread R.B. Boyer (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129496#comment-16129496
 ] 

R.B. Boyer edited comment on MESOS-7007 at 8/16/17 10:11 PM:
-

This bug still affects 1.2.2.  A variation of one of the above patches also 
works on 1.2.x:

{code}
diff --git a/mesos-1.2.2/src/slave/containerizer/mesos/containerizer.cpp 
b/mesos-1.2.2/src/slave/containerizer/mesos/containerizer.cpp
index 777ffc7..2057166 100644
--- a/mesos-1.2.2/src/slave/containerizer/mesos/containerizer.cpp
+++ b/mesos-1.2.2/src/slave/containerizer/mesos/containerizer.cpp
@@ -1037,6 +1037,7 @@ 
src/slave/containerizer/mesos/isolators/network/cni/cni.cpp: executo
   containerConfig.mutable_executor_info()->CopyFrom(executorInfo);
   containerConfig.mutable_command_info()->CopyFrom(executorInfo.command());
   containerConfig.mutable_resources()->CopyFrom(executorInfo.resources());
+  containerConfig.mutable_container_info()->CopyFrom(executorInfo.container());
   containerConfig.set_directory(directory);
 
   if (user.isSome()) {
{code}

I don't think I know enough about this particular file to understand why you'd 
copy something from the ExecutorInfo blindly (as in my patch) or conditionally 
(as in patchset 58980).
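
For what it's worth, a hedged sketch of what a conditional copy could look like 
(simplified stand-ins for the protobuf types; I have not verified that this 
matches patchset 58980):

{code}
#include <iostream>
#include <optional>
#include <string>

struct ContainerInfo { std::string type; };

// Minimal stand-in for the protobuf-generated ExecutorInfo accessors.
struct ExecutorInfo {
  std::optional<ContainerInfo> container_;
  bool has_container() const { return container_.has_value(); }
  const ContainerInfo& container() const { return *container_; }
};

struct ContainerConfig {
  std::optional<ContainerInfo> container_info;
};

int main() {
  ExecutorInfo executorInfo;       // No ContainerInfo set on this executor.
  ContainerConfig containerConfig;

  // Copy only when the executor actually carries a ContainerInfo, so a
  // later default (e.g. --default_container_info) is not clobbered.
  if (executorInfo.has_container()) {
    containerConfig.container_info = executorInfo.container();
  }

  std::cout << (containerConfig.container_info
                    ? "container_info copied from ExecutorInfo"
                    : "container_info left unset (default may apply)")
            << '\n';
}
{code}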


was (Author: naelyn):
This bug still affects 1.2.2.  A variation of one of the above patches works on 
this lineage I think:

{code}
diff --git a/mesos-1.2.2/src/slave/containerizer/mesos/containerizer.cpp 
b/mesos-1.2.2/src/slave/containerizer/mesos/containerizer.cpp
index 777ffc7..2057166 100644
--- a/mesos-1.2.2/src/slave/containerizer/mesos/containerizer.cpp
+++ b/mesos-1.2.2/src/slave/containerizer/mesos/containerizer.cpp
@@ -1037,6 +1037,7 @@ 
src/slave/containerizer/mesos/isolators/network/cni/cni.cpp: executo
   containerConfig.mutable_executor_info()->CopyFrom(executorInfo);
   containerConfig.mutable_command_info()->CopyFrom(executorInfo.command());
   containerConfig.mutable_resources()->CopyFrom(executorInfo.resources());
+  containerConfig.mutable_container_info()->CopyFrom(executorInfo.container());
   containerConfig.set_directory(directory);
 
   if (user.isSome()) {
{code}


> filesystem/shared and --default_container_info broken since 1.1
> ---
>
> Key: MESOS-7007
> URL: https://issues.apache.org/jira/browse/MESOS-7007
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Pierre Cheynier
>Assignee: Chun-Hung Hsiao
>  Labels: storage
>
> I'm facing this issue, which prevents me from upgrading to 1.1.0 (the 
> breaking change was introduced in that version):
> I'm using default_container_info to mount a /tmp volume in the container's 
> mount namespace from its current sandbox, meaning that each container has a 
> dedicated /tmp, thanks to the {{filesystem/shared}} isolator.
> I noticed through our automation pipeline that integration tests were failing 
> and found that this is because the contents of /tmp (the one from the host!) 
> are trashed each time a container is created.
> Here is my setup: 
> * 
> {{--isolation='cgroups/cpu,cgroups/mem,namespaces/pid,*disk/du,filesystem/shared,filesystem/linux*,docker/runtime'}}
> * 
> {{--default_container_info='\{"type":"MESOS","volumes":\[\{"host_path":"tmp","container_path":"/tmp","mode":"RW"\}\]\}'}}
> I discovered this issue in the early days of 1.1 (end of Nov, spoke with 
> someone on Slack), but had unfortunately no time to dig into the symptoms a 
> bit more.
> I found nothing interesting even using GLOGv=3.
> Maybe it's a bad usage of isolators that triggers this issue? If that's the 
> case, then at least a documentation update should be done.
> Let me know if more information is needed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)