[jira] [Created] (MESOS-7894) Mesos Executor UI - Disk:Used Field isn't populated with Docker Container Runtime
Justin Lee created MESOS-7894:
---------------------------------

             Summary: Mesos Executor UI - Disk:Used Field isn't populated with Docker Container Runtime
                 Key: MESOS-7894
                 URL: https://issues.apache.org/jira/browse/MESOS-7894
             Project: Mesos
          Issue Type: Bug
    Affects Versions: 1.2.2
         Environment: DC/OS 1.9.2 (CentOS 7.3, Docker 1.13.1, Mesos 1.2.2, Marathon 1.4.5)
            Reporter: Justin Lee
            Priority: Minor

If you use the Docker container runtime, the 'Disk' 'Used' field never gets populated in the Mesos UI (on the executor/task page).

Steps to Reproduce: in DC/OS 1.9.2, deploy two apps:

{code:javascript}
{
  "id": "/dummy-disk-docker",
  "cmd": "dd if=/dev/zero of=$MESOS_SANDBOX/testfile bs=128M count=1; tail -f /dev/null",
  "instances": 1,
  "cpus": 0.1,
  "mem": 256,
  "disk": 150,
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "alpine"
    }
  }
}
{code}

{code:javascript}
{
  "id": "/dummy-disk-ucr",
  "cmd": "dd if=/dev/zero of=$MESOS_SANDBOX/testfile bs=128M count=1; tail -f /dev/null",
  "instances": 1,
  "cpus": 0.1,
  "mem": 256,
  "disk": 150,
  "container": {
    "type": "MESOS",
    "docker": {
      "image": "alpine"
    }
  }
}
{code}

Wait for them to deploy. Then navigate to the Mesos UI and go to the executor/task page for the two tasks.

On the UCR task, the "Used Disk" field should eventually populate with 128 MB (the size of the dummy file). The same field on the Docker task never gets populated.

Both containers write to the same location on the agent filesystem (/var/lib/mesos/slave/slaves//frameworks//executors//runs/latest), but only one reports the data through the UI.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Commented] (MESOS-7744) Mesos Agent Sends TASK_KILL status update to Master, and still launches task
[ https://issues.apache.org/jira/browse/MESOS-7744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128608#comment-16128608 ]

Alexander Rukletsov commented on MESOS-7744:
--------------------------------------------

And dropping 1.1.3 target.

> Mesos Agent Sends TASK_KILL status update to Master, and still launches task
> ----------------------------------------------------------------------------
>
>                 Key: MESOS-7744
>                 URL: https://issues.apache.org/jira/browse/MESOS-7744
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.0.1
>            Reporter: Sargun Dhillon
>            Assignee: Benjamin Mahler
>            Priority: Critical
>              Labels: reliability
>
> We sometimes launch jobs, and cancel them in ~7 seconds if we don't get a TASK_STARTING back from the agent. Under certain conditions this can result in Mesos losing track of the task. The interesting chunk of the logs is here:
> {code}
> Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c mesos-slave[4290]: I0629 23:22:26.951799 5171 slave.cpp:1495] Got assigned task Titus-7590548-worker-0-4476 for framework TitusFramework
> Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c mesos-slave[4290]: I0629 23:22:26.952251 5171 slave.cpp:1614] Launching task Titus-7590548-worker-0-4476 for framework TitusFramework
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c mesos-slave[4290]: I0629 23:22:37.484611 5171 slave.cpp:1853] Queuing task 'Titus-7590548-worker-0-4476' for executor 'docker-executor' of framework TitusFramework at executor(1)@100.66.11.10:17707
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c mesos-slave[4290]: I0629 23:22:37.487876 5171 slave.cpp:2035] Asked to kill task Titus-7590548-worker-0-4476 of framework TitusFramework
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c mesos-slave[4290]: I0629 23:22:37.488994 5171 slave.cpp:3211] Handling status update TASK_KILLED (UUID: 898215d6-a244-4dbe-bc9c-878a22d36ea4) for task Titus-7590548-worker-0-4476 of framework TitusFramework from @0.0.0.0:0
> Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c mesos-slave[4290]: I0629 23:22:37.490603 5171 slave.cpp:2005] Sending queued task 'Titus-7590548-worker-0-4476' to executor 'docker-executor' of framework TitusFramework at executor(1)@100.66.11.10:17707
> {code}
> In our executor, we see that the launch message arrives after the master has already gotten the kill update. We then send non-terminal state updates to the agent, and yet it doesn't forward these to our framework. We're using a custom executor based on the older mesos-go bindings.
[jira] [Updated] (MESOS-7744) Mesos Agent Sends TASK_KILL status update to Master, and still launches task
[ https://issues.apache.org/jira/browse/MESOS-7744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexander Rukletsov updated MESOS-7744:
---------------------------------------
    Target Version/s: 1.2.3, 1.3.2, 1.5.0, 1.4.1  (was: 1.1.3, 1.2.3, 1.3.2, 1.5.0)
[jira] [Updated] (MESOS-7894) Mesos Executor UI - Disk:Used Field isn't populated with Docker Container Runtime
[ https://issues.apache.org/jira/browse/MESOS-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Justin Lee updated MESOS-7894:
------------------------------
    Attachment: UCR on left, Docker on right.png
[jira] [Created] (MESOS-7896) Use std::error_code for reporting platform-dependent errors
Ilya Pronin created MESOS-7896:
----------------------------------

             Summary: Use std::error_code for reporting platform-dependent errors
                 Key: MESOS-7896
                 URL: https://issues.apache.org/jira/browse/MESOS-7896
             Project: Mesos
          Issue Type: Improvement
            Reporter: Ilya Pronin
            Priority: Minor

It may be useful to return an error code from various functions to be able to distinguish different kinds of errors, e.g. to be able to ignore {{ENOENT}} from {{unlink()}}. This can be achieved by returning {{Try}}, but that is not portable. Since C++11, the STL has {{std::error_code}}, which hides a platform-dependent error code behind a portable error condition. We can use it for error reporting.
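A minimal sketch of the idea behind the ticket, using hypothetical helper names (tryUnlink, removeIfExists are illustrative, not Mesos API): the platform errno is wrapped in a std::error_code, and the caller compares it against a portable std::errc condition instead of a raw ENOENT.

```cpp
#include <cerrno>
#include <system_error>
#include <unistd.h>

// Hypothetical wrapper: surface unlink() failures as a portable
// std::error_code instead of a raw platform errno.
std::error_code tryUnlink(const char* path) {
  if (::unlink(path) != 0) {
    // Wrap the platform errno; the generic category lets callers compare
    // against portable std::errc conditions.
    return std::error_code(errno, std::generic_category());
  }
  return std::error_code();  // Default-constructed code means success.
}

// The ticket's motivating case: ignore "file does not exist".
bool removeIfExists(const char* path) {
  const std::error_code ec = tryUnlink(path);
  if (ec && ec != std::errc::no_such_file_or_directory) {
    return false;  // A real error, not just a missing file.
  }
  return true;  // Removed, or was already absent.
}
```

The comparison `ec != std::errc::no_such_file_or_directory` is portable: it matches whatever platform-specific code maps to that condition, which is exactly what returning a raw errno cannot offer.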
[jira] [Assigned] (MESOS-7791) subprocess' childMain using ABORT when encountering user errors
[ https://issues.apache.org/jira/browse/MESOS-7791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrei Budnik reassigned MESOS-7791:
------------------------------------
    Assignee: Andrei Budnik

> subprocess' childMain using ABORT when encountering user errors
> ---------------------------------------------------------------
>
>                 Key: MESOS-7791
>                 URL: https://issues.apache.org/jira/browse/MESOS-7791
>             Project: Mesos
>          Issue Type: Bug
>          Components: libprocess
>    Affects Versions: 1.4.0
>            Reporter: Benjamin Bannier
>            Assignee: Andrei Budnik
>              Labels: mesosphere, tech-debt
>
> In {{process/posix/subprocess.hpp}}'s {{childMain}} we exit with {{ABORT}} when there was a user error,
> {noformat}
> ABORT: (/pkg/src/mesos/3rdparty/libprocess/include/process/posix/subprocess.hpp:195): Failed to os::execvpe on path '/SOME/PATH': Argument list too long
> {noformat}
> We abort here instead of simply {{_exit}}'ing and letting the user know that we couldn't deal with the given arguments.
> Abort can potentially dump core, and since this abort happens before the {{execvpe}}, the process image can be large (e.g., >300 MB), which could quickly fill up a lot of disk space.
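A small sketch of the fix the ticket asks for (simplified, not the actual libprocess code; the path used is deliberately nonexistent): when exec fails in the forked child, report the error and call _exit() rather than abort(), so the possibly very large pre-exec process image is never dumped as a core.

```cpp
#include <cstdio>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Fork, try to exec 'path', and return the child's exit status.
int spawnAndWait(const char* path) {
  pid_t pid = fork();
  if (pid == 0) {
    char* const argv[] = {const_cast<char*>(path), nullptr};
    ::execv(path, argv);
    // execv only returns on failure. abort() here could dump a huge core
    // (the whole parent image is still mapped); _exit() terminates
    // immediately, without a core dump and without atexit handlers.
    perror("Failed to exec");
    _exit(1);
  }
  int status = 0;
  waitpid(pid, &status, 0);
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With _exit(), the parent observes a normal exit with a nonzero status instead of a SIGABRT-terminated child, and no disk space is spent on cores.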
[jira] [Commented] (MESOS-7895) ZK session timeout is unconfigurable in agent and scheduler drivers
[ https://issues.apache.org/jira/browse/MESOS-7895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128827#comment-16128827 ]

Ilya Pronin commented on MESOS-7895:
------------------------------------

[~vinodkone] can you shepherd this, please? Review requests for agents:
https://reviews.apache.org/r/61689/
https://reviews.apache.org/r/61690/

> ZK session timeout is unconfigurable in agent and scheduler drivers
> -------------------------------------------------------------------
>
>                 Key: MESOS-7895
>                 URL: https://issues.apache.org/jira/browse/MESOS-7895
>             Project: Mesos
>          Issue Type: Improvement
>    Affects Versions: 1.3.0
>            Reporter: Ilya Pronin
>            Assignee: Ilya Pronin
>            Priority: Minor
>
> {{ZooKeeperMasterDetector}} in agents and scheduler drivers uses the default ZK session timeout (10 secs). This timeout may have to be increased to cope with long ZK upgrades or ZK GC pauses (with local ZK sessions these can cause lots of {{TASK_LOST}}, because sessions expire on disconnection after {{session_timeout * 2 / 3}}).
[jira] [Updated] (MESOS-7894) Mesos Executor UI - Disk:Used Field isn't populated with Docker Container Runtime
[ https://issues.apache.org/jira/browse/MESOS-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Justin Lee updated MESOS-7894:
------------------------------
    Attachment: UCR on left, Docker on right..png
[jira] [Updated] (MESOS-7894) Mesos Executor UI - Disk:Used Field isn't populated with Docker Container Runtime
[ https://issues.apache.org/jira/browse/MESOS-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Justin Lee updated MESOS-7894:
------------------------------
    Description:

If you use the Docker container runtime, the 'Disk' 'Used' field never gets populated in the Mesos UI (on the executor/task page).

Steps to Reproduce: in DC/OS 1.9.2, deploy two apps:

{code:javascript}
{
  "id": "/dummy-disk-docker",
  "cmd": "dd if=/dev/zero of=$MESOS_SANDBOX/testfile bs=128M count=1; tail -f /dev/null",
  "instances": 1,
  "cpus": 0.1,
  "mem": 256,
  "disk": 150,
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "alpine"
    }
  }
}
{code}

{code:javascript}
{
  "id": "/dummy-disk-ucr",
  "cmd": "dd if=/dev/zero of=$MESOS_SANDBOX/testfile bs=128M count=1; tail -f /dev/null",
  "instances": 1,
  "cpus": 0.1,
  "mem": 256,
  "disk": 150,
  "container": {
    "type": "MESOS",
    "docker": {
      "image": "alpine"
    }
  }
}
{code}

Wait for them to deploy. Then navigate to the Mesos UI and go to the executor/task page for the two tasks.

On the UCR task, the "Used Disk" field should eventually populate with 128 MB (the size of the dummy file). The same field on the Docker task never gets populated.

Both containers write to the same location on the agent filesystem ({{/var/lib/mesos/slave/slaves//frameworks//executors//runs/latest}}), but only one reports the data through the UI.
[jira] [Updated] (MESOS-7894) Mesos Executor UI - Disk:Used Field isn't populated with Docker Container Runtime
[ https://issues.apache.org/jira/browse/MESOS-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Justin Lee updated MESOS-7894:
------------------------------
    Attachment: (was: UCR on left, Docker on right..png)
[jira] [Created] (MESOS-7895) ZK session timeout is unconfigurable in agent and scheduler drivers
Ilya Pronin created MESOS-7895:
----------------------------------

             Summary: ZK session timeout is unconfigurable in agent and scheduler drivers
                 Key: MESOS-7895
                 URL: https://issues.apache.org/jira/browse/MESOS-7895
             Project: Mesos
          Issue Type: Improvement
    Affects Versions: 1.3.0
            Reporter: Ilya Pronin
            Assignee: Ilya Pronin
            Priority: Minor

{{ZooKeeperMasterDetector}} in agents and scheduler drivers uses the default ZK session timeout (10 secs). This timeout may have to be increased to cope with long ZK upgrades or ZK GC pauses (with local ZK sessions these can cause lots of {{TASK_LOST}}, because sessions expire on disconnection after {{session_timeout * 2 / 3}}).
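The timing in the description can be made concrete with a short sketch (the 10-second figure is the default mentioned above; the helper name is illustrative, not Mesos code): a disconnected client with a local ZK session gives up well before the full session timeout.

```cpp
#include <chrono>

// The hard-coded default session timeout mentioned in the ticket.
constexpr std::chrono::milliseconds kDefaultSessionTimeout{10000};

// With local ZK sessions, the session is treated as expired after
// session_timeout * 2 / 3 of disconnection (integer milliseconds).
constexpr std::chrono::milliseconds localExpiry(std::chrono::milliseconds t) {
  return t * 2 / 3;
}
```

With the 10-second default this yields roughly 6.7 seconds, which is why long ZK upgrades or GC pauses can translate into a burst of TASK_LOST; raising the timeout widens that window.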
[jira] [Commented] (MESOS-7894) Mesos UI - Disk:Used Field isn't populated with Docker Container Runtime.
[ https://issues.apache.org/jira/browse/MESOS-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128868#comment-16128868 ]

Joseph Wu commented on MESOS-7894:
----------------------------------

This is due to how the disk resource field is populated on the agent's resource statistics endpoint. The MesosContainerizer (UCR) fetches usage metrics from each isolator (which includes a {{disk/du}} isolator reporting disk usage) and merges these results together, giving you the 128 MB you expect. The DockerContainerizer just reports some cgroups stats, which do not include disk info.
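The merge behavior described in the comment above can be sketched as follows (simplified hypothetical types, not the real ResourceStatistics protobuf or containerizer code): each isolator reports only the fields it knows, and the containerizer folds them together.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Simplified stand-in for per-container usage statistics: unset optionals
// model protobuf fields an isolator never filled in.
struct Stats {
  std::optional<uint64_t> memRssBytes;
  std::optional<uint64_t> diskUsedBytes;
};

// Fold per-isolator reports into one result; later reports win for any
// field they actually set.
Stats merge(const std::vector<Stats>& perIsolator) {
  Stats result;
  for (const Stats& s : perIsolator) {
    if (s.memRssBytes) result.memRssBytes = s.memRssBytes;
    if (s.diskUsedBytes) result.diskUsedBytes = s.diskUsedBytes;
  }
  return result;
}
```

Under UCR, the inputs include both a cgroups-style report and a disk/du report, so diskUsedBytes ends up set; the Docker path feeds in only cgroups-style stats, so diskUsedBytes stays unset, which surfaces as the never-populated "Used Disk" field in the UI.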
[jira] [Commented] (MESOS-7895) ZK session timeout is unconfigurable in agent and scheduler drivers
[ https://issues.apache.org/jira/browse/MESOS-7895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128997#comment-16128997 ]

Vinod Kone commented on MESOS-7895:
-----------------------------------

Sorry, I do not have time :( Maybe [~xujyan] can?
[jira] [Commented] (MESOS-7714) Fix agent downgrade for reservation refinement
[ https://issues.apache.org/jira/browse/MESOS-7714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16129728#comment-16129728 ]

Adam B commented on MESOS-7714:
-------------------------------

Downgrading from blocker because [~mcypark] says "I think we'll have to ship without it". Please retarget to 1.4.1 and/or 1.5.0 so [~karya] and [~anandmazumdar] can cut 1.4.0-rc1.

> Fix agent downgrade for reservation refinement
> ----------------------------------------------
>
>                 Key: MESOS-7714
>                 URL: https://issues.apache.org/jira/browse/MESOS-7714
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Michael Park
>            Assignee: Michael Park
>            Priority: Critical
>
> The agent code only partially supports downgrading of an agent correctly. The checkpointed resources are done correctly, but the resources within the {{SlaveInfo}} message, as well as tasks and executors, also need to be downgraded correctly and converted back on recovery.
[jira] [Updated] (MESOS-7100) Missing AGENT_REMOVED event in event stream
[ https://issues.apache.org/jira/browse/MESOS-7100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam B updated MESOS-7100:
--------------------------
    Target Version/s:   (was: 1.4.0)

> Missing AGENT_REMOVED event in event stream
> -------------------------------------------
>
>                 Key: MESOS-7100
>                 URL: https://issues.apache.org/jira/browse/MESOS-7100
>             Project: Mesos
>          Issue Type: Bug
>          Components: HTTP API
>    Affects Versions: 1.1.0
>            Reporter: Haralds Ulmanis
>            Priority: Minor
>              Labels: mesosphere
>
> I'm playing with the event stream via HTTP endpoints. I get all events - SUBSCRIBED, TASK_ADDED, TASK_UPDATED, AGENT_ADDED - except AGENT_REMOVED.
> What I do: just stop an agent or terminate the server (if cloud).
> What I expect: once it disappears from the agent list (in the Mesos UI), to get an AGENT_REMOVED event.
> Not sure about the internals; maybe that is not the correct event and agents get removed after some period if they do not come back up. But in general, some event to indicate that an agent went offline and is not available would be good. If AGENT_REMOVED and AGENT_ADDED are one-time events, maybe something like AGENT_CONNECTED/RECONNECTED and AGENT_LEAVE events would be great.
[jira] [Updated] (MESOS-7707) Compile with newer Boost versions
[ https://issues.apache.org/jira/browse/MESOS-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam B updated MESOS-7707:
--------------------------
    Fix Version/s:   (was: 1.4.0)

> Compile with newer Boost versions
> ---------------------------------
>
>                 Key: MESOS-7707
>                 URL: https://issues.apache.org/jira/browse/MESOS-7707
>             Project: Mesos
>          Issue Type: Bug
>          Components: c++ api
>    Affects Versions: 1.4.0
>            Reporter: David Carlier
>            Priority: Minor
>
> Hello, first look at the mesos codebase: I tried to compile with Boost 1.6.4 shipped with the system, thus https://reviews.apache.org/r/60356/
> Hope it's useful.
[jira] [Updated] (MESOS-7707) Compile with newer Boost versions
[ https://issues.apache.org/jira/browse/MESOS-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam B updated MESOS-7707:
--------------------------
    Affects Version/s: 1.4.0
[jira] [Updated] (MESOS-7707) Compile with newer Boost versions
[ https://issues.apache.org/jira/browse/MESOS-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam B updated MESOS-7707:
--------------------------
    Target Version/s:   (was: 1.4.0)
[jira] [Updated] (MESOS-7714) Fix agent downgrade for reservation refinement
[ https://issues.apache.org/jira/browse/MESOS-7714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam B updated MESOS-7714:
--------------------------
    Priority: Critical  (was: Blocker)
[jira] [Commented] (MESOS-6641) Remove deprecated hooks from our module API.
[ https://issues.apache.org/jira/browse/MESOS-6641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16129735#comment-16129735 ]

Adam B commented on MESOS-6641:
-------------------------------

Removing target version until we have "Accepted" this ticket, and scoped/prioritized/assigned it for a release.

> Remove deprecated hooks from our module API.
> --------------------------------------------
>
>                 Key: MESOS-6641
>                 URL: https://issues.apache.org/jira/browse/MESOS-6641
>             Project: Mesos
>          Issue Type: Improvement
>          Components: modules
>            Reporter: Till Toenshoff
>            Priority: Minor
>              Labels: deprecation, hooks, tech-debt
>
> By now we have at least one deprecated hook in our modules API, which is {{slavePreLaunchDockerHook}}. There is a new one coming in now which deprecates {{slavePreLaunchDockerEnvironmentDecorator}}.
> We need to actually remove those deprecations while making the community aware - this ticket is meant for tracking this.
[jira] [Commented] (MESOS-7100) Missing AGENT_REMOVED event in event stream
[ https://issues.apache.org/jira/browse/MESOS-7100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16129732#comment-16129732 ]

Adam B commented on MESOS-7100:
-------------------------------

Removing target version until we have "Accepted" this ticket, and scoped/prioritized/assigned it for a release.
[jira] [Commented] (MESOS-7233) Use varint comparator in replica log
[ https://issues.apache.org/jira/browse/MESOS-7233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129734#comment-16129734 ] Adam B commented on MESOS-7233: --- Removing target version until we have "Accepted" this ticket, and scoped/prioritized it for a release. It appears there is concern about the approach taken in the 5-month-old review. > Use varint comparator in replica log > > > Key: MESOS-7233 > URL: https://issues.apache.org/jira/browse/MESOS-7233 > Project: Mesos > Issue Type: Improvement > Components: replicated log >Reporter: Tomasz Janiszewski >Assignee: Tomasz Janiszewski >Priority: Minor > > Since the bug discussed at > https://groups.google.com/forum/#\!topic/leveldb/F-rDkWiQm6c > has been fixed, after upgrading leveldb to 1.19 we could replace the > default byte-wise comparator with a varint comparator. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
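For illustration, the core of such a comparator can be sketched without pulling in leveldb itself. The function names and the LEB128-style encoding below are assumptions for the sketch; leveldb's actual extension point is the {{leveldb::Comparator}} interface ({{Compare}}, {{Name}}, {{FindShortestSeparator}}, {{FindShortSuccessor}}), which would wrap comparison logic like this:

```cpp
#include <cstdint>
#include <string>

// Encode a value as a LEB128-style varint: 7 payload bits per byte,
// most significant bit set on all but the last byte.
std::string encodeVarint(uint64_t value) {
  std::string out;
  do {
    unsigned char byte = value & 0x7f;
    value >>= 7;
    if (value != 0) byte |= 0x80;
    out.push_back(static_cast<char>(byte));
  } while (value != 0);
  return out;
}

// Decode a varint from the front of `key`.
uint64_t decodeVarint(const std::string& key) {
  uint64_t value = 0;
  int shift = 0;
  for (unsigned char byte : key) {
    value |= static_cast<uint64_t>(byte & 0x7f) << shift;
    if ((byte & 0x80) == 0) break;
    shift += 7;
  }
  return value;
}

// The core of a varint comparator: order keys by their decoded
// integer value rather than byte-wise. Returns <0, 0, >0 like memcmp.
int varintCompare(const std::string& a, const std::string& b) {
  const uint64_t x = decodeVarint(a);
  const uint64_t y = decodeVarint(b);
  return x < y ? -1 : (x == y ? 0 : 1);
}
```

The point of swapping comparators is that byte-wise order disagrees with numeric order for little-endian varints: 129 encodes as {{0x81 0x01}} and 256 as {{0x80 0x02}}, so a byte-wise comparator sorts 256 before 129.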
[jira] [Updated] (MESOS-6641) Remove deprecated hooks from our module API.
[ https://issues.apache.org/jira/browse/MESOS-6641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-6641: -- Target Version/s: (was: 1.4.0) > Remove deprecated hooks from our module API. > > > Key: MESOS-6641 > URL: https://issues.apache.org/jira/browse/MESOS-6641 > Project: Mesos > Issue Type: Improvement > Components: modules >Reporter: Till Toenshoff >Priority: Minor > Labels: deprecation, hooks, tech-debt > > By now we have at least one deprecated hook in our modules API which is > {{slavePreLaunchDockerHook}}. > There is a new one coming in now which is deprecating > {{slavePreLaunchDockerEnvironmentDecorator}}. > We need to actually remove those deprecations while making the community > aware - this ticket is meant for tracking this. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7233) Use varint comparator in replica log
[ https://issues.apache.org/jira/browse/MESOS-7233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-7233: -- Target Version/s: (was: 1.4.0) > Use varint comparator in replica log > > > Key: MESOS-7233 > URL: https://issues.apache.org/jira/browse/MESOS-7233 > Project: Mesos > Issue Type: Improvement > Components: replicated log >Reporter: Tomasz Janiszewski >Assignee: Tomasz Janiszewski >Priority: Minor > > Since bug discussed at > https://groups.google.com/forum/#\!topic/leveldb/F-rDkWiQm6c > was fixed. After upgrading leveldb to 1.19 we could replace > default byte-wise comparator with varint comparator. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-7007) filesystem/shared and --default_container_info broken since 1.1
[ https://issues.apache.org/jira/browse/MESOS-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129526#comment-16129526 ] R.B. Boyer commented on MESOS-7007: --- Also from my parsing of the master branch today (`git:3587ca14c81c61949239698539df3e67774c9071`) the affected code moved to `src/slave/slave.cpp` and still has the same underlying bug: {code} // Add the default container info to the executor info. // TODO(jieyu): Rename the flag to be default_mesos_container_info. if (!executorInfo_.has_container() && flags.default_container_info.isSome()) { executorInfo_.mutable_container()->CopyFrom( flags.default_container_info.get()); } // Bundle all the container launch fields together. ContainerConfig containerConfig; containerConfig.mutable_executor_info()->CopyFrom(executorInfo_); containerConfig.mutable_command_info()->CopyFrom(executorInfo_.command()); containerConfig.mutable_resources()->CopyFrom(executorInfo_.resources()); containerConfig.set_directory(executor->directory); if (executor->user.isSome()) { containerConfig.set_user(executor->user.get()); } if (executor->isCommandExecutor()) { if (taskInfo.isSome()) { containerConfig.mutable_task_info()->CopyFrom(taskInfo.get()); if (taskInfo.get().has_container()) { containerConfig.mutable_container_info() ->CopyFrom(taskInfo.get().container()); } } } else { if (executorInfo_.has_container()) { containerConfig.mutable_container_info() ->CopyFrom(executorInfo_.container()); } } {code} > filesystem/shared and --default_container_info broken since 1.1 > --- > > Key: MESOS-7007 > URL: https://issues.apache.org/jira/browse/MESOS-7007 > Project: Mesos > Issue Type: Bug > Components: agent >Affects Versions: 1.1.0, 1.2.0 >Reporter: Pierre Cheynier >Assignee: Chun-Hung Hsiao > Labels: storage > > I face this issue, that prevent me to upgrade to 1.1.0 (and the change was > consequently introduced in this version): > I'm using default_container_info to mount a /tmp volume in the container's > 
mount namespace from its current sandbox, meaning that each container has a > dedicated /tmp, thanks to the {{filesystem/shared}} isolator. > I noticed through our automation pipeline that integration tests were failing > and found that this is because the contents of /tmp (the one from the host!) > are trashed each time a container is created. > Here is my setup: > * > {{--isolation='cgroups/cpu,cgroups/mem,namespaces/pid,*disk/du,filesystem/shared,filesystem/linux*,docker/runtime'}} > * > {{--default_container_info='\{"type":"MESOS","volumes":\[\{"host_path":"tmp","container_path":"/tmp","mode":"RW"\}\]\}'}} > I discovered this issue in the early days of 1.1 (end of Nov, spoke with > someone on Slack), but unfortunately had no time to dig into the symptoms a > bit more. > I found nothing interesting even using GLOG_v=3. > Maybe it's a bad usage of isolators that triggers this issue? If that's the > case, then at least a documentation update should be done. > Let me know if more information is needed. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7705) Reconsider restricting the resource format for frameworks.
[ https://issues.apache.org/jira/browse/MESOS-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Park updated MESOS-7705: Priority: Major (was: Blocker) > Reconsider restricting the resource format for frameworks. > -- > > Key: MESOS-7705 > URL: https://issues.apache.org/jira/browse/MESOS-7705 > Project: Mesos > Issue Type: Bug > Components: master >Reporter: Michael Park >Assignee: Michael Park > > We output the "endpoint" format through the endpoints > for backward compatibility of external tooling. A framework should be > able to use the result of an endpoint and pass it back to Mesos, > since the result was produced by Mesos. This is especially applicable > to the V1 API. We also allow the "pre-reservation-refinement" format > because existing "resources files" are written in that format, and > they should still be usable without modification. > This is probably too flexible however, since a framework without > a RESERVATION_REFINEMENT capability could make refined reservations > using the "post-reservation-refinement" format, although they wouldn't be > offered such resources. It still seems undesirable if anyone were to > run into it, and we should consider adding sensible restrictions. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7705) Reconsider restricting the resource format for frameworks.
[ https://issues.apache.org/jira/browse/MESOS-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Park updated MESOS-7705: Target Version/s: 1.5.0, 1.4.1 (was: 1.4.0) > Reconsider restricting the resource format for frameworks. > -- > > Key: MESOS-7705 > URL: https://issues.apache.org/jira/browse/MESOS-7705 > Project: Mesos > Issue Type: Bug > Components: master >Reporter: Michael Park >Assignee: Michael Park >Priority: Blocker > > We output the "endpoint" format through the endpoints > for backward compatibility of external tooling. A framework should be > able to use the result of an endpoint and pass it back to Mesos, > since the result was produced by Mesos. This is especially applicable > to the V1 API. We also allow the "pre-reservation-refinement" format > because existing "resources files" are written in that format, and > they should still be usable without modification. > This is probably too flexible however, since a framework without > a RESERVATION_REFINEMENT capability could make refined reservations > using the "post-reservation-refinement" format, although they wouldn't be > offered such resources. It still seems undesirable if anyone were to > run into it, and we should consider adding sensible restrictions. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7865) Agent may process a kill task and still launch the task.
[ https://issues.apache.org/jira/browse/MESOS-7865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7865: --- Priority: Critical (was: Blocker) > Agent may process a kill task and still launch the task. > > > Key: MESOS-7865 > URL: https://issues.apache.org/jira/browse/MESOS-7865 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Critical > > Based on the investigation of MESOS-7744, the agent has a race in which > "queued" tasks can still be launched after the agent has processed a kill > task for them. This race was introduced when {{Slave::statusUpdate}} was made > asynchronous: > (1) {{Slave::__run}} completes, task is now within {{Executor::queuedTasks}} > (2) {{Slave::killTask}} locates the executor based on the task ID residing in > queuedTasks, calls {{Slave::statusUpdate()}} with {{TASK_KILLED}} > (3) {{Slave::___run}} assumes that killed tasks have been removed from > {{Executor::queuedTasks}}, but this now occurs asynchronously in > {{Slave::_statusUpdate}}. So, the executor still sees the queued task and > delivers it and adds the task to {{Executor::launchedTasks}}. > (4) {{Slave::_statusUpdate}} runs, removes the task from > {{Executor::launchedTasks}} and adds it to {{Executor::terminatedTasks}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7783) Framework might not receive status update when a just launched task is killed immediately
[ https://issues.apache.org/jira/browse/MESOS-7783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7783: --- Priority: Critical (was: Blocker) > Framework might not receive status update when a just launched task is killed > immediately > - > > Key: MESOS-7783 > URL: https://issues.apache.org/jira/browse/MESOS-7783 > Project: Mesos > Issue Type: Bug > Components: agent >Affects Versions: 1.2.0 >Reporter: Benjamin Bannier >Assignee: Benjamin Mahler >Priority: Critical > Labels: reliability > Attachments: GroupDeployIntegrationTest.log.zip, logs > > > Our Marathon team is seeing issues in their integration test suite when > Marathon gets stuck in an infinite loop trying to kill a just launched task. > In their test a task is launched and then immediately killed; the framework > does not, for example, wait for any task status update. > In this case the launch and kill messages arrive at the agent in the correct > order, but neither the launch nor the kill path in the agent reaches the point > where a status update is sent to the framework. Since the framework has seen > no status update on the task, it re-triggers the kill, causing an infinite loop. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7863) Agent may drop pending kill task status updates.
[ https://issues.apache.org/jira/browse/MESOS-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7863: --- Priority: Critical (was: Blocker) > Agent may drop pending kill task status updates. > > > Key: MESOS-7863 > URL: https://issues.apache.org/jira/browse/MESOS-7863 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Critical > > Currently there is an assumption that when a pending task is killed, the > framework will still be stored in the agent. However, this assumption can be > violated in two cases: > # Another pending task was killed and we removed the framework in > 'Slave::run' thinking it was idle, because pending tasks were empty (we > remove from pending tasks when processing the kill). (MESOS-7783 is an > example instance of this). > # The last executor terminated without tasks to send terminal updates for, or > the last terminated executor received its last acknowledgement. At this > point, we remove the framework thinking there were no pending tasks if the > task was killed (removed from pending). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7744) Mesos Agent Sends TASK_KILL status update to Master, and still launches task
[ https://issues.apache.org/jira/browse/MESOS-7744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7744: --- Priority: Critical (was: Blocker) > Mesos Agent Sends TASK_KILL status update to Master, and still launches task > > > Key: MESOS-7744 > URL: https://issues.apache.org/jira/browse/MESOS-7744 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.0.1 >Reporter: Sargun Dhillon >Assignee: Benjamin Mahler >Priority: Critical > Labels: reliability > > We sometimes launch jobs, and cancel them in ~7 seconds, if we don't get a > TASK_STARTING back from the agent. Under certain conditions it can result in > Mesos losing track of the task. The chunk of the logs which is interesting is > here: > {code} > Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:26.951799 5171 slave.cpp:1495] Got assigned > task Titus-7590548-worker-0-4476 for framework TitusFramework > Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:26.952251 5171 slave.cpp:1614] Launching task > Titus-7590548-worker-0-4476 for framework TitusFramework > Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:37.484611 5171 slave.cpp:1853] Queuing task > ‘Titus-7590548-worker-0-4476’ for executor ‘docker-executor’ of framework > TitusFramework at executor(1)@100.66.11.10:17707 > Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:37.487876 5171 slave.cpp:2035] Asked to kill > task Titus-7590548-worker-0-4476 of framework TitusFramework > Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:37.488994 5171 slave.cpp:3211] Handling > status update TASK_KILLED (UUID: 898215d6-a244-4dbe-bc9c-878a22d36ea4) for > task Titus-7590548-worker-0-4476 of framework TitusFramework from @0.0.0.0:0 > Jun 29 23:22:37 
titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:37.490603 5171 slave.cpp:2005] Sending queued > task ‘Titus-7590548-worker-0-4476’ to executor ‘docker-executor’ of framework > TitusFramework at executor(1)@100.66.11.10:17707{ > {code} > In our executor, we see that the launch message arrives after the master has > already gotten the kill update. We then send non-terminal state updates to > the agent, and yet it doesn't forward these to our framework. We're using a > custom executor which is based on the older mesos-go bindings. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-7897) Implement linearizable concurrent queue to use as event queue in libprocess
Dario Rexin created MESOS-7897: -- Summary: Implement linearizable concurrent queue to use as event queue in libprocess Key: MESOS-7897 URL: https://issues.apache.org/jira/browse/MESOS-7897 Project: Mesos Issue Type: Improvement Components: libprocess Reporter: Dario Rexin Assignee: Dario Rexin We currently use the moodycamel ConcurrentQueue as the backing queue for the lock-free mailboxes in libprocess. Even though this queue is extremely fast, it's not strictly linearizable, which is why there is some additional code to re-establish linearizability. This makes the code harder to maintain and also eliminates most of the performance benefits of this queue. A simple concurrent linked queue implementation should yield better results and simplify the code of the EventQueue class. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
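Linearizability here means that every enqueue and dequeue appears to take effect atomically at a single point in time, so all threads agree on one total order of operations. A minimal sketch of a queue with that property follows; it is mutex-based rather than lock-free, and the {{LinearizableQueue}} name is illustrative, not the proposed {{EventQueue}} replacement itself:

```cpp
#include <mutex>
#include <optional>
#include <queue>

// A trivially linearizable concurrent FIFO queue: each operation runs
// entirely inside one critical section, so the linearization point is
// simply the moment the lock is held. A production version for the
// libprocess mailbox would use an atomic linked list instead.
template <typename T>
class LinearizableQueue {
public:
  void enqueue(T value) {
    std::lock_guard<std::mutex> lock(mutex_);
    items_.push(std::move(value));
  }

  // Returns std::nullopt when the queue is empty.
  std::optional<T> dequeue() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (items_.empty()) {
      return std::nullopt;
    }
    T value = std::move(items_.front());
    items_.pop();
    return value;
  }

private:
  std::mutex mutex_;
  std::queue<T> items_;
};
```

The contrast with moodycamel's ConcurrentQueue is that the latter only guarantees per-producer ordering, so extra sequencing code is needed on top of it to recover a single total order.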
[jira] [Created] (MESOS-7898) Implement control message type that is handled with higher priority
Dario Rexin created MESOS-7898: -- Summary: Implement control message type that is handled with higher priority Key: MESOS-7898 URL: https://issues.apache.org/jira/browse/MESOS-7898 Project: Mesos Issue Type: Improvement Reporter: Dario Rexin Assignee: Dario Rexin Until recently, libprocess was able to enqueue messages at the front of the queue, so they would be processed before the other messages. The recent changes adding concurrent queues removed this capability. One way to implement this feature while keeping the mailbox lock-free would be to use a second queue in the mailbox that is only used for control messages. When dequeuing a message, this queue would be checked first, before the regular queue is considered. A similar mailbox can be found in Akka: http://doc.akka.io/docs/akka/2.5.3/scala/mailboxes.html (UnboundedControlAwareMailbox). Since not all processes need this capability and there will be an impact on performance, we could consider making this an optional per-process configuration. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
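The proposed two-queue mailbox can be sketched as below. The {{Message}} and {{ControlAwareMailbox}} types are hypothetical stand-ins, not libprocess's real {{Event}} types, and a mutex is used only for brevity where the actual mailbox would need to stay lock-free:

```cpp
#include <mutex>
#include <optional>
#include <queue>
#include <string>

// Hypothetical message type: a payload plus a flag marking it as a
// control message that should jump ahead of regular traffic.
struct Message {
  std::string body;
  bool control;
};

// In the spirit of Akka's UnboundedControlAwareMailbox: two internal
// queues, and dequeue() drains control messages before regular ones.
class ControlAwareMailbox {
public:
  void enqueue(Message m) {
    std::lock_guard<std::mutex> lock(mutex_);
    (m.control ? control_ : regular_).push(std::move(m));
  }

  // Returns std::nullopt when both queues are empty.
  std::optional<Message> dequeue() {
    std::lock_guard<std::mutex> lock(mutex_);
    std::queue<Message>& q = !control_.empty() ? control_ : regular_;
    if (q.empty()) {
      return std::nullopt;
    }
    Message m = std::move(q.front());
    q.pop();
    return m;
  }

private:
  std::mutex mutex_;
  std::queue<Message> control_;
  std::queue<Message> regular_;
};
```

Note that this design preserves FIFO order within each priority class, which is weaker than arbitrary enqueue-at-front but sufficient for the "process control messages first" use case the ticket describes.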
[jira] [Comment Edited] (MESOS-7744) Mesos Agent Sends TASK_KILL status update to Master, and still launches task
[ https://issues.apache.org/jira/browse/MESOS-7744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129236#comment-16129236 ] Benjamin Mahler edited comment on MESOS-7744 at 8/16/17 6:41 PM: - [~karya] [~alexr] any reason to drop the targets instead of bumping them? I will re-target for 1.4.1 and 1.1.4 (or is there no 1.1.4 planned?) I realized I added 1.1.3 because the version is considered unreleased in JIRA. [~alexr] can you close that version so others don't target it? was (Author: bmahler): [~karya] [~alexr] any reason to drop the targets instead of bumping them? I will re-target for 1.4.1 and 1.1.4 (or is there no 1.1.4 planned?) > Mesos Agent Sends TASK_KILL status update to Master, and still launches task > > > Key: MESOS-7744 > URL: https://issues.apache.org/jira/browse/MESOS-7744 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.0.1 >Reporter: Sargun Dhillon >Assignee: Benjamin Mahler >Priority: Critical > Labels: reliability > > We sometimes launch jobs, and cancel them in ~7 seconds, if we don't get a > TASK_STARTING back from the agent. Under certain conditions it can result in > Mesos losing track of the task. 
The chunk of the logs which is interesting is > here: > {code} > Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:26.951799 5171 slave.cpp:1495] Got assigned > task Titus-7590548-worker-0-4476 for framework TitusFramework > Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:26.952251 5171 slave.cpp:1614] Launching task > Titus-7590548-worker-0-4476 for framework TitusFramework > Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:37.484611 5171 slave.cpp:1853] Queuing task > ‘Titus-7590548-worker-0-4476’ for executor ‘docker-executor’ of framework > TitusFramework at executor(1)@100.66.11.10:17707 > Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:37.487876 5171 slave.cpp:2035] Asked to kill > task Titus-7590548-worker-0-4476 of framework TitusFramework > Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:37.488994 5171 slave.cpp:3211] Handling > status update TASK_KILLED (UUID: 898215d6-a244-4dbe-bc9c-878a22d36ea4) for > task Titus-7590548-worker-0-4476 of framework TitusFramework from @0.0.0.0:0 > Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:37.490603 5171 slave.cpp:2005] Sending queued > task ‘Titus-7590548-worker-0-4476’ to executor ‘docker-executor’ of framework > TitusFramework at executor(1)@100.66.11.10:17707{ > {code} > In our executor, we see that the launch message arrives after the master has > already gotten the kill update. We then send non-terminal state updates to > the agent, and yet it doesn't forward these to our framework. We're using a > custom executor which is based on the older mesos-go bindings. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7744) Mesos Agent Sends TASK_KILL status update to Master, and still launches task
[ https://issues.apache.org/jira/browse/MESOS-7744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7744: --- Target Version/s: 1.1.3, 1.2.3, 1.3.2, 1.4.0, 1.5.0 (was: 1.2.3, 1.3.2, 1.4.0, 1.5.0) > Mesos Agent Sends TASK_KILL status update to Master, and still launches task > > > Key: MESOS-7744 > URL: https://issues.apache.org/jira/browse/MESOS-7744 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.0.1 >Reporter: Sargun Dhillon >Assignee: Benjamin Mahler >Priority: Blocker > Labels: reliability > > We sometimes launch jobs, and cancel them in ~7 seconds, if we don't get a > TASK_STARTING back from the agent. Under certain conditions it can result in > Mesos losing track of the task. The chunk of the logs which is interesting is > here: > {code} > Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:26.951799 5171 slave.cpp:1495] Got assigned > task Titus-7590548-worker-0-4476 for framework TitusFramework > Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:26.952251 5171 slave.cpp:1614] Launching task > Titus-7590548-worker-0-4476 for framework TitusFramework > Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:37.484611 5171 slave.cpp:1853] Queuing task > ‘Titus-7590548-worker-0-4476’ for executor ‘docker-executor’ of framework > TitusFramework at executor(1)@100.66.11.10:17707 > Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:37.487876 5171 slave.cpp:2035] Asked to kill > task Titus-7590548-worker-0-4476 of framework TitusFramework > Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:37.488994 5171 slave.cpp:3211] Handling > status update TASK_KILLED (UUID: 898215d6-a244-4dbe-bc9c-878a22d36ea4) for > task Titus-7590548-worker-0-4476 of framework TitusFramework 
from @0.0.0.0:0 > Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:37.490603 5171 slave.cpp:2005] Sending queued > task ‘Titus-7590548-worker-0-4476’ to executor ‘docker-executor’ of framework > TitusFramework at executor(1)@100.66.11.10:17707{ > {code} > In our executor, we see that the launch message arrives after the master has > already gotten the kill update. We then send non-terminal state updates to > the agent, and yet it doesn't forward these to our framework. We're using a > custom executor which is based on the older mesos-go bindings. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7783) Framework might not receive status update when a just launched task is killed immediately
[ https://issues.apache.org/jira/browse/MESOS-7783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7783: --- Target Version/s: 1.1.3, 1.2.3, 1.3.2, 1.4.0 (was: 1.2.3, 1.3.2, 1.4.0) > Framework might not receive status update when a just launched task is killed > immediately > - > > Key: MESOS-7783 > URL: https://issues.apache.org/jira/browse/MESOS-7783 > Project: Mesos > Issue Type: Bug > Components: agent >Affects Versions: 1.2.0 >Reporter: Benjamin Bannier >Assignee: Benjamin Mahler >Priority: Blocker > Labels: reliability > Attachments: GroupDeployIntegrationTest.log.zip, logs > > > Our Marathon team is seeing issues in their integration test suite when > Marathon gets stuck in an infinite loop trying to kill a just launched task. > In their test a task is launched and then immediately killed; the framework > does not, for example, wait for any task status update. > In this case the launch and kill messages arrive at the agent in the correct > order, but neither the launch nor the kill path in the agent reaches the point > where a status update is sent to the framework. Since the framework has seen > no status update on the task, it re-triggers the kill, causing an infinite loop. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7744) Mesos Agent Sends TASK_KILL status update to Master, and still launches task
[ https://issues.apache.org/jira/browse/MESOS-7744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7744: --- Target Version/s: 1.2.3, 1.3.2, 1.4.0, 1.5.0 (was: 1.2.3, 1.3.2, 1.5.0, 1.4.1) > Mesos Agent Sends TASK_KILL status update to Master, and still launches task > > > Key: MESOS-7744 > URL: https://issues.apache.org/jira/browse/MESOS-7744 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.0.1 >Reporter: Sargun Dhillon >Assignee: Benjamin Mahler >Priority: Blocker > Labels: reliability > > We sometimes launch jobs, and cancel them in ~7 seconds, if we don't get a > TASK_STARTING back from the agent. Under certain conditions it can result in > Mesos losing track of the task. The chunk of the logs which is interesting is > here: > {code} > Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:26.951799 5171 slave.cpp:1495] Got assigned > task Titus-7590548-worker-0-4476 for framework TitusFramework > Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:26.952251 5171 slave.cpp:1614] Launching task > Titus-7590548-worker-0-4476 for framework TitusFramework > Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:37.484611 5171 slave.cpp:1853] Queuing task > ‘Titus-7590548-worker-0-4476’ for executor ‘docker-executor’ of framework > TitusFramework at executor(1)@100.66.11.10:17707 > Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:37.487876 5171 slave.cpp:2035] Asked to kill > task Titus-7590548-worker-0-4476 of framework TitusFramework > Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:37.488994 5171 slave.cpp:3211] Handling > status update TASK_KILLED (UUID: 898215d6-a244-4dbe-bc9c-878a22d36ea4) for > task Titus-7590548-worker-0-4476 of framework TitusFramework from 
@0.0.0.0:0 > Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:37.490603 5171 slave.cpp:2005] Sending queued > task ‘Titus-7590548-worker-0-4476’ to executor ‘docker-executor’ of framework > TitusFramework at executor(1)@100.66.11.10:17707{ > {code} > In our executor, we see that the launch message arrives after the master has > already gotten the kill update. We then send non-terminal state updates to > the agent, and yet it doesn't forward these to our framework. We're using a > custom executor which is based on the older mesos-go bindings. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7744) Mesos Agent Sends TASK_KILL status update to Master, and still launches task
[ https://issues.apache.org/jira/browse/MESOS-7744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7744: --- Target Version/s: 1.1.3, 1.2.3, 1.3.2, 1.4.0 (was: 1.1.3, 1.2.3, 1.3.2, 1.4.0, 1.5.0) > Mesos Agent Sends TASK_KILL status update to Master, and still launches task > > > Key: MESOS-7744 > URL: https://issues.apache.org/jira/browse/MESOS-7744 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.0.1 >Reporter: Sargun Dhillon >Assignee: Benjamin Mahler >Priority: Blocker > Labels: reliability > > We sometimes launch jobs, and cancel them in ~7 seconds, if we don't get a > TASK_STARTING back from the agent. Under certain conditions it can result in > Mesos losing track of the task. The chunk of the logs which is interesting is > here: > {code} > Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:26.951799 5171 slave.cpp:1495] Got assigned > task Titus-7590548-worker-0-4476 for framework TitusFramework > Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:26.952251 5171 slave.cpp:1614] Launching task > Titus-7590548-worker-0-4476 for framework TitusFramework > Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:37.484611 5171 slave.cpp:1853] Queuing task > ‘Titus-7590548-worker-0-4476’ for executor ‘docker-executor’ of framework > TitusFramework at executor(1)@100.66.11.10:17707 > Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:37.487876 5171 slave.cpp:2035] Asked to kill > task Titus-7590548-worker-0-4476 of framework TitusFramework > Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:37.488994 5171 slave.cpp:3211] Handling > status update TASK_KILLED (UUID: 898215d6-a244-4dbe-bc9c-878a22d36ea4) for > task Titus-7590548-worker-0-4476 of framework TitusFramework 
from @0.0.0.0:0 > Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:37.490603 5171 slave.cpp:2005] Sending queued > task ‘Titus-7590548-worker-0-4476’ to executor ‘docker-executor’ of framework > TitusFramework at executor(1)@100.66.11.10:17707{ > {code} > In our executor, we see that the launch message arrives after the master has > already gotten the kill update. We then send non-terminal state updates to > the agent, and yet it doesn't forward these to our framework. We're using a > custom executor which is based on the older mesos-go bindings. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7863) Agent may drop pending kill task status updates.
[ https://issues.apache.org/jira/browse/MESOS-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7863: --- Target Version/s: 1.1.3, 1.2.3, 1.3.2, 1.4.0 (was: 1.2.3, 1.3.2, 1.4.0) > Agent may drop pending kill task status updates. > > > Key: MESOS-7863 > URL: https://issues.apache.org/jira/browse/MESOS-7863 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Blocker > > Currently there is an assumption that when a pending task is killed, the > framework will still be stored in the agent. However, this assumption can be > violated in two cases: > # Another pending task was killed and we removed the framework in > 'Slave::run' thinking it was idle, because pending tasks were empty (we > remove from pending tasks when processing the kill). (MESOS-7783 is an > example instance of this). > # The last executor terminated without tasks to send terminal updates for, or > the last terminated executor received its last acknowledgement. At this > point, we remove the framework thinking there were no pending tasks if the > task was killed (removed from pending). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7865) Agent may process a kill task and still launch the task.
[ https://issues.apache.org/jira/browse/MESOS-7865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7865: --- Target Version/s: 1.1.3, 1.2.3, 1.3.2, 1.4.0 > Agent may process a kill task and still launch the task. > > > Key: MESOS-7865 > URL: https://issues.apache.org/jira/browse/MESOS-7865 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Blocker > > Based on the investigation of MESOS-7744, the agent has a race in which > "queued" tasks can still be launched after the agent has processed a kill > task for them. This race was introduced when {{Slave::statusUpdate}} was made > asynchronous: > (1) {{Slave::__run}} completes, task is now within {{Executor::queuedTasks}} > (2) {{Slave::killTask}} locates the executor based on the task ID residing in > queuedTasks, calls {{Slave::statusUpdate()}} with {{TASK_KILLED}} > (3) {{Slave::___run}} assumes that killed tasks have been removed from > {{Executor::queuedTasks}}, but this now occurs asynchronously in > {{Slave::_statusUpdate}}. So, the executor still sees the queued task and > delivers it and adds the task to {{Executor::launchedTasks}}. > (4) {{Slave::_statusUpdate}} runs, removes the task from > {{Executor::launchedTasks}} and adds it to {{Executor::terminatedTasks}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
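The interleaving described in MESOS-7865 can be reproduced in a single-threaded toy model. The mailbox below stands in for the actor-style deferral that made the status-update path asynchronous; the type and method names are illustrative stand-ins for {{Slave::killTask}}, {{Slave::___run}}, and {{Slave::_statusUpdate}}, not the real code.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <set>
#include <string>

// Single-threaded toy model of the race (illustrative names only).
struct ExecutorState {
  std::set<std::string> queuedTasks;
  std::set<std::string> launchedTasks;
  std::set<std::string> terminatedTasks;
};

struct SlaveModel {
  ExecutorState executor;
  std::deque<std::function<void()>> mailbox;  // deferred continuations

  // killTask: emits TASK_KILLED, but the bookkeeping (removing the task
  // from queuedTasks) is now deferred to the mailbox.
  void killTask(const std::string& task) {
    mailbox.push_back([this, task] { statusUpdateContinuation(task); });
  }

  // The launch continuation (Slave::___run in the ticket) assumes killed
  // tasks are already gone from queuedTasks -- no longer true, so the
  // killed task is still delivered to the executor.
  void deliverQueuedTasks() {
    for (const auto& task : executor.queuedTasks) {
      executor.launchedTasks.insert(task);
    }
    executor.queuedTasks.clear();
  }

  // Runs too late: the task has already been handed to the executor.
  void statusUpdateContinuation(const std::string& task) {
    executor.queuedTasks.erase(task);
    if (executor.launchedTasks.erase(task) > 0) {
      executor.terminatedTasks.insert(task);
    }
  }

  void drainMailbox() {
    while (!mailbox.empty()) {
      std::function<void()> next = mailbox.front();
      mailbox.pop_front();
      next();
    }
  }
};
```

Running the launch continuation before draining the mailbox shows the killed task being delivered to the executor anyway, exactly the window the ticket describes.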
[jira] [Updated] (MESOS-7744) Mesos Agent Sends TASK_KILL status update to Master, and still launches task
[ https://issues.apache.org/jira/browse/MESOS-7744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7744: --- Priority: Blocker (was: Critical) > Mesos Agent Sends TASK_KILL status update to Master, and still launches task > > > Key: MESOS-7744 > URL: https://issues.apache.org/jira/browse/MESOS-7744 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.0.1 >Reporter: Sargun Dhillon >Assignee: Benjamin Mahler >Priority: Blocker > Labels: reliability > > We sometimes launch jobs, and cancel them in ~7 seconds, if we don't get a > TASK_STARTING back from the agent. Under certain conditions it can result in > Mesos losing track of the task. The chunk of the logs which is interesting is > here: > {code} > Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:26.951799 5171 slave.cpp:1495] Got assigned > task Titus-7590548-worker-0-4476 for framework TitusFramework > Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:26.952251 5171 slave.cpp:1614] Launching task > Titus-7590548-worker-0-4476 for framework TitusFramework > Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:37.484611 5171 slave.cpp:1853] Queuing task > ‘Titus-7590548-worker-0-4476’ for executor ‘docker-executor’ of framework > TitusFramework at executor(1)@100.66.11.10:17707 > Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:37.487876 5171 slave.cpp:2035] Asked to kill > task Titus-7590548-worker-0-4476 of framework TitusFramework > Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:37.488994 5171 slave.cpp:3211] Handling > status update TASK_KILLED (UUID: 898215d6-a244-4dbe-bc9c-878a22d36ea4) for > task Titus-7590548-worker-0-4476 of framework TitusFramework from @0.0.0.0:0 > Jun 29 23:22:37 
titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:37.490603 5171 slave.cpp:2005] Sending queued > task ‘Titus-7590548-worker-0-4476’ to executor ‘docker-executor’ of framework > TitusFramework at executor(1)@100.66.11.10:17707{ > {code} > In our executor, we see that the launch message arrives after the master has > already gotten the kill update. We then send non-terminal state updates to > the agent, and yet it doesn't forward these to our framework. We're using a > custom executor which is based on the older mesos-go bindings. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7865) Agent may process a kill task and still launch the task.
[ https://issues.apache.org/jira/browse/MESOS-7865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7865: --- Priority: Blocker (was: Critical) > Agent may process a kill task and still launch the task. > > > Key: MESOS-7865 > URL: https://issues.apache.org/jira/browse/MESOS-7865 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Blocker > > Based on the investigation of MESOS-7744, the agent has a race in which > "queued" tasks can still be launched after the agent has processed a kill > task for them. This race was introduced when {{Slave::statusUpdate}} was made > asynchronous: > (1) {{Slave::__run}} completes, task is now within {{Executor::queuedTasks}} > (2) {{Slave::killTask}} locates the executor based on the task ID residing in > queuedTasks, calls {{Slave::statusUpdate()}} with {{TASK_KILLED}} > (3) {{Slave::___run}} assumes that killed tasks have been removed from > {{Executor::queuedTasks}}, but this now occurs asynchronously in > {{Slave::_statusUpdate}}. So, the executor still sees the queued task and > delivers it and adds the task to {{Executor::launchedTasks}}. > (4) {{Slave::_statusUpdate}} runs, removes the task from > {{Executor::launchedTasks}} and adds it to {{Executor::terminatedTasks}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7863) Agent may drop pending kill task status updates.
[ https://issues.apache.org/jira/browse/MESOS-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7863: --- Priority: Blocker (was: Critical) > Agent may drop pending kill task status updates. > > > Key: MESOS-7863 > URL: https://issues.apache.org/jira/browse/MESOS-7863 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Blocker > > Currently there is an assumption that when a pending task is killed, the > framework will still be stored in the agent. However, this assumption can be > violated in two cases: > # Another pending task was killed and we removed the framework in > 'Slave::run' thinking it was idle, because pending tasks were empty (we > remove from pending tasks when processing the kill). (MESOS-7783 is an > example instance of this). > # The last executor terminated without tasks to send terminal updates for, or > the last terminated executor received its last acknowledgement. At this > point, we remove the framework thinking there were no pending tasks if the > task was killed (removed from pending). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7783) Framework might not receive status update when a just launched task is killed immediately
[ https://issues.apache.org/jira/browse/MESOS-7783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7783: --- Priority: Blocker (was: Critical) > Framework might not receive status update when a just launched task is killed > immediately > - > > Key: MESOS-7783 > URL: https://issues.apache.org/jira/browse/MESOS-7783 > Project: Mesos > Issue Type: Bug > Components: agent >Affects Versions: 1.2.0 >Reporter: Benjamin Bannier >Assignee: Benjamin Mahler >Priority: Blocker > Labels: reliability > Attachments: GroupDeployIntegrationTest.log.zip, logs > > > Our Marathon team are seeing issues in their integration test suite when > Marathon gets stuck in an infinite loop trying to kill a just launched task. > In their test a task launched which is immediately followed by killing the > task -- the framework does e.g., not wait for any task status update. > In this case the launch and kill messages arrive at the agent in the correct > order, but both the launch and kill paths in the agent do not reach the point > where a status update is sent to the framework. Since the framework has seen > no status update on the task it re-triggers a kill, causing an infinite loop. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
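The infinite retry loop described in MESOS-7783 is driven entirely by the missing status update. A scheduler-side sketch (a hypothetical model, not Marathon's actual code) makes the loop condition explicit: re-send KILL until a terminal update arrives, so if the agent never sends one, the retries never stop.

```cpp
#include <cassert>

// Hypothetical scheduler-side model of the kill-retry loop.
struct SchedulerModel {
  int killsSent = 0;
  bool sawTerminalUpdate = false;

  // One retry tick; returns true if another KILL had to be sent.
  bool maybeRetryKill() {
    if (sawTerminalUpdate) {
      return false;  // task reached a terminal state, stop retrying
    }
    ++killsSent;
    return true;
  }
};
```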
[jira] [Commented] (MESOS-7744) Mesos Agent Sends TASK_KILL status update to Master, and still launches task
[ https://issues.apache.org/jira/browse/MESOS-7744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129236#comment-16129236 ] Benjamin Mahler commented on MESOS-7744: [~karya] [~alexr] any reason to drop the targets instead of bumping them? I will re-target for 1.4.1 and 1.1.4 (or is there no 1.1.4 planned?) > Mesos Agent Sends TASK_KILL status update to Master, and still launches task > > > Key: MESOS-7744 > URL: https://issues.apache.org/jira/browse/MESOS-7744 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.0.1 >Reporter: Sargun Dhillon >Assignee: Benjamin Mahler >Priority: Critical > Labels: reliability > > We sometimes launch jobs, and cancel them in ~7 seconds, if we don't get a > TASK_STARTING back from the agent. Under certain conditions it can result in > Mesos losing track of the task. The chunk of the logs which is interesting is > here: > {code} > Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:26.951799 5171 slave.cpp:1495] Got assigned > task Titus-7590548-worker-0-4476 for framework TitusFramework > Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:26.952251 5171 slave.cpp:1614] Launching task > Titus-7590548-worker-0-4476 for framework TitusFramework > Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:37.484611 5171 slave.cpp:1853] Queuing task > ‘Titus-7590548-worker-0-4476’ for executor ‘docker-executor’ of framework > TitusFramework at executor(1)@100.66.11.10:17707 > Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:37.487876 5171 slave.cpp:2035] Asked to kill > task Titus-7590548-worker-0-4476 of framework TitusFramework > Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:37.488994 5171 slave.cpp:3211] Handling > status update TASK_KILLED (UUID: 
898215d6-a244-4dbe-bc9c-878a22d36ea4) for > task Titus-7590548-worker-0-4476 of framework TitusFramework from @0.0.0.0:0 > Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:37.490603 5171 slave.cpp:2005] Sending queued > task ‘Titus-7590548-worker-0-4476’ to executor ‘docker-executor’ of framework > TitusFramework at executor(1)@100.66.11.10:17707{ > {code} > In our executor, we see that the launch message arrives after the master has > already gotten the kill update. We then send non-terminal state updates to > the agent, and yet it doesn't forward these to our framework. We're using a > custom executor which is based on the older mesos-go bindings. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-7744) Mesos Agent Sends TASK_KILL status update to Master, and still launches task
[ https://issues.apache.org/jira/browse/MESOS-7744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129261#comment-16129261 ] Benjamin Mahler commented on MESOS-7744: Ok, synced with [~alexr], looks like this will miss the 1.1.3 train and there is no 1.1.4 planned. > Mesos Agent Sends TASK_KILL status update to Master, and still launches task > > > Key: MESOS-7744 > URL: https://issues.apache.org/jira/browse/MESOS-7744 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.0.1 >Reporter: Sargun Dhillon >Assignee: Benjamin Mahler >Priority: Critical > Labels: reliability > > We sometimes launch jobs, and cancel them in ~7 seconds, if we don't get a > TASK_STARTING back from the agent. Under certain conditions it can result in > Mesos losing track of the task. The chunk of the logs which is interesting is > here: > {code} > Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:26.951799 5171 slave.cpp:1495] Got assigned > task Titus-7590548-worker-0-4476 for framework TitusFramework > Jun 29 23:22:26 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:26.952251 5171 slave.cpp:1614] Launching task > Titus-7590548-worker-0-4476 for framework TitusFramework > Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:37.484611 5171 slave.cpp:1853] Queuing task > ‘Titus-7590548-worker-0-4476’ for executor ‘docker-executor’ of framework > TitusFramework at executor(1)@100.66.11.10:17707 > Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:37.487876 5171 slave.cpp:2035] Asked to kill > task Titus-7590548-worker-0-4476 of framework TitusFramework > Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:37.488994 5171 slave.cpp:3211] Handling > status update TASK_KILLED (UUID: 898215d6-a244-4dbe-bc9c-878a22d36ea4) for > task 
Titus-7590548-worker-0-4476 of framework TitusFramework from @0.0.0.0:0 > Jun 29 23:22:37 titusagent-mainvpc-r3.8xlarge.2-i-04907efc9f1f8535c > mesos-slave[4290]: I0629 23:22:37.490603 5171 slave.cpp:2005] Sending queued > task ‘Titus-7590548-worker-0-4476’ to executor ‘docker-executor’ of framework > TitusFramework at executor(1)@100.66.11.10:17707{ > {code} > In our executor, we see that the launch message arrives after the master has > already gotten the kill update. We then send non-terminal state updates to > the agent, and yet it doesn't forward these to our framework. We're using a > custom executor which is based on the older mesos-go bindings. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-7513) Expose the container sandbox path to users via the API.
[ https://issues.apache.org/jira/browse/MESOS-7513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129288#comment-16129288 ] Benjamin Mahler commented on MESOS-7513: [~zhitao] that's an option, after discussing with [~anandmazumdar] I updated the ticket to just be about simplifying the path and exposing it consistently in the master and agent APIs. > Expose the container sandbox path to users via the API. > --- > > Key: MESOS-7513 > URL: https://issues.apache.org/jira/browse/MESOS-7513 > Project: Mesos > Issue Type: Task >Reporter: Anand Mazumdar > Labels: mesosphere > > Currently, only the agent exposes the executor sandbox via a {{directory}} > field in the executor JSON for the v0 API. The master's v0 API and all of the > v1 API do not expose the executor sandbox at all. > As a result, users reverse engineer the logic for generating the path and use > it in their scripts. To add to the difficulty, the path currently includes > the agent's work directory which is only obtainable from the agent endpoints > (i.e. {{//frameworks//executors/}}) rather > than exposing a virtual path (i.e. {{/frameworks//executors/}}), > like we did for {{/slave/log}} and {{/master/log}}. > We should expose the executor sandbox directory to users consistently in both > the master and agent v0/v1 APIs, as well as simplify the path format so that > users don't know about the agent's work directory. > This also needs to work for nested containers. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7513) Expose the container sandbox path to users via the API.
[ https://issues.apache.org/jira/browse/MESOS-7513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7513: --- Description: Currently, only the agent exposes the executor sandbox via a {{directory}} field in the executor JSON for the v0 API. The master's v0 API and all of the v1 API do not expose the executor sandbox at all. As a result, users reverse engineer the logic for generating the path and use it in their scripts. To add to the difficulty, the path currently includes the agent's work directory which is only obtainable from the agent endpoints (i.e. {{//frameworks//executors/}}) rather than exposing a virtual path (i.e. {{/frameworks//executors/}}), like we did for {{/slave/log}} and {{/master/log}}. We should expose the executor sandbox directory to users consistently in both the master and agent v0/v1 APIs, as well as simplify the path format so that users don't know about the agent's work directory. This also needs to work for nested containers. was: Currently, there is no public API for getting the path to the sandbox of a running container. This leads to folks reverse engineering the Mesos logic for generating the path and then using it in their scripts. This is already done by the Mesos Web UI and the DC/OS CLI. This is prone to errors if the Mesos path logic changes in the upcoming versions. We should introduce a new calls on the v1 Agent API; {{GET_CONTAINER_SANDBOX_PATH}}/{{GET_EXECUTOR_SANDBOX_PATH}} to get the path to a running container (can be nested) and another call to get the path to the executor sandbox. Summary: Expose the container sandbox path to users via the API. (was: Consider introducing an API call to get the sandbox of a running container.) > Expose the container sandbox path to users via the API. 
> --- > > Key: MESOS-7513 > URL: https://issues.apache.org/jira/browse/MESOS-7513 > Project: Mesos > Issue Type: Task >Reporter: Anand Mazumdar > Labels: mesosphere > > Currently, only the agent exposes the executor sandbox via a {{directory}} > field in the executor JSON for the v0 API. The master's v0 API and all of the > v1 API do not expose the executor sandbox at all. > As a result, users reverse engineer the logic for generating the path and use > it in their scripts. To add to the difficulty, the path currently includes > the agent's work directory which is only obtainable from the agent endpoints > (i.e. {{//frameworks//executors/}}) rather > than exposing a virtual path (i.e. {{/frameworks//executors/}}), > like we did for {{/slave/log}} and {{/master/log}}. > We should expose the executor sandbox directory to users consistently in both > the master and agent v0/v1 APIs, as well as simplify the path format so that > users don't know about the agent's work directory. > This also needs to work for nested containers. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
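The two path styles the ticket contrasts can be sketched as helpers. The on-disk form follows the agent's sandbox layout (work_dir, then slaves/frameworks/executors/runs components); the "virtual" form is only a sketch of what the ticket proposes, so callers would not need to know the agent's work directory. The helper names themselves are made up.

```cpp
#include <cassert>
#include <string>

// On-disk sandbox path: requires knowing the agent's work directory.
std::string onDiskSandboxPath(const std::string& workDir,
                              const std::string& slaveId,
                              const std::string& frameworkId,
                              const std::string& executorId,
                              const std::string& runId) {
  return workDir + "/slaves/" + slaveId + "/frameworks/" + frameworkId +
         "/executors/" + executorId + "/runs/" + runId;
}

// Hypothetical virtual path (the shape MESOS-7513 proposes): no work
// directory, resolvable by the master/agent on the user's behalf.
std::string virtualSandboxPath(const std::string& frameworkId,
                               const std::string& executorId) {
  return "/frameworks/" + frameworkId + "/executors/" + executorId;
}
```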
[jira] [Updated] (MESOS-7865) Agent may process a kill task and still launch the task.
[ https://issues.apache.org/jira/browse/MESOS-7865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7865: --- Priority: Critical (was: Major) > Agent may process a kill task and still launch the task. > > > Key: MESOS-7865 > URL: https://issues.apache.org/jira/browse/MESOS-7865 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Critical > > Based on the investigation of MESOS-7744, the agent has a race in which > "queued" tasks can still be launched after the agent has processed a kill > task for them. This race was introduced when {{Slave::statusUpdate}} was made > asynchronous: > (1) {{Slave::__run}} completes, task is now within {{Executor::queuedTasks}} > (2) {{Slave::killTask}} locates the executor based on the task ID residing in > queuedTasks, calls {{Slave::statusUpdate()}} with {{TASK_KILLED}} > (3) {{Slave::___run}} assumes that killed tasks have been removed from > {{Executor::queuedTasks}}, but this now occurs asynchronously in > {{Slave::_statusUpdate}}. So, the executor still sees the queued task and > delivers it and adds the task to {{Executor::launchedTasks}}. > (4) {{Slave::_statusUpdate}} runs, removes the task from > {{Executor::launchedTasks}} and adds it to {{Executor::terminatedTasks}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7861) Include check output in the DefaultExecutor log
[ https://issues.apache.org/jira/browse/MESOS-7861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann updated MESOS-7861: - Shepherd: Greg Mann > Include check output in the DefaultExecutor log > --- > > Key: MESOS-7861 > URL: https://issues.apache.org/jira/browse/MESOS-7861 > Project: Mesos > Issue Type: Bug > Components: executor >Affects Versions: 1.3.0 >Reporter: Michael Browning >Assignee: Gastón Kleiman >Priority: Minor > Labels: check, default-executor, health-check, mesosphere > > With the default executor, health and readiness checks are run in their own > nested containers, whose sandboxes are cleaned up right before performing the > next check. This makes access to stdout/stderr of previous runs of the check > command effectively impossible. > Although the exit code of the command being run is reported in a task status, > it is often necessary to see the command's actual output when debugging a > framework issue, so the ability to access this output via the executor logs > would be helpful. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7315) Design doc for resource provider and storage integration.
[ https://issues.apache.org/jira/browse/MESOS-7315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-7315: -- Labels: storage (was: ) > Design doc for resource provider and storage integration. > - > > Key: MESOS-7315 > URL: https://issues.apache.org/jira/browse/MESOS-7315 > Project: Mesos > Issue Type: Task >Reporter: Jie Yu >Assignee: Jie Yu > Labels: storage > Fix For: 1.4.0 > > > https://docs.google.com/document/d/125YWqg_5BB5OY9a6M7LZcby5RSqBwo2PZzpVLuxYXh4/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7336) Add resource provider API protobuf.
[ https://issues.apache.org/jira/browse/MESOS-7336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-7336: -- Labels: storage (was: ) > Add resource provider API protobuf. > --- > > Key: MESOS-7336 > URL: https://issues.apache.org/jira/browse/MESOS-7336 > Project: Mesos > Issue Type: Task >Reporter: Jie Yu >Assignee: Jie Yu > Labels: storage > Fix For: 1.3.0 > > > Resource provider API will be Event/Call based, similar to the scheduler or > executor API. Resource providers will use this API to interact with the > master, sending Calls to the master and receiving Event from the master. > The same API will be used for both local resource providers and external > resource providers. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
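The Event/Call shape described in MESOS-7336 can be sketched with plain structs. These are made-up simplifications, not the real resource provider protobufs: the provider sends Calls to the master and receives Events back, mirroring the scheduler/executor API handshake the ticket refers to.

```cpp
#include <cassert>
#include <vector>

// Illustrative Call/Event pair (not the real protobuf messages).
struct Call {
  enum Type { SUBSCRIBE, UPDATE_STATE };
  Type type;
};

struct Event {
  enum Type { SUBSCRIBED, PUBLISH };
  Type type;
};

// A master stub that answers a SUBSCRIBE call with a SUBSCRIBED event.
struct MasterStub {
  std::vector<Event> send(const Call& call) {
    std::vector<Event> events;
    if (call.type == Call::SUBSCRIBE) {
      events.push_back(Event{Event::SUBSCRIBED});
    }
    return events;
  }
};
```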
[jira] [Assigned] (MESOS-7861) Include check output in the DefaultExecutor log
[ https://issues.apache.org/jira/browse/MESOS-7861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gastón Kleiman reassigned MESOS-7861: - Assignee: Gastón Kleiman > Include check output in the DefaultExecutor log > --- > > Key: MESOS-7861 > URL: https://issues.apache.org/jira/browse/MESOS-7861 > Project: Mesos > Issue Type: Bug > Components: executor >Affects Versions: 1.3.0 >Reporter: Michael Browning >Assignee: Gastón Kleiman >Priority: Minor > Labels: check, default-executor, health-check, mesosphere > > With the default executor, health and readiness checks are run in their own > nested containers, whose sandboxes are cleaned up right before performing the > next check. This makes access to stdout/stderr of previous runs of the check > command effectively impossible. > Although the exit code of the command being run is reported in a task status, > it is often necessary to see the command's actual output when debugging a > framework issue, so the ability to access this output via the executor logs > would be helpful. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7304) Fetcher should not depend on SlaveID.
[ https://issues.apache.org/jira/browse/MESOS-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-7304: -- Labels: mesosphere storage (was: mesosphere) > Fetcher should not depend on SlaveID. > - > > Key: MESOS-7304 > URL: https://issues.apache.org/jira/browse/MESOS-7304 > Project: Mesos > Issue Type: Task > Components: containerization, fetcher >Reporter: Jie Yu >Assignee: Joseph Wu > Labels: mesosphere, storage > Fix For: 1.4.0 > > > Currently, various Fetcher interfaces depend on SlaveID, which is an > unnecessary coupling. For instance: > {code} > Try<Nothing> Fetcher::recover(const SlaveID& slaveId, const Flags& flags); > Future<Nothing> Fetcher::fetch( > const ContainerID& containerId, > const CommandInfo& commandInfo, > const string& sandboxDirectory, > const Option<string>& user, > const SlaveID& slaveId, > const Flags& flags); > {code} > Looks like the only reason we need a SlaveID is because we need to calculate > the fetcher cache directory based on that. We should calculate > the fetcher cache directory in the caller and pass that directory to Fetcher. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
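The decoupling the ticket proposes can be sketched as follows: the caller derives the fetcher cache directory from the SlaveID and hands only the directory to the Fetcher. Both the class and the path layout here are illustrative, not the actual Mesos interface or on-disk format.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Caller-side derivation of the cache directory (layout is illustrative).
std::string fetcherCacheDirectory(const std::string& cacheRoot,
                                  const std::string& slaveId) {
  return cacheRoot + "/" + slaveId;
}

// A refactored interface in this style needs no SlaveID at all:
class FetcherSketch {
 public:
  explicit FetcherSketch(std::string cacheDir)
    : cacheDir_(std::move(cacheDir)) {}

  const std::string& cacheDirectory() const { return cacheDir_; }

 private:
  std::string cacheDir_;
};
```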
[jira] [Updated] (MESOS-7560) Add 'type' and 'name' to ResourceProviderInfo.
[ https://issues.apache.org/jira/browse/MESOS-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-7560: -- Labels: storage (was: ) > Add 'type' and 'name' to ResourceProviderInfo. > -- > > Key: MESOS-7560 > URL: https://issues.apache.org/jira/browse/MESOS-7560 > Project: Mesos > Issue Type: Task >Reporter: Jie Yu >Assignee: Jie Yu > Labels: storage > Fix For: 1.4.0 > > > The 'type' field will be used to load the corresponding implementation > (either internal or via module). To avoid conflict, the naming should follow > java packing naming scheme (e.g., > org.apache.mesos.resource_provider.local.storage). > Since there could be multiple instances of the same resource provider type, > it's important to also add a 'name' field to distinguish between instances of > the same type. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7593) Update offer handling in the master to consider local resource providers
[ https://issues.apache.org/jira/browse/MESOS-7593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-7593: -- Labels: mesosphere storage (was: mesosphere) > Update offer handling in the master to consider local resource providers > > > Key: MESOS-7593 > URL: https://issues.apache.org/jira/browse/MESOS-7593 > Project: Mesos > Issue Type: Task > Components: master >Reporter: Jan Schlicht >Assignee: Jan Schlicht > Labels: mesosphere, storage > Fix For: 1.4.0 > > > Adding local resource providers to the allocator will result in offers > created for them. This needs to be reflected in the master, i.e. the > resources need to be added to the offer that will be send to frameworks. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7347) Prototype resource offer operation handling in the master
[ https://issues.apache.org/jira/browse/MESOS-7347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-7347: -- Labels: mesosphere storage (was: mesosphere) > Prototype resource offer operation handling in the master > - > > Key: MESOS-7347 > URL: https://issues.apache.org/jira/browse/MESOS-7347 > Project: Mesos > Issue Type: Task > Components: master >Reporter: Jan Schlicht >Assignee: Jan Schlicht > Labels: mesosphere, storage > > Prototype the following workflow in the master, in accordance with the > resource provider design; > * Handle accept calls including resource provider related offer operations > ({{CREATE_VOLUME}}, ...) > * Implement internal bookkeeping of the disk resources these operations will > be applied on > * Implement resource bookkeeping for resource providers in the master > * Send resource provider operations to resource providers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7449) Refactor containerizers to not depend on TaskInfo or ExecutorInfo
[ https://issues.apache.org/jira/browse/MESOS-7449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-7449: -- Labels: mesosphere storage (was: mesosphere) > Refactor containerizers to not depend on TaskInfo or ExecutorInfo > - > > Key: MESOS-7449 > URL: https://issues.apache.org/jira/browse/MESOS-7449 > Project: Mesos > Issue Type: Task > Components: containerization >Reporter: Joseph Wu >Assignee: Joseph Wu > Labels: mesosphere, storage > Fix For: 1.4.0 > > > The Containerizer interfaces should be refactored so that they do not depend > on {{TaskInfo}} or {{ExecutorInfo}}, as a standalone container will have > neither. > Currently, the {{launch}} interface depends on those fields. Instead, we > should consistently use {{ContainerInfo}} and {{CommandInfo}} in > Containerizer and isolators. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
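An illustrative "after" picture of the refactor: since a standalone container has no TaskInfo or ExecutorInfo, the launch path should only need ContainerInfo and CommandInfo. The simplified structs below are stand-ins for the real protobufs, used here only to show that a launch config can be built with no task or executor at all.

```cpp
#include <cassert>
#include <string>

// Hypothetical simplified stand-ins for the real protobuf messages.
struct CommandInfoSketch { std::string value; };
struct ContainerInfoSketch { std::string image; };

struct ContainerLaunchConfig {
  CommandInfoSketch command;
  ContainerInfoSketch container;
};

// A standalone container configured without any TaskInfo/ExecutorInfo:
ContainerLaunchConfig makeStandaloneConfig(const std::string& image,
                                           const std::string& cmd) {
  return ContainerLaunchConfig{CommandInfoSketch{cmd},
                               ContainerInfoSketch{image}};
}
```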
[jira] [Updated] (MESOS-7591) Update master to use resource provider IDs instead of agent ID in allocator calls.
[ https://issues.apache.org/jira/browse/MESOS-7591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-7591: -- Labels: mesosphere storage (was: mesosphere) > Update master to use resource provider IDs instead of agent ID in allocator > calls. > -- > > Key: MESOS-7591 > URL: https://issues.apache.org/jira/browse/MESOS-7591 > Project: Mesos > Issue Type: Task > Components: master >Reporter: Jan Schlicht >Assignee: Jan Schlicht > Labels: mesosphere, storage > Fix For: 1.4.0 > > > MESOS-7388 updates the allocator interface to use {{ResourceProviderID}} > instead of {{SlaveID}}. The usage of the allocator in the master has to be > updated accordingly. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7571) Add `--resource_provider_config_dir` flag to the agent.
[ https://issues.apache.org/jira/browse/MESOS-7571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-7571: -- Labels: storage (was: ) > Add `--resource_provider_config_dir` flag to the agent. > --- > > Key: MESOS-7571 > URL: https://issues.apache.org/jira/browse/MESOS-7571 > Project: Mesos > Issue Type: Task >Reporter: Jie Yu >Assignee: Jie Yu > Labels: storage > Fix For: 1.4.0 > > > Add an agent flag `--resource_provider_config_dir` to allow operators to > specify the list of local resource providers to register with the agent. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-7007) filesystem/shared and --default_container_info broken since 1.1
[ https://issues.apache.org/jira/browse/MESOS-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129496#comment-16129496 ] R.B. Boyer commented on MESOS-7007: --- This bug still affects 1.2.2. A variation of one of the above patches works on this lineage I think: {code} diff --git a/mesos-1.2.2/src/slave/containerizer/mesos/containerizer.cpp b/mesos-1.2.2/src/slave/containerizer/mesos/containerizer.cpp index 777ffc7..2057166 100644 --- a/mesos-1.2.2/src/slave/containerizer/mesos/containerizer.cpp +++ b/mesos-1.2.2/src/slave/containerizer/mesos/containerizer.cpp @@ -1037,6 +1037,7 @@ src/slave/containerizer/mesos/isolators/network/cni/cni.cpp: executo containerConfig.mutable_executor_info()->CopyFrom(executorInfo); containerConfig.mutable_command_info()->CopyFrom(executorInfo.command()); containerConfig.mutable_resources()->CopyFrom(executorInfo.resources()); + containerConfig.mutable_container_info()->CopyFrom(executorInfo.container()); containerConfig.set_directory(directory); if (user.isSome()) { {code} > filesystem/shared and --default_container_info broken since 1.1 > --- > > Key: MESOS-7007 > URL: https://issues.apache.org/jira/browse/MESOS-7007 > Project: Mesos > Issue Type: Bug > Components: agent >Affects Versions: 1.1.0, 1.2.0 >Reporter: Pierre Cheynier >Assignee: Chun-Hung Hsiao > Labels: storage > > I face this issue, that prevent me to upgrade to 1.1.0 (and the change was > consequently introduced in this version): > I'm using default_container_info to mount a /tmp volume in the container's > mount namespace from its current sandbox, meaning that each container have a > dedicated /tmp, thanks to the {{filesystem/shared}} isolator. > I noticed through our automation pipeline that integration tests were failing > and found that this is because /tmp (the one from the host!) contents is > trashed each time a container is created. 
> Here is my setup:
> * {{--isolation='cgroups/cpu,cgroups/mem,namespaces/pid,*disk/du,filesystem/shared,filesystem/linux*,docker/runtime'}}
> * {{--default_container_info='\{"type":"MESOS","volumes":\[\{"host_path":"tmp","container_path":"/tmp","mode":"RW"\}\]\}'}}
> I discovered this issue in the early days of 1.1 (end of Nov; spoke with
> someone on Slack), but unfortunately had no time to dig into the symptoms a
> bit more.
> I found nothing interesting even using GLOGv=3.
> Maybe it's a bad usage of isolators that triggers this issue? If that's the
> case, then at least a documentation update should be done.
> Let me know if more information is needed.
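For readability, the JIRA-escaped flag values in the setup above correspond to this plain invocation (the asterisks in the {{--isolation}} value are JIRA bold markers, not part of the flag; other agent flags are omitted here):

```shell
# Unescaped form of the agent flags quoted in the report above.
mesos-agent \
  --isolation='cgroups/cpu,cgroups/mem,namespaces/pid,disk/du,filesystem/shared,filesystem/linux,docker/runtime' \
  --default_container_info='{"type":"MESOS","volumes":[{"host_path":"tmp","container_path":"/tmp","mode":"RW"}]}'
```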
[jira] [Comment Edited] (MESOS-7007) filesystem/shared and --default_container_info broken since 1.1
[ https://issues.apache.org/jira/browse/MESOS-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129496#comment-16129496 ]

R.B. Boyer edited comment on MESOS-7007 at 8/16/17 10:11 PM:
-------------------------------------------------------------

This bug still affects 1.2.2. A variation of one of the above patches also works on 1.2.x:

{code}
diff --git a/mesos-1.2.2/src/slave/containerizer/mesos/containerizer.cpp b/mesos-1.2.2/src/slave/containerizer/mesos/containerizer.cpp
index 777ffc7..2057166 100644
--- a/mesos-1.2.2/src/slave/containerizer/mesos/containerizer.cpp
+++ b/mesos-1.2.2/src/slave/containerizer/mesos/containerizer.cpp
@@ -1037,6 +1037,7 @@
   containerConfig.mutable_executor_info()->CopyFrom(executorInfo);
   containerConfig.mutable_command_info()->CopyFrom(executorInfo.command());
   containerConfig.mutable_resources()->CopyFrom(executorInfo.resources());
+  containerConfig.mutable_container_info()->CopyFrom(executorInfo.container());
   containerConfig.set_directory(directory);

   if (user.isSome()) {
{code}

I don't think I know enough about this particular file to understand why you'd copy something from the ExecutorInfo blindly (a la my patch) or conditionally (a la patchset 58980).

was (Author: naelyn):
This bug still affects 1.2.2.