[jira] [Created] (MESOS-9930) DRF sorter may omit clients in sorting after removing an inactive leaf node.
Meng Zhu created MESOS-9930:
--------------------------------

             Summary: DRF sorter may omit clients in sorting after removing an inactive leaf node.
                 Key: MESOS-9930
                 URL: https://issues.apache.org/jira/browse/MESOS-9930
             Project: Mesos
          Issue Type: Bug
          Components: allocation
            Reporter: Meng Zhu
            Assignee: Meng Zhu

The sorter assumes that inactive leaf nodes are placed at the tail of a node's children list. However, when collapsing a parent node with a single "." virtual child node, the collapsed node's position may fail to be updated due to a bug in `Sorter::remove()`:

{noformat}
CHECK(child->isLeaf());
current->kind = child->kind;
...
if (current->kind == Node::INTERNAL) {
}
{noformat}

This bug manifests if:

(1) we have a/b and a/.
(2) deactivate(a), i.e. a/. becomes an inactive leaf
(3) remove(a/b)

When this happens, a/. collapses into `a` as an inactive leaf, but due to the bug above `a` is not moved to the end of the children list, so all clients after `a` are omitted from sort().

Luckily, this should never happen in practice, because only frameworks get deactivated, and frameworks don't have sub-clients.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
[jira] [Assigned] (MESOS-9887) Race condition between two terminal task status updates for Docker executor.
[ https://issues.apache.org/jira/browse/MESOS-9887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrei Budnik reassigned MESOS-9887:
------------------------------------

    Assignee: Andrei Budnik

> Race condition between two terminal task status updates for Docker executor.
> ----------------------------------------------------------------------------
>
>                 Key: MESOS-9887
>                 URL: https://issues.apache.org/jira/browse/MESOS-9887
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, containerization
>            Reporter: Andrei Budnik
>            Assignee: Andrei Budnik
>            Priority: Blocker
>              Labels: agent, containerization
>         Attachments: race_example.txt
>
> h2. Overview
> Expected behavior:
> The task finishes successfully and sends a TASK_FINISHED status update.
> Observed behavior:
> The task finishes successfully, but the agent sends TASK_FAILED with the reason REASON_EXECUTOR_TERMINATED.
> In normal circumstances, the Docker executor [sends|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/docker/executor.cpp#L758] the final TASK_FINISHED status update to the agent, which then [processes|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5543] it before the executor's process terminates.
> However, if processing of the initial TASK_FINISHED is delayed, there is a chance that the Docker executor terminates and the agent [triggers|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L6662] TASK_FAILED, which is then [handled|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5816-L5826] before the TASK_FINISHED status update.
> See the attached logs for an example of the race condition.
> h2. Reproducing the bug
> 1. Add the following code:
> {code:java}
> static int c = 0;
> if (++c == 3) { // Skip the TASK_STARTING and TASK_RUNNING status updates.
>   ::sleep(2);
> }
> {code}
> to [`ComposingContainerizerProcess::status`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L578] and to [`DockerContainerizerProcess::status`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/docker.cpp#L2167].
> 2. Recompile Mesos.
> 3. Launch the Mesos master and agent locally.
> 4. Launch a simple Docker task via `mesos-execute`:
> {code}
> # cd build
> ./src/mesos-execute --master="`hostname`:5050" --name="a" --containerizer=docker --docker_image=alpine --resources="cpus:1;mem:32" --command="ls"
> {code}
> h2. Race condition - description
> 1. The Mesos agent receives the TASK_FINISHED status update and then subscribes to [`containerizer->status()`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5754-L5761].
> 2. The `containerizer->status()` operation for the TASK_FINISHED status update gets delayed in the composing containerizer (e.g. due to a switch of the worker thread that executes the `status` method).
> 3. The Docker executor terminates and the agent [triggers|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L6662] TASK_FAILED.
> 4. The Docker containerizer destroys the container. A callback registered for the `containerizer->wait` call in the composing containerizer dispatches a [lambda function|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L368-L373] that will clean up the `containers_` map.
> 5. The composing containerizer resumes and dispatches the [`status()`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L579] method to the Docker containerizer for TASK_FINISHED, which in turn hangs for a few seconds.
> 6. The corresponding `containerId` is removed from the `containers_` map of the composing containerizer.
> 7. The Mesos agent subscribes to [`containerizer->status()`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5754-L5761] for the TASK_FAILED status update.
> 8. The composing containerizer returns ["Container not found"|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L576] for TASK_FAILED.
> 9. [`Slave::_statusUpdate`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5826] stores the TASK_FAILED terminal status update in the executor's data structure.
> 10. The Docker containerizer resumes and finishes processing the `status()` method for TASK_FINISHED. Finally, it returns control to the `Slave::_statusUpdate` continuation. This method
[jira] [Created] (MESOS-9929) maintenance schedule page End - Just know not correct
none created MESOS-9929:
----------------------------

             Summary: maintenance schedule page End - Just know not correct
                 Key: MESOS-9929
                 URL: https://issues.apache.org/jira/browse/MESOS-9929
             Project: Mesos
          Issue Type: Task
            Reporter: none
         Attachments: 20190808-095824-m01.local_5050.png, 20190808-095834-m01.local_5050.png

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
[jira] [Assigned] (MESOS-9908) Introduce a new agent flag and support docker volume chown to task user.
[ https://issues.apache.org/jira/browse/MESOS-9908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gilbert Song reassigned MESOS-9908:
-----------------------------------

    Assignee: Gilbert Song

> Introduce a new agent flag and support docker volume chown to task user.
> -------------------------------------------------------------------------
>
>                 Key: MESOS-9908
>                 URL: https://issues.apache.org/jira/browse/MESOS-9908
>             Project: Mesos
>          Issue Type: Improvement
>          Components: containerization
>            Reporter: Gilbert Song
>            Assignee: Gilbert Song
>            Priority: Major
>              Labels: containerization
>
> Currently, a docker volume is always mounted as root, which makes it inaccessible to non-root task users. For security reasons, there are use cases where operators only allow non-root users as container users, and docker volumes need to be usable by those non-root users.
> A new agent flag is needed to make this support configurable, because chown-ing a docker volume is not safe in every case - e.g., when multiple non-root users on different hosts share the same docker volume simultaneously. Operators are expected to turn on this flag only if their cluster's docker volumes are not shared by multiple non-root users.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)