[jira] [Created] (MESOS-9930) DRF sorter may omit clients in sorting after removing an inactive leaf node.

2019-08-08 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-9930:
---

 Summary: DRF sorter may omit clients in sorting after removing an 
inactive leaf node.
 Key: MESOS-9930
 URL: https://issues.apache.org/jira/browse/MESOS-9930
 Project: Mesos
  Issue Type: Bug
  Components: allocation
Reporter: Meng Zhu
Assignee: Meng Zhu


The sorter assumes that inactive leaf nodes are placed at the tail of a node's
children list.
However, when a parent node with a single "." virtual child node is collapsed,
its position may fail to be updated due to a bug in `Sorter::remove()`:

{noformat}
CHECK(child->isLeaf());

// `current` takes on the kind of its last remaining "." child, which may be
// INACTIVE_LEAF if the node was deactivated.
current->kind = child->kind;
...
// Only the INTERNAL case is handled below; nothing moves a node that has
// collapsed into an inactive leaf to the tail of its parent's children list.
if (current->kind == Node::INTERNAL) {
}
{noformat}
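
A hedged sketch of the repositioning that appears to be missing (the
`removeChild`/`addChild` helpers are assumptions standing in for whatever the
sorter uses to keep inactive leaves at the tail; this is not the actual patch):

{noformat}
current->kind = child->kind;

// If the node collapsed into an inactive leaf, re-append it so that it ends
// up behind all active children of its parent, restoring the invariant.
if (current->kind == Node::INACTIVE_LEAF) {
  current->parent->removeChild(current);
  current->parent->addChild(current);
}
{noformat}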

This bug manifests if:
(1) we have clients a/b and a/.
(2) we call deactivate(a), i.e. a/. becomes an inactive leaf
(3) we call remove(a/b)
When this happens, a/. collapses into `a` as an inactive leaf but, due to the
bug above, it is not moved to the end of the children list, so all clients
after `a` are omitted from sort().
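
A minimal repro sketch against the sorter's client-path interface (an
illustration only, not an actual Mesos test; the exact construction and
activation defaults are assumptions):

{code}
DRFSorter sorter;

sorter.add("a");
sorter.add("a/b");    // adding a sub-client turns `a` into an internal node
                      // with the virtual leaf `a/.`
sorter.add("c");      // `c` sits after `a` in the root's children list

sorter.activate("a");
sorter.activate("a/b");
sorter.activate("c");

sorter.deactivate("a");  // `a/.` becomes an inactive leaf
sorter.remove("a/b");    // `a/.` collapses into leaf `a` but keeps its position

// Expected: sort() returns {"c"}.
// With the bug: the inactive leaf `a` sits *before* the active client `c`,
// violating the "inactive leaves at the tail" assumption, so `c` is omitted.
std::vector<std::string> clients = sorter.sort();
{code}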

Luckily, this should never happen in practice, because only frameworks get
deactivated, and frameworks don't have sub-clients.




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (MESOS-9887) Race condition between two terminal task status updates for Docker executor.

2019-08-08 Thread Andrei Budnik (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-9887:


Assignee: Andrei Budnik

> Race condition between two terminal task status updates for Docker executor.
> 
>
> Key: MESOS-9887
> URL: https://issues.apache.org/jira/browse/MESOS-9887
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>Priority: Blocker
>  Labels: agent, containerization
> Attachments: race_example.txt
>
>
> h2. Overview
> Expected behavior:
>  Task successfully finishes and sends TASK_FINISHED status update.
> Observed behavior:
>  Task successfully finishes, but the agent sends TASK_FAILED with the reason 
> "REASON_EXECUTOR_TERMINATED".
> In normal circumstances, the Docker executor 
> [sends|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/docker/executor.cpp#L758]
>  the final TASK_FINISHED status update to the agent, which [gets 
> processed|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5543]
>  before the executor's process terminates.
> However, if processing of the initial TASK_FINISHED is delayed, there is a 
> chance that the Docker executor terminates and the agent 
> [triggers|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L6662]
>  TASK_FAILED, which will [be 
> handled|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5816-L5826]
>  prior to the TASK_FINISHED status update.
> See attached logs which contain an example of the race condition.
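> The ordering can be illustrated with a plain C++ sketch (an illustration 
> only, not Mesos/libprocess code; the delay stands in for the slow 
> `containerizer->status()` call):
> {code}
> #include <chrono>
> #include <future>
> #include <iostream>
> #include <mutex>
> #include <string>
> #include <thread>
> #include <vector>
>
> int main() {
>   std::mutex mutex;
>   std::vector<std::string> committed;  // order in which terminal updates land
>
>   auto commit = [&](const std::string& update) {
>     std::lock_guard<std::mutex> lock(mutex);
>     committed.push_back(update);
>   };
>
>   // TASK_FINISHED: its status-handling path is slow (mirrors the ::sleep(2)).
>   std::future<void> finished = std::async(std::launch::async, [&] {
>     std::this_thread::sleep_for(std::chrono::seconds(2));
>     commit("TASK_FINISHED");
>   });
>
>   // TASK_FAILED: triggered by executor termination, completes immediately.
>   std::future<void> failed = std::async(std::launch::async, [&] {
>     commit("TASK_FAILED");
>   });
>
>   failed.wait();
>   finished.wait();
>
>   // Prints TASK_FAILED before TASK_FINISHED: the first terminal update
>   // recorded wins, mirroring the observed behavior.
>   for (const std::string& update : committed) {
>     std::cout << update << std::endl;
>   }
> }
> {code}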
> h2. Reproducing bug
> 1. Add the following code:
> {code:java}
>   static int c = 0;
>   if (++c == 3) { // to skip TASK_STARTING and TASK_RUNNING status updates.
> ::sleep(2);
>   }
> {code}
> to the 
> [`ComposingContainerizerProcess::status`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L578]
>  and to the 
> [`DockerContainerizerProcess::status`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/docker.cpp#L2167].
> 2. Recompile Mesos.
> 3. Launch the Mesos master and agent locally.
> 4. Launch a simple Docker task via `mesos-execute`:
> {code}
> #  cd build
> ./src/mesos-execute --master="`hostname`:5050" --name="a" 
> --containerizer=docker --docker_image=alpine --resources="cpus:1;mem:32" 
> --command="ls"
> {code}
> h2. Race condition - description
> 1. The Mesos agent receives the TASK_FINISHED status update and then subscribes to 
> [`containerizer->status()`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5754-L5761].
> 2. The `containerizer->status()` operation for the TASK_FINISHED status update gets 
> delayed in the composing containerizer (e.g., due to a switch of the worker 
> thread that executes the `status` method).
> 3. Docker executor terminates and the agent 
> [triggers|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L6662]
>  TASK_FAILED.
> 4. The Docker containerizer destroys the container. A registered callback for the 
> `containerizer->wait` call in the composing containerizer dispatches a [lambda 
> function|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L368-L373]
>  that will clean up the `containers_` map.
> 5. Composing c'zer resumes and dispatches 
> `[status()|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L579]`
>  method to the Docker containerizer for TASK_FINISHED, which in turn hangs 
> for a few seconds.
> 6. Corresponding `containerId` gets removed from the `containers_` map of the 
> composing c'zer.
> 7. The Mesos agent subscribes to 
> [`containerizer->status()`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5754-L5761]
>  for the TASK_FAILED status update.
> 8. Composing c'zer returns ["Container not 
> found"|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L576]
>  for TASK_FAILED.
> 9. 
> `[Slave::_statusUpdate|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5826]`
>  stores TASK_FAILED terminal status update in the executor's data structure.
> 10. Docker containerizer resumes and finishes processing of `status()` method 
> for TASK_FINISHED. Finally, it returns control to the `Slave::_statusUpdate` 
> continuation. This method 
> 

[jira] [Created] (MESOS-9929) maintenance schedule page End - Just know not correct

2019-08-08 Thread none (JIRA)
none created MESOS-9929:
---

 Summary: maintenance schedule page End - Just know not correct
 Key: MESOS-9929
 URL: https://issues.apache.org/jira/browse/MESOS-9929
 Project: Mesos
  Issue Type: Task
Reporter: none
 Attachments: 20190808-095824-m01.local_5050.png, 
20190808-095834-m01.local_5050.png





--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (MESOS-9908) Introduce a new agent flag and support docker volume chown to task user.

2019-08-08 Thread Gilbert Song (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song reassigned MESOS-9908:
---

Assignee: Gilbert Song

> Introduce a new agent flag and support docker volume chown to task user.
> 
>
> Key: MESOS-9908
> URL: https://issues.apache.org/jira/browse/MESOS-9908
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>Priority: Major
>  Labels: containerization
>
> Currently, a Docker volume is always mounted as root, which makes it 
> inaccessible to non-root task users. For security reasons, there are use 
> cases where operators only allow non-root users as the container user, so 
> Docker volumes need to be usable by those non-root users.
> A new agent flag is needed to make this support configurable, because 
> chown-ing a Docker volume may not suit every use case - e.g., multiple 
> non-root users on different hosts sharing the same Docker volume 
> simultaneously. Operators are expected to turn on this flag only if their 
> cluster's Docker volumes are not shared by multiple non-root users.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)