[jira] [Commented] (MESOS-9672) Docker containerizer should ignore pids of executors that do not pass the connection check.

2019-03-25 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16801146#comment-16801146
 ] 

Vinod Kone commented on MESOS-9672:
---

I guess we would still need this in case pid reuse happens even without an 
agent reboot (highly unlikely but technically possible).

> Docker containerizer should ignore pids of executors that do not pass the 
> connection check.
> ---
>
> Key: MESOS-9672
> URL: https://issues.apache.org/jira/browse/MESOS-9672
> Project: Mesos
> Issue Type: Bug
> Components: containerization
> Reporter: Meng Zhu
> Priority: Major
> Labels: containerization
>
> When recovering executors with a tracked pid we first try to establish a 
> connection to its libprocess address to avoid reaping an irrelevant process:
> https://github.com/apache/mesos/blob/4580834471fb3bc0b95e2b96e04a63d34faef724/src/slave/containerizer/docker.cpp#L1019-L1054
> If the connection fails to establish, we should not track its pid: 
> https://github.com/apache/mesos/blob/4580834471fb3bc0b95e2b96e04a63d34faef724/src/slave/containerizer/docker.cpp#L1071
> One problem this might cause: if the pid is being used by another 
> executor, this could lead to a duplicate pid error and send the agent into a 
> crash loop:
> https://github.com/apache/mesos/blob/4580834471fb3bc0b95e2b96e04a63d34faef724/src/slave/containerizer/docker.cpp#L1066-L1068
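
A minimal sketch of the behavior the description asks for, not the actual
docker.cpp code: a checkpointed pid is tracked only if the executor passes the
connection check. The names RecoveredExecutor, selectTrackedPids and canConnect
are hypothetical, and the connection check is replaced by a caller-supplied
probe.

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a checkpointed executor record.
struct RecoveredExecutor {
  std::string containerId;
  int pid;                        // Checkpointed pid, or -1 if none.
  std::string libprocessAddress;  // Address used for the connection check.
};

// Returns the pid -> containerId map of executors whose pids should be
// tracked. `canConnect` is a hypothetical probe standing in for the
// connection check against the executor's libprocess address.
std::map<int, std::string> selectTrackedPids(
    const std::vector<RecoveredExecutor>& executors,
    const std::function<bool(const std::string&)>& canConnect)
{
  std::map<int, std::string> tracked;

  for (const RecoveredExecutor& executor : executors) {
    if (executor.pid < 0) {
      continue;  // Nothing was checkpointed for this executor.
    }

    // If the connection check fails, the pid may now belong to an unrelated
    // process (or to another executor), so it is not tracked.
    if (!canConnect(executor.libprocessAddress)) {
      continue;
    }

    // emplace leaves an already-tracked pid untouched.
    tracked.emplace(executor.pid, executor.containerId);
  }

  return tracked;
}
```

Because stale pids are dropped before insertion, a reused pid no longer
collides with the executor that currently owns it, which is the duplicate pid
case described above.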



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-7911) Non-checkpointing framework's tasks should not be marked LOST when agent disconnects.

2019-03-25 Thread Greg Mann (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-7911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-7911:


Assignee: (was: Benno Evers)

> Non-checkpointing framework's tasks should not be marked LOST when agent 
> disconnects.
> -
>
> Key: MESOS-7911
> URL: https://issues.apache.org/jira/browse/MESOS-7911
> Project: Mesos
> Issue Type: Bug
> Reporter: Benjamin Mahler
> Priority: Critical
> Labels: foundations, reliability
>
> Currently, when a framework with checkpointing disabled has tasks running on 
> an agent and that agent disconnects from the master, the master will mark 
> those tasks LOST and remove them from its memory. The assumption is that the 
> agent is disconnecting because it terminated.
> However, it's possible that this disconnection occurred due to a transient 
> loss of connectivity and the agent re-connects without ever having 
> terminated. This case violates our assumption that the agent holds no tasks 
> unknown to the master:
> ```
> void Master::reconcileKnownSlave(
>     Slave* slave,
>     const vector& executors,
>     const vector& tasks)
> {
>   ...
>   // TODO(bmahler): There's an implicit assumption here the slave
>   // cannot have tasks unknown to the master. This _should_ be the
>   // case since the causal relationship is:
>   //   slave removes task -> master removes task
>   // Add error logging for any violations of this assumption!
>  ```
> As a result, the tasks would remain on the agent but the master would not 
> know about them!
> A more appropriate action here would be:
> # When an agent disconnects, mark the tasks as unreachable.
> ## If the framework is not partition aware, only show it the last known task 
> state.
> ## If the framework is partition aware, let it know that the task is now unreachable.
> # If the agent re-connects:
> ## And the agent had restarted, let the non-checkpointing framework know its 
> tasks are GONE/LOST.
> ## If the agent still holds the tasks, the tasks are restored as reachable.
> # If the agent gets removed:
> ## For partition aware non-checkpointing frameworks, let them know the tasks 
> are unreachable.
> ## For non partition aware non-checkpointing frameworks, let them know the 
> tasks are lost and kill them if the agent comes back.
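
The proposed handling can be read as a small decision table. The sketch below
encodes it for a non-checkpointing framework; it is an illustration of the
proposal, not master code, and the names AgentEvent, FrameworkView and
viewAfter are hypothetical. RESTORED stands for "the tasks are restored as
reachable", which the proposal does not map to a specific task state.

```cpp
// Hypothetical event and outcome names mirroring the proposal above.
enum class AgentEvent {
  DISCONNECTED,
  RECONNECTED_AFTER_RESTART,  // Agent came back, but it had restarted.
  RECONNECTED_WITH_TASKS,     // Agent came back and still holds the tasks.
  REMOVED
};

enum class FrameworkView {
  LAST_KNOWN_STATE,  // Framework keeps seeing the last known task state.
  UNREACHABLE,
  GONE_OR_LOST,
  LOST,
  RESTORED
};

// What a non-checkpointing framework is told about its tasks on the agent.
FrameworkView viewAfter(AgentEvent event, bool partitionAware)
{
  switch (event) {
    case AgentEvent::DISCONNECTED:
      // The master marks the tasks unreachable either way; only partition
      // aware frameworks are told so.
      return partitionAware ? FrameworkView::UNREACHABLE
                            : FrameworkView::LAST_KNOWN_STATE;
    case AgentEvent::RECONNECTED_AFTER_RESTART:
      return FrameworkView::GONE_OR_LOST;
    case AgentEvent::RECONNECTED_WITH_TASKS:
      return FrameworkView::RESTORED;
    case AgentEvent::REMOVED:
      // Non partition aware frameworks see LOST; per the proposal the tasks
      // are then killed if the agent comes back.
      return partitionAware ? FrameworkView::UNREACHABLE
                            : FrameworkView::LOST;
  }

  // Not reached for valid enum values; keeps compilers from warning.
  return FrameworkView::LAST_KNOWN_STATE;
}
```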





[jira] [Commented] (MESOS-9677) RPM packages should be built with launcher sealing

2019-03-25 Thread Benjamin Bannier (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16800793#comment-16800793
 ] 

Benjamin Bannier commented on MESOS-9677:
-

A patch for the package spec is posted at [https://reviews.apache.org/r/70295/].

> RPM packages should be built with launcher sealing
> --
>
> Key: MESOS-9677
> URL: https://issues.apache.org/jira/browse/MESOS-9677
> Project: Mesos
> Issue Type: Wish
> Components: build
> Affects Versions: 1.8.0
> Reporter: Benjamin Bannier
> Priority: Major
> Labels: mesosphere, packaging, rpm
>
> We should consider enabling launcher sealing in the Mesos RPM packages. 
> Because this feature is built conditionally, it is hard to write, e.g., 
> module code against Mesos packages, as required functions might be missing 
> (e.g., 
> [https://github.com/dcos/dcos-mesos-modules/commit/8ce70e6cc789054831daa3058647e326b2b11bc9]
>  can no longer be linked against the default RPM package). The RPM's target 
> platform, CentOS 7, should include a recent enough kernel for this.





[jira] [Created] (MESOS-9677) RPM packages should be built with launcher sealing

2019-03-25 Thread Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-9677:
---

 Summary: RPM packages should be built with launcher sealing
 Key: MESOS-9677
 URL: https://issues.apache.org/jira/browse/MESOS-9677
 Project: Mesos
 Issue Type: Wish
 Components: build
 Affects Versions: 1.8.0
 Reporter: Benjamin Bannier


We should consider enabling launcher sealing in the Mesos RPM packages. 
Because this feature is built conditionally, it is hard to write, e.g., 
module code against Mesos packages, as required functions might be missing 
(e.g., 
[https://github.com/dcos/dcos-mesos-modules/commit/8ce70e6cc789054831daa3058647e326b2b11bc9]
 can no longer be linked against the default RPM package). The RPM's target 
platform, CentOS 7, should include a recent enough kernel for this.


