[jira] [Commented] (MESOS-6414) Task cleanup fails when the containers includes cgroups not owned by Mesos
    [ https://issues.apache.org/jira/browse/MESOS-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15589394#comment-15589394 ]

Anindya Sinha commented on MESOS-6414:
--------------------------------------

Let us assume a task is launched which creates a sub-cgroup through an external service. So, the cgroup hierarchy is something like:
/sys/fs/cgroup/freezer/mesos//

Say the task fails, so the container exits. When launcher->destroy() is called, we do a recursive cgroups::get() to get all cgroups, and we get absolute paths for both the container's cgroup and the sub-cgroup. TasksKiller() is then initiated for the container's cgroup as well as the sub-cgroup, resulting in freeze(), thaw(), etc. for each of them in parallel, followed by a killed().

However, since the sub-cgroup is created by an external service, that service may clean up the sub-cgroup without Mesos' knowledge. If that happens, any of the cleanup operations (freeze(), thaw(), etc.) for the sub-cgroup may fail in the flow of TasksKiller() for the sub-cgroup (since the external service removed /sys/fs/cgroup/freezer/mesos// before Mesos could do a cleanup in TasksKiller). As a result, we exit out of cleanup of the container at that point, which seems incorrect since all cleanup has actually happened.

To avoid this issue (i.e. the race between the external service and Mesos to clean up the sub-cgroup), I am suggesting that we treat a failure in any of these steps as a failure in all cases except when the failure is due to non-existence of the sub-cgroup (i.e. it has already been cleaned up by an external service, so we treat this as a success).

> Task cleanup fails when the containers includes cgroups not owned by Mesos
> --------------------------------------------------------------------------
>
>                 Key: MESOS-6414
>                 URL: https://issues.apache.org/jira/browse/MESOS-6414
>             Project: Mesos
>          Issue Type: Bug
>          Components: cgroups
>            Reporter: Anindya Sinha
>            Assignee: Anindya Sinha
>            Priority: Minor
>
> If a mesos task is launched in a cgroup outside of the context of Mesos,
> Mesos is unaware of that cgroup created in the task context.
> Now when the Mesos task terminates: Mesos tries to clean up all cgroups within
> the top level cgroup it knows about. If the cgroup created in the task
> context exists when LinuxLauncherProcess::destroy() is called but is
> eventually cleaned up by the container before we do a freeze() or thaw() or
> remove(), it fails at those stages leading to an incomplete cleanup of the
> container.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
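The rule proposed in the comment above (a failed cleanup step counts as a failure unless the cgroup no longer exists) can be sketched roughly as follows. This is only an illustration, not the Mesos implementation: it is written in Python rather than C++, it simulates a cgroup with a plain directory, and the names destroy_cgroup, freeze, thaw and remove are hypothetical stand-ins for the steps named in the ticket.

```python
import os
import tempfile


def freeze(cgroup_path):
    # Hypothetical stand-in for freeze(): on a real freezer hierarchy this
    # would write "FROZEN" to freezer.state; here it writes a plain file.
    with open(os.path.join(cgroup_path, "freezer.state"), "w") as f:
        f.write("FROZEN")


def thaw(cgroup_path):
    # Hypothetical stand-in for thaw().
    with open(os.path.join(cgroup_path, "freezer.state"), "w") as f:
        f.write("THAWED")


def remove(cgroup_path):
    # Hypothetical stand-in for remove(). A plain directory must be empty
    # before rmdir, so the simulated state file is deleted first.
    state = os.path.join(cgroup_path, "freezer.state")
    if os.path.exists(state):
        os.remove(state)
    os.rmdir(cgroup_path)


def destroy_cgroup(cgroup_path, steps=(freeze, thaw, remove)):
    """Run the cleanup steps in order.

    Returns "cleaned" if every step ran, or "already_gone" if a step failed
    because the cgroup had already been removed by someone else. Any other
    failure is re-raised, so it still counts as a failed cleanup.
    """
    for step in steps:
        try:
            step(cgroup_path)
        except OSError:
            if not os.path.isdir(cgroup_path):
                # The external owner won the race; nothing is left to clean.
                return "already_gone"
            raise
    return "cleaned"


if __name__ == "__main__":
    base = tempfile.mkdtemp()
    owned = os.path.join(base, "owned")
    os.mkdir(owned)
    print(destroy_cgroup(owned))                           # cgroup still present
    print(destroy_cgroup(os.path.join(base, "vanished")))  # removed externally
    os.rmdir(base)
```

The existence check happens only after a step has failed, so a cgroup that is present but cannot be frozen or removed for some other reason (for example, a permission error) is still reported as a failure.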
[jira] [Commented] (MESOS-6414) Task cleanup fails when the containers includes cgroups not owned by Mesos
    [ https://issues.apache.org/jira/browse/MESOS-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15589364#comment-15589364 ]

haosdent commented on MESOS-6414:
---------------------------------

Hi [~gilbert], I chatted with [~anindya.sinha] earlier. He means that cgroup destruction races between the Docker daemon and the Mesos agent when Docker is launched inside the Mesos container. Let me update the ticket.
[jira] [Commented] (MESOS-6414) Task cleanup fails when the containers includes cgroups not owned by Mesos
    [ https://issues.apache.org/jira/browse/MESOS-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15589302#comment-15589302 ]

Gilbert Song commented on MESOS-6414:
-------------------------------------

[~anindya.sinha], would you mind providing more context about why you want a Mesos task launched in a cgroup which is not created by Mesos? LinuxLauncher::destroy() would clean up all cgroups which are created by fork(). It assumes all cgroups under the freezer hierarchy were previously created by Mesos.

Or, as [~haosd...@gmail.com] mentioned, are you asking for cgroup namespace support?
[jira] [Commented] (MESOS-6414) Task cleanup fails when the containers includes cgroups not owned by Mesos
    [ https://issues.apache.org/jira/browse/MESOS-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587476#comment-15587476 ]

haosdent commented on MESOS-6414:
---------------------------------

In addition, could you share the details of how to reproduce this? I would like to verify whether cgroup namespaces could resolve this. [~anindya.sinha]
[jira] [Commented] (MESOS-6414) Task cleanup fails when the containers includes cgroups not owned by Mesos
    [ https://issues.apache.org/jira/browse/MESOS-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587469#comment-15587469 ]

haosdent commented on MESOS-6414:
---------------------------------

I think [~anindya.sinha] means that his tasks create cgroups while running, but those cgroups created by user tasks would be cleaned up by {{LinuxLauncherProcess::destroy()}}. I think the correct way to fix this problem is to use cgroup namespaces.
[jira] [Commented] (MESOS-6414) Task cleanup fails when the containers includes cgroups not owned by Mesos
    [ https://issues.apache.org/jira/browse/MESOS-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586919#comment-15586919 ]

Jie Yu commented on MESOS-6414:
-------------------------------

I cannot fully understand the problem. What do you mean by "a mesos task is launched in a cgroup outside of the context of Mesos"?