[jira] [Commented] (MESOS-9319) Move root filesystem creation to the `filesystem/linux` isolator.
[ https://issues.apache.org/jira/browse/MESOS-9319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16699772#comment-16699772 ]

James Peach commented on MESOS-9319:

Updated patch series:

| [r/69211|https://reviews.apache.org/r/69211] | Improved the code comments for `getContainerDevicesPath`. |
| [r/69210|https://reviews.apache.org/r/69210] | Used the MS_SILENT mount flag to elide unwanted logging. |
| [r/69086|https://reviews.apache.org/r/69086] | Moved the container root construction to the isolators. |
| [r/69450|https://reviews.apache.org/r/69450] | Applied the `ContainerMountInfo` protobuf helper. |

> Move root filesystem creation to the `filesystem/linux` isolator.
> -----------------------------------------------------------------
>
>                 Key: MESOS-9319
>                 URL: https://issues.apache.org/jira/browse/MESOS-9319
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>            Reporter: James Peach
>            Assignee: James Peach
>            Priority: Major
>
> When using a custom user namespace isolator, the task fails at launch because
> opening devices fails with an EPERM error. This problem is described in [this
> systemd issue|https://github.com/systemd/systemd/pull/9483] and [this
> lxd|https://github.com/lxc/lxd/issues/4950] issue.
> The problem arises in the Mesos containerizer due to the order of operations:
> # Clone the containerizer with {{CLONE_NEWNS}}
> # Mount a tmpfs for the devices
> # mknod for the various device nodes
> Referring back to the lxd issue, because we do (1) before (2), the tmpfs on
> {{/dev}} is marked {{SB_I_NODEV}}. Due to the new 4.18 behavior, the mknod in
> (3) now succeeds (see commit
> [55956b59df33|https://github.com/torvalds/linux/commit/55956b59df336f6738da916dbb520b6e37df9fbd]).
> Previously it would fail and we would fall back to bind mounting the device.
> However, even though we created the device, we can't actually open it due to
> the {{SB_I_NODEV}} flag on the tmpfs mount. It appears that the purpose of
> allowing mknod is so that containers can create overlayfs whiteouts.
> One approach to deal with this in the Mesos containerizer is to complete the
> device node cleanup that was begun with the linux/devices isolator. This
> approach involves moving all the responsibility for creating devices back to
> the isolators. Then, at containerization time, we simply bind-mount the whole
> of /dev from the per-container staging area. Since the isolators create the
> devices in the host namespace and on the Mesos work directory, none of the
> conditions that trigger the failure would be met.
> The failure we observed with our tasks was a failure to open {{/dev/null}}
> when redirecting it as standard input to a child process.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
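
The failure sequence described in the issue above can be modelled without root privileges. What follows is a minimal Python sketch, not Mesos code: `Device`, `create_device`, `refusing_mknod` and `permissive_mknod` are hypothetical names introduced only for illustration, and the two stand-in functions merely imitate the kernel behaviour the issue describes. The sketch shows why a fallback that triggers only when mknod fails stops helping once mknod starts to succeed.

```python
import errno
import os
import stat
from typing import Callable, NamedTuple


class Device(NamedTuple):
    name: str
    major: int
    minor: int


# /dev/null is character device 1:3 on Linux.
DEV_NULL = Device("null", 1, 3)


def create_device(
    dev_dir: str,
    device: Device,
    mknod: Callable[[str, int, int], None] = os.mknod,
) -> str:
    """Try to create the device node.

    Returns "mknod" if the node was created, or "bind-mount" if mknod was
    refused with EPERM and the caller should bind mount the host device.
    """
    path = os.path.join(dev_dir, device.name)
    try:
        mknod(path, stat.S_IFCHR | 0o666, os.makedev(device.major, device.minor))
    except OSError as e:
        if e.errno == errno.EPERM:
            return "bind-mount"
        raise
    return "mknod"


def refusing_mknod(path: str, mode: int, dev: int) -> None:
    # Hypothetical stand-in for the older behaviour described in the issue:
    # mknod inside the user namespace is refused with EPERM.
    raise PermissionError(errno.EPERM, "Operation not permitted", path)


def permissive_mknod(path: str, mode: int, dev: int) -> None:
    # Hypothetical stand-in for the 4.18 behaviour: the call succeeds, but
    # (per the issue) the node cannot be opened on an SB_I_NODEV mount.
    return None


print(create_device("/tmp/dev", DEV_NULL, mknod=refusing_mknod))
print(create_device("/tmp/dev", DEV_NULL, mknod=permissive_mknod))
```

With `permissive_mknod` the bind-mount fallback is never selected, which matches the report: the node exists, yet opening it later fails with EPERM.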
[jira] [Commented] (MESOS-9319) Move root filesystem creation to the `filesystem/linux` isolator.
[ https://issues.apache.org/jira/browse/MESOS-9319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16667915#comment-16667915 ]

James Peach commented on MESOS-9319:

Retitling, based on a slightly expanded scope from review feedback. Rather than just building /dev in the Linux filesystem isolator, we are going to build the whole root filesystem.

| [r/69086|https://reviews.apache.org/r/69086] | Moved container root construction to the isolators. |
| [r/69211|https://reviews.apache.org/r/69211] | Improved the code comments for `getContainerDevicesPath`. |
| [r/69210|https://reviews.apache.org/r/69210] | Used the MS_SILENT mount flag to elide unwanted logging. |

> Move root filesystem creation to the `filesystem/linux` isolator.
> -----------------------------------------------------------------
>
>                 Key: MESOS-9319
>                 URL: https://issues.apache.org/jira/browse/MESOS-9319
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>            Reporter: James Peach
>            Assignee: James Peach
>            Priority: Major
>
> When using a custom user namespace isolator, the task fails at launch because
> opening devices fails with an EPERM error. This problem is described in [this
> systemd issue|https://github.com/systemd/systemd/pull/9483] and [this
> lxd|https://github.com/lxc/lxd/issues/4950] issue.
> The problem arises in the Mesos containerizer due to the order of operations:
> # Clone the containerizer with {{CLONE_NEWNS}}
> # Mount a tmpfs for the devices
> # mknod for the various device nodes
> Referring back to the lxd issue, because we do (1) before (2), the tmpfs on
> {{/dev}} is marked {{SB_I_NODEV}}. Due to the new 4.18 behavior, the mknod in
> (3) now succeeds (see commit
> [55956b59df33|https://github.com/torvalds/linux/commit/55956b59df336f6738da916dbb520b6e37df9fbd]).
> Previously it would fail and we would fall back to bind mounting the device.
> However, even though we created the device, we can't actually open it due to
> the {{SB_I_NODEV}} flag on the tmpfs mount. It appears that the purpose of
> allowing mknod is so that containers can create overlayfs whiteouts.
> One approach to deal with this in the Mesos containerizer is to complete the
> device node cleanup that was begun with the linux/devices isolator. This
> approach involves moving all the responsibility for creating devices back to
> the isolators. Then, at containerization time, we simply bind-mount the whole
> of /dev from the per-container staging area. Since the isolators create the
> devices in the host namespace and on the Mesos work directory, none of the
> conditions that trigger the failure would be met.
> The failure we observed with our tasks was a failure to open {{/dev/null}}
> when redirecting it as standard input to a child process.
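
The approach proposed in the issue description (isolators create the device nodes in a per-container staging area, and the launcher then bind-mounts that area over /dev) can be outlined as an ordered plan. The Python sketch below only builds a list of steps and performs no mounts. Everything in it is an assumption made for illustration: the `Step` type, the `plan_container_dev` function, the `containers/<id>/devices` staging layout and the device subset are hypothetical and are not taken from the Mesos source.

```python
import os
from typing import List, NamedTuple


class Step(NamedTuple):
    phase: str   # "isolator" (host namespace) or "launch" (container mount namespace)
    action: str  # "mkdir", "mknod" or "bind-mount"
    target: str
    source: str = ""


# Illustrative subset of character devices: name -> (major, minor).
DEVICES = {
    "null": (1, 3),
    "zero": (1, 5),
    "full": (1, 7),
    "random": (1, 8),
    "urandom": (1, 9),
    "tty": (5, 0),
}


def plan_container_dev(work_dir: str, container_id: str, rootfs: str) -> List[Step]:
    """Return the ordered steps for building /dev via a staging directory."""
    # Hypothetical staging layout under the agent work directory.
    staging = os.path.join(work_dir, "containers", container_id, "devices")

    # Phase 1: the isolator creates the nodes in the host namespace, on the
    # work directory's filesystem rather than on a tmpfs mounted after clone.
    steps = [Step("isolator", "mkdir", staging)]
    for name in DEVICES:
        steps.append(Step("isolator", "mknod", os.path.join(staging, name)))

    # Phase 2: at launch, the whole staging directory is bind-mounted over
    # the container's /dev; no mknod happens inside the new namespace.
    steps.append(Step("launch", "bind-mount", os.path.join(rootfs, "dev"), staging))
    return steps


for step in plan_container_dev("/var/lib/mesos", "c1", "/rootfs"):
    print(step)
```

The point of the ordering is that every mknod belongs to the "isolator" phase, so the only operation left for the container's mount namespace is a single bind mount.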