[jira] [Commented] (MESOS-9319) Create all container devices at isolation time.

2018-10-16 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652677#comment-16652677
 ] 

James Peach commented on MESOS-9319:


The prototype code looks promising. Currently, /dev is a tmpfs, but in this 
proposal it would be a bind mount of a directory on a real filesystem. I'm 
bind mounting it read-only to prevent disk quota escapes, which seems to work 
OK.
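
For illustration only, here is a minimal sketch of how a read-only bind mount 
is usually set up on Linux. It is not the prototype code. The function names 
and paths are hypothetical. It assumes the common two-step sequence: a plain 
MS_BIND mount followed by a remount that adds MS_RDONLY. Applying the steps 
needs mount privileges; building the list of steps does not.

```python
import ctypes
import ctypes.util
import os

# Mount flag values from <sys/mount.h> on Linux.
MS_RDONLY = 1
MS_REMOUNT = 32
MS_BIND = 4096


def readonly_bind_steps(source, target):
    """Return the two mount(2) calls for a read-only bind mount.

    Step 1 binds source onto target. Step 2 remounts that bind mount
    read-only.
    """
    return [
        (source, target, MS_BIND),
        (None, target, MS_BIND | MS_REMOUNT | MS_RDONLY),
    ]


def apply_steps(steps):
    """Run the steps with mount(2). Needs CAP_SYS_ADMIN (typically root)."""
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    libc.mount.argtypes = (
        ctypes.c_char_p,  # source
        ctypes.c_char_p,  # target
        ctypes.c_char_p,  # filesystem type (unused for bind mounts)
        ctypes.c_ulong,   # mount flags
        ctypes.c_void_p,  # data
    )
    for source, target, flags in steps:
        src = source.encode() if source is not None else None
        if libc.mount(src, target.encode(), None, flags, None) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err), target)
```

A caller would run something like 
apply_steps(readonly_bind_steps(staging_dev, container_dev)), where both 
paths are placeholders for the per-container staging directory and the 
container's /dev.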

> Create all container devices at isolation time.
> ---
>
> Key: MESOS-9319
> URL: https://issues.apache.org/jira/browse/MESOS-9319
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
>
> When using a custom user namespace isolator, the task fails at launch because 
> opening devices fails with an {{EPERM}} error. This problem is described in 
> [this systemd issue|https://github.com/systemd/systemd/pull/9483] and [this 
> lxd issue|https://github.com/lxc/lxd/issues/4950].
> The problem arises in the Mesos containerizer due to the order of operations:
> # Clone the containerizer with {{CLONE_NEWNS}}
> # Mount a tmpfs for the devices
> # mknod for the various device nodes
> Referring back to the lxd issue, because we do (1) before (2), the tmpfs on 
> {{/dev}} is marked {{SB_I_NODEV}}. Due to the new 4.18 behavior, the mknod in 
> (3) now succeeds (see commit 
> [55956b59df33|https://github.com/torvalds/linux/commit/55956b59df336f6738da916dbb520b6e37df9fbd]). 
> Previously it would fail and we would fall back to bind mounting the device. 
> However, even though we created the device, we can't actually open it due to 
> the {{SB_I_NODEV}} flag on the tmpfs mount. It appears that the purpose of 
> allowing mknod is so that containers can create overlayfs whiteouts.
> One approach to deal with this in the Mesos containerizer is to complete the 
> device node cleanup that was begun with the linux/devices isolator. This 
> approach involves moving all the responsibility for creating devices back to 
> the isolators. Then, at containerization time, we simply bind-mount the whole 
> of /dev from the per-container staging area. Since the isolators create the 
> devices in the host namespace and on the Mesos work directory, none of the 
> conditions that trigger the failure would be invoked.
> The failure we observed with our tasks was a failure to open {{/dev/null}}, 
> when redirecting it as standard input to a child process.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9319) Create all container devices at isolation time

2018-10-15 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650806#comment-16650806
 ] 

James Peach commented on MESOS-9319:


When using a custom user namespace isolator, the task fails at launch because 
opening devices fails with a {{EPERM}} error. This problem is described in 
[this systemd issue|https://github.com/systemd/systemd/pull/9483] and this [lxd 
issue|https://github.com/lxc/lxd/issues/4950].

The problem arises in the Mesos containerizer due to the order of operations:

# Clone the containerizer with CLONE_NEWNS
# Mount a tmpfs for the devices
# mknod for the various device nodes

Referring back to the lxd issue, because we do (1) before (2), the tmpfs on 
/dev is marked SB_I_NODEV. Due to the new 4.18 behavior, the mknod in (3) now 
succeeds (see commit 
[55956b59df33|https://github.com/torvalds/linux/commit/55956b59df336f6738da916dbb520b6e37df9fbd]). 
Previously it would fail and we would fall back to bind mounting the device. 
However, even though we created the device, we can't actually open it due to 
the SB_I_NODEV flag on the tmpfs mount. It appears that the purpose of allowing 
mknod is so that containers can create overlayfs whiteouts.
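
To make the before-and-after behavior easier to follow, the cases described 
above can be written as a small decision table. This is a toy model of the 
report, not kernel or Mesos code. The function name and its return keys are 
hypothetical, and it only restates what this issue describes.

```python
def device_outcome(kernel, sb_i_nodev):
    """Toy model of steps (2) and (3) as described in this issue.

    kernel: (major, minor) tuple such as (4, 18).
    sb_i_nodev: True when the tmpfs on /dev is marked SB_I_NODEV.
    """
    if not sb_i_nodev:
        # Ordinary case: the device node is created and can be opened.
        return {"mknod": "succeeds", "bind_fallback": False, "open": "succeeds"}
    if kernel >= (4, 18):
        # mknod now succeeds, so the fallback never runs, but the node
        # cannot be opened because of SB_I_NODEV.
        return {"mknod": "succeeds", "bind_fallback": False, "open": "fails"}
    # Older kernels: mknod fails, so the containerizer falls back to bind
    # mounting the host device, which can be opened.
    return {"mknod": "fails", "bind_fallback": True, "open": "succeeds"}
```

The middle branch is the regression: the successful mknod hides the problem 
until the task tries to open the device.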

One approach to deal with this in the Mesos containerizer is to complete the 
device node cleanup that was begun with the linux/devices isolator. This 
approach involves moving all the responsibility for creating devices back to 
the isolators. Then, at containerization time, we simply bind-mount the whole 
of /dev from the per-container staging area. Since the isolators create the 
devices in the host namespace and on the Mesos work directory, none of the 
conditions that trigger the failure would be invoked.
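
As a rough sketch of the staging idea, assuming a hypothetical per-container 
staging directory and an illustrative device list (neither is taken from the 
Mesos source), the nodes could be created in the host namespace like this. 
The major and minor numbers are the standard Linux ones for these character 
devices. Creating the nodes needs CAP_MKNOD; computing the plan does not.

```python
import os
import stat

# Illustrative set of character devices: name -> (major, minor).
DEVICES = {
    "null": (1, 3),
    "zero": (1, 5),
    "full": (1, 7),
    "random": (1, 8),
    "urandom": (1, 9),
    "tty": (5, 0),
}


def staging_plan(staging_dir):
    """Return (path, mode, device number) for each node, sorted by name."""
    return [
        (
            os.path.join(staging_dir, name),
            stat.S_IFCHR | 0o666,
            os.makedev(major, minor),
        )
        for name, (major, minor) in sorted(DEVICES.items())
    ]


def create_staged_devices(staging_dir):
    """Create the nodes on the host. Needs CAP_MKNOD (typically root)."""
    os.makedirs(staging_dir, exist_ok=True)
    for path, mode, device in staging_plan(staging_dir):
        if not os.path.exists(path):
            os.mknod(path, mode, device)
            # mknod applies the umask, so set the permissions explicitly.
            os.chmod(path, 0o666)
```

The staged directory would then be bind mounted over the container's /dev at 
containerization time, so no mknod happens inside the container's namespaces.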


> Create all container devices at isolation time
> --
>
> Key: MESOS-9319
> URL: https://issues.apache.org/jira/browse/MESOS-9319
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: When using a custom user namespace isolator, the task 
> fails at launch because opening devices fails with a {{EPERM}} error. This 
> problem is described in [this systemd 
> issue|https://github.com/systemd/systemd/pull/9483] and this [lxd 
> issue|https://github.com/lxc/lxd/issues/4950].
> The problem arises in the Mesos containerizer due to the order of operations:
> # Clone the containerizer with CLONE_NEWNS
> # Mount a tmpfs for the devices
> # mknod for the various device nodes
> Referring back to the lxd issue, because we do (1) before (2), the tmpfs on 
> /dev is marked SB_I_NODEV. Due to the new 4.18 behavior, the mknod in (3) now 
> succeeds (see commit 
> [55956b59df33|https://github.com/torvalds/linux/commit/55956b59df336f6738da916dbb520b6e37df9fbd]). 
> Previously it would fail and we would fall back to bind mounting the device. 
> However, even though we created the device, we can't actually open it due to 
> the SB_I_NODEV flag on the tmpfs mount. It appears that the purpose of 
> allowing mknod is so that containers can create overlayfs whiteouts.
> One approach to deal with this in the Mesos containerizer is to complete the 
> device node cleanup that was begun with the linux/devices isolator. This 
> approach involves moving all the responsibility for creating devices back to 
> the isolators. Then, at containerization time, we simply bind-mount the whole 
> of /dev from the per-container staging area. Since the isolators create the 
> devices in the host namespace and on the Mesos work directory, none of the 
> conditions that trigger the failure would be invoked.
>Reporter: James Peach
>Priority: Major
>



