On Mon, 27.06.16 16:58, Lee Hambley (lee.hamb...@gmail.com) wrote:

> Hi List,
>
> My company is currently conducting research into the most viable container
> technology that fits our stack (CentOS based) and given our already
> widespread reliance on systemd, I have a personal stake in preferring not
> to introduce other tooling (LXD, the 2nd place leader) into our stack.
>
> I'd like to know what is required to fulfil our use-case (Docker in
> LXD/systemd-nspawn)
>
> Here's what I (think I) know:
>
>    - Docker can't run in systemd-nspawn because cgroup fs is mounted ro,
>    and the systemd-nspawn container sees the entire system's cgroupfs (no
>    namespacing)

There's a patch waiting on GitHub to add cgroup namespace support to
nspawn:

https://github.com/systemd/systemd/pull/3589

I am not a Docker guy, but do note that nspawn payloads have write
access to the name=systemd hierarchy below their subtree, and can
delegate that further. Hence Docker could work, if it wanted to, as long
as it turns on delegation in its service or asks for a scope with
delegation turned on. nspawn itself is actually fine with running inside
of nspawn (or at least used to be, I haven't tested this in a while).

Note that delegation of resource controllers is not safe on cgroupsv1
however, and nspawn hence makes all resource controllers (meaning: all
of "cpu", "memory", "blkio", …) read-only. This will become safe with
cgroupsv2. Effectively this means that you can set resource limits on
the outermost container, but not on anything inside of it.

>    - cgroups filesystem normally mounted ro in containers, to protect the
>    host (or, something related to privileged containers)

Well, it's not that easy. Today, systemd makes all cgroup controller
hierarchies read-only, except for the name=systemd named hierarchy,
where everything above the container's cgroup subtree is read-only, but
the subtree itself is writable.

>    - When mounted rw it can break the host (not the worst problem in the
>    world, we're not defending against malice here, but apparently it's
>    trivial to brick the host by having systemd fight over ttys, etc)

Well, if we'd mount all cgroup hierarchies writable, including the
various resource controller hierarchies, and everything above the
container's subtree in the name=systemd hierarchy, then this would be a
major security problem. First of all, controller delegation is not safe
on cgroupsv1 (as mentioned above), and secondly this would enable the
container to interfere with the host's cgroup tree, which is highly
problematic.

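To see which hierarchies a payload actually gets read-only or writable, one can look at the cgroup entries in /proc/self/mountinfo from inside the container. What follows is only a minimal sketch under stated assumptions: the names CgroupMount and parse_cgroup_mounts are made up for illustration and are not part of systemd or nspawn, and the check only looks at the per-mount "ro" flag, so file permissions inside a mount may restrict writes further.

```python
from typing import List, NamedTuple


class CgroupMount(NamedTuple):
    """One cgroup mount as seen in /proc/self/mountinfo (hypothetical helper type)."""
    mount_point: str
    fs_type: str        # "cgroup" (v1) or "cgroup2" (v2)
    super_options: str  # e.g. "rw,memory" or "rw,xattr,name=systemd"
    writable: bool      # True if the mount itself is not flagged "ro"


def parse_cgroup_mounts(mountinfo_text: str) -> List[CgroupMount]:
    """Return the cgroup mounts found in mountinfo-formatted text.

    mountinfo fields: ID, parent ID, major:minor, root, mount point,
    per-mount options, zero or more optional fields, a "-" separator,
    filesystem type, source, per-superblock options.
    """
    mounts = []
    for line in mountinfo_text.splitlines():
        fields = line.split()
        try:
            # The separator can only appear after the six fixed fields.
            sep = fields.index("-", 6)
        except ValueError:
            continue  # malformed or empty line
        if len(fields) < sep + 4:
            continue  # filesystem type, source or options missing
        fs_type = fields[sep + 1]
        if fs_type not in ("cgroup", "cgroup2"):
            continue
        mount_options = fields[5].split(",")
        mounts.append(
            CgroupMount(
                mount_point=fields[4],
                fs_type=fs_type,
                super_options=fields[sep + 3],
                writable="ro" not in mount_options,
            )
        )
    return mounts
```

On a Linux system this could be called as parse_cgroup_mounts(open("/proc/self/mountinfo").read()). Going by the description above, inside an nspawn container on cgroupsv1 one would expect the resource controller mounts to show up as not writable, and a writable entry only for the container's own subtree of the name=systemd hierarchy.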
That said, containers on Linux are not really a security concept anyway;
there are more holes in the entire model than in Swiss cheese. But we
should at least close the holes we are aware of.

>    - it might be fair to say that privileged containers
>    - namespaced cgroups are relatively new in Linux
>    - available 4.6 [1]
>    - backported to 4.4+ on Ubuntu kernels
>    - We think LXD does something around setns() [2] to make sure that the
>    container has a correct view of the cgroup "subtree".

Yes, cgroup namespaces are very new. Also, they only make full sense on
cgroupsv2, as delegation isn't safe on cgroupsv1 anyway.

> I suspect something can be done in .nspawn files to grant certain
> privileges to work around issues related to ro/rw cgroups trees, etc but I
> think systemd-nspawn has to know about creating the correct cgroup
> hierarchy before passing control to the

As mentioned, if Docker wants to, it could work just fine inside of an
nspawn container. It won't have access to any controllers, but it gets
enough write access to delegate things further.

Lennart

--
Lennart Poettering, Red Hat
_______________________________________________
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel