Re: [systemd-devel] Use of namespaced cgroups (aka Docker in systemd-nspawn containers)
On Mon, 27.06.16 16:58, Lee Hambley (lee.hamb...@gmail.com) wrote:

> Hi List,
>
> My company is currently conducting research into the most viable container
> technology that fits our stack (CentOS based) and given our already
> widespread reliance on systemd, I have a personal stake in preferring not
> to introduce other tooling (LXD, the 2nd place leader) into our stack.
>
> I'd like to know what is required to fulfil our use-case (Docker in
> LXD/systemd-nspawn)
>
> Here's what I (think I) know:
>
> - Docker can't run in systemd-nspawn because cgroup fs is mounted ro,
>   and the systemd-nspawn container sees the entire system's cgroupfs (no
>   namespacing)

There's a patch waiting on GitHub to add cgroup namespace support to
nspawn:

https://github.com/systemd/systemd/pull/3589

I am not a Docker guy, but do note that nspawn payloads have write access
to the name=systemd hierarchy below their subtree, and can delegate that
further. Hence Docker could work, if it wanted to, as long as it turns on
delegation in its service or asks for a scope with delegation turned on.
nspawn itself is actually fine with running inside of nspawn (or at least
it used to be; I haven't tested this in a while).

Note that delegation of resource controllers is not safe on cgroupv1,
however, and nspawn hence makes all resource controllers (meaning: all of
"cpu", "memory", "blkio", …) read-only. This will become safe with
cgroupv2. Effectively this means that you can set resource limits on the
outermost container, but not on anything inside of it.

> - cgroups filesystem normally mounted ro in containers, to protect the
>   host (or, something related to privileged containers)

Well, it's not that easy. Today, systemd makes all cgroup controller
hierarchies read-only, except for the name=systemd named hierarchy, where
everything above the container's cgroup subtree is read-only, but the
subtree itself is writable.
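The read-only/writable split described above can be inspected from inside a
container. What follows is only a minimal sketch in Python, assuming the
conventional cgroupv1 mount layout under /sys/fs/cgroup (name=systemd mounted
at /sys/fs/cgroup/systemd, controller hierarchies mounted under their
controller list) and assuming the paths listed in /proc/self/cgroup resolve
below those mounts. The function names are made up for illustration and are
not part of systemd or nspawn.

```python
import os


def parse_proc_cgroup(text):
    """Parse the content of /proc/<pid>/cgroup into {hierarchy: path}.

    Each line has the form "hierarchy-ID:controller-list:cgroup-path".
    """
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The path may itself contain colons, so split at most twice.
        _hierarchy_id, controllers, path = line.split(":", 2)
        result[controllers] = path
    return result


def mount_dir_name(controllers):
    """Map a controller list to its directory name under /sys/fs/cgroup.

    Assumption: named hierarchies such as "name=systemd" are mounted
    without the "name=" prefix, controller hierarchies under their list.
    """
    if controllers.startswith("name="):
        return controllers[len("name="):]
    return controllers


def writable_report(text, root="/sys/fs/cgroup"):
    """Return {hierarchy: bool}: is our own cgroup directory writable?"""
    report = {}
    for controllers, path in parse_proc_cgroup(text).items():
        if not controllers:
            continue  # cgroupv2 entry ("0::/path"), not covered by this sketch
        directory = os.path.join(root, mount_dir_name(controllers),
                                 path.lstrip("/"))
        # os.access() is False for read-only mounts and for missing paths.
        report[controllers] = os.access(directory, os.W_OK)
    return report


if __name__ == "__main__" and os.path.exists("/proc/self/cgroup"):
    with open("/proc/self/cgroup") as f:
        for name, ok in sorted(writable_report(f.read()).items()):
            print(f"{name}: {'writable' if ok else 'read-only or missing'}")
```

Run inside an nspawn container on cgroupv1, one would expect name=systemd to
show up as writable and the resource controllers as read-only, provided the
assumptions about the mount layout hold.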
>   - When mounted rw it can break the host (not the worst problem in the
>     world, we're not defending against malice here, but apparently it's
>     trivial to brick the host by having systemd fight over ttys, etc)

Well, if we mounted all cgroup hierarchies writable, including the various
resource controller hierarchies and everything above the container's
subtree in the name=systemd hierarchy, then this would be a major security
problem. First of all, controller delegation is not safe on cgroupv1 (as
mentioned above), and secondly this would enable the container to
interfere with the host's cgroup tree, which is highly problematic.

That said, containers on Linux are not really a security concept anyway;
there are more holes in the entire model than in Swiss cheese. But we
should at least close the holes we are aware of.

>   - it might be fair to say that privileged containers
> - namespaced cgroups are relatively new in linux
>   - available 4.6 [1]
>   - backported to 4.4+ on Ubuntu kernels
> - We think LXD does something around setns() [2] to make sure that the
>   container has a correct view of the cgroup "subtree".

Yes, cgroup namespaces are very new. Also, they only make full sense on
cgroupv2, as delegation isn't safe on cgroupv1 anyway.

> I suspect something can be done in .nspawn files to grant certain
> privileges to work around issues related to ro/rw cgroups trees, etc but I
> think systemd-nspawn has to know about creating the correct cgroup
> hierarchy before passing control to the

As mentioned, if Docker wants to, it could work just fine inside of an
nspawn container. It won't have access to any controllers, but it gets
enough write access to delegate things further.

Lennart

--
Lennart Poettering, Red Hat
_______________________________________________
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel
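Because cgroup namespaces arrived upstream in 4.6 but may be backported by
distribution kernels, probing for the namespace file is probably more
reliable than comparing version numbers. A small sketch of both checks in
Python follows; the helper names are hypothetical, while
/proc/<pid>/ns/cgroup is the entry the kernel exposes when cgroup namespaces
are available.

```python
import os
import re


def kernel_at_least(release, major, minor):
    """Compare a kernel release string such as "4.4.0-28-generic"
    against a (major, minor) pair. Unparseable strings yield False."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        return False
    return (int(match.group(1)), int(match.group(2))) >= (major, minor)


def has_cgroup_namespaces(proc_root="/proc"):
    """Check for the per-process cgroup namespace entry.

    lexists() is used because the entry is a symlink-like magic link;
    only its presence matters here, not what it points to.
    """
    return os.path.lexists(os.path.join(proc_root, "self", "ns", "cgroup"))


if __name__ == "__main__":
    release = os.uname().release
    print("kernel release:", release)
    print("upstream version >= 4.6:", kernel_at_least(release, 4, 6))
    print("cgroup namespace entry present:", has_cgroup_namespaces())
```

On a kernel with a backport, the version check can report False while the
namespace entry is present, which is why the second check is the one to
trust.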
Re: [systemd-devel] Use of namespaced cgroups (aka Docker in systemd-nspawn containers)
On Mon, Jun 27, 2016 at 4:58 PM, Lee Hambley wrote:

> Hi List,
>
> My company is currently conducting research into the most viable container
> technology that fits our stack (CentOS based) and given our already
> widespread reliance on systemd, I have a personal stake in preferring not to
> introduce other tooling (LXD, the 2nd place leader) into our stack.
>
> I'd like to know what is required to fulfil our use-case (Docker in
> LXD/systemd-nspawn)

Hi Lee,

You may want to look into rkt[1] if you're on CentOS 7. By default it uses
systemd-nspawn to set up the containerized environment, and it's designed
to work and integrate well with systemd. If you want to talk more about
it, it'd probably be best to take the conversation to the rkt-dev list[2]
or the #rkt-dev Freenode channel.

Cheers,
Chris

Disclaimer: My company contributes to rkt.

[1] https://github.com/coreos/rkt
[2] https://groups.google.com/forum/#!forum/rkt-dev

> Here's what I (think I) know:
>
> - Docker can't run in systemd-nspawn because cgroup fs is mounted ro, and
>   the systemd-nspawn container sees the entire system's cgroupfs (no
>   namespacing)
> - cgroups filesystem normally mounted ro in containers, to protect the host
>   (or, something related to privileged containers)
>   - When mounted rw it can break the host (not the worst problem in the
>     world, we're not defending against malice here, but apparently it's
>     trivial to brick the host by having systemd fight over ttys, etc)
>   - it might be fair to say that privileged containers
> - namespaced cgroups are relatively new in linux
>   - available 4.6 [1]
>   - backported to 4.4+ on Ubuntu kernels
> - We think LXD does something around setns() [2] to make sure that the
>   container has a correct view of the cgroup "subtree".
>
> I suspect something can be done in .nspawn files to grant certain privileges
> to work around issues related to ro/rw cgroups trees, etc but I think
> systemd-nspawn has to know about creating the correct cgroup hierarchy
> before passing control to the
>
> Please excuse the "idiot knows what he's talking about" tone; I'm very deep
> into this stuff today, and not in a good way.
>
> Thanks sincerely,
>
> ---
>
> [1]: https://www.phoronix.com/scan.php?page=news_item&px=CGroup-Namespaces-Linux-4.6
> [2]: https://github.com/lxc/lxd/blob/c8a2956fae6d5d2092e17a3229e4640b53c8a854/lxd/nsexec.go#L107-L126
>
> Lee Hambley
> http://lee.hambley.name/
> +49 (0) 170 298 5667
[systemd-devel] Use of namespaced cgroups (aka Docker in systemd-nspawn containers)
Hi List,

My company is currently conducting research into the most viable container
technology that fits our stack (CentOS based) and given our already
widespread reliance on systemd, I have a personal stake in preferring not
to introduce other tooling (LXD, the 2nd place leader) into our stack.

I'd like to know what is required to fulfil our use-case (Docker in
LXD/systemd-nspawn)

Here's what I (think I) know:

- Docker can't run in systemd-nspawn because cgroup fs is mounted ro, and
  the systemd-nspawn container sees the entire system's cgroupfs (no
  namespacing)
- cgroups filesystem normally mounted ro in containers, to protect the host
  (or, something related to privileged containers)
  - When mounted rw it can break the host (not the worst problem in the
    world, we're not defending against malice here, but apparently it's
    trivial to brick the host by having systemd fight over ttys, etc)
  - it might be fair to say that privileged containers
- namespaced cgroups are relatively new in linux
  - available 4.6 [1]
  - backported to 4.4+ on Ubuntu kernels
- We think LXD does something around setns() [2] to make sure that the
  container has a correct view of the cgroup "subtree".

I suspect something can be done in .nspawn files to grant certain
privileges to work around issues related to ro/rw cgroups trees, etc but I
think systemd-nspawn has to know about creating the correct cgroup
hierarchy before passing control to the

Please excuse the "idiot knows what he's talking about" tone; I'm very deep
into this stuff today, and not in a good way.

Thanks sincerely,

---

[1]: https://www.phoronix.com/scan.php?page=news_item&px=CGroup-Namespaces-Linux-4.6
[2]: https://github.com/lxc/lxd/blob/c8a2956fae6d5d2092e17a3229e4640b53c8a854/lxd/nsexec.go#L107-L126

Lee Hambley
http://lee.hambley.name/
+49 (0) 170 298 5667