Re: [systemd-devel] [Q] About supporting nested systemd daemon
On Thu, 30.04.15 15:42, Alban Crequy (al...@endocode.com) wrote:

> > systemd-nspawn nowadays mounts all hierarchies into the container,
> > but mounts all controller hierarchies read-only, and of the
> > name=systemd hierarchy mounts everything read-only, except the
> > subtree the container is allowed to manage. That way only the
> > cgroup tree the container needs access to is writable to it.
> >
> > That solution however does not hide the cgroup tree. A process
> > running inside the container can still go and explore the tree and
> > its attributes. However, all other groups will appear empty to it,
> > since processes not in the container's PID namespace will be
> > suppressed when reading the member process list.
>
> To sum up what systemd-nspawn is currently mounting in the container:
>
> - /sys/fs/cgroup/systemd/ -- mounted RO
> - /sys/fs/cgroup/systemd/machine.slice/machine-xxx.scope/ -- mounted RW
> - /sys/fs/cgroup/cpu,cpuacct/ -- mounted RO
> - etc. for other cgroup hierarchies -- mounted RO

Correct.

> In order to let systemd in the container restrict cpu, memory,
> etc. on some of its services (see manpage
> systemd.resource-control(5)), rkt would like systemd-nspawn to mount
> a subtree of some hierarchy (cpu,cpuacct, memory) in read-write
> mode.

That's really not a safe thing to do right now... The kernel isn't
ready for this, as cgroups access is an all-or-nothing thing
currently: if you have access to a cgroup and can create child cgroups
in it, you have access to *all* attributes you like, the dangerous
ones as well as the not so dangerous ones.

> Are there any issues with changing the systemd-nspawn mounts in the
> following way:
>
> - /sys/fs/cgroup/systemd/ -- mounted RO
> - /sys/fs/cgroup/systemd/machine.slice/machine-xxx.scope/ -- mounted RW
> - /sys/fs/cgroup/cpu,cpuacct/ -- mounted RO
> - /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-xxx.scope/ -- mounted RW
> - etc. for other cgroup hierarchies.
>
> Iago wrote two experimental patches on systemd-nspawn to try that
> and it worked. Delegate=yes was enabled in systemd-nspawn in order
> to test this:
>
> https://github.com/endocode/systemd/commits/iaguis/delegate
>
> But I would like to know what is missing to make this safe (or if
> it is already safe to do).

Well, nspawn does actually not make any guarantees about security
currently. Since we pass CAP_SYS_ADMIN by default to the containers,
people can mount whatever they want and remount things freely from
within. Hence, opening this up would not make things much worse.

That said, I am a bit concerned about opening this up by default. Even
though containers are insecure we should try to be safe wherever we
can if it doesn't affect usability too much.

Adding a new cmdline switch for all of this sounds not too attractive
though, but maybe a --delegate switch would be OK, which would open up
all controllers to the containers. It would then have a similar effect
on the containers as Delegate=yes has for service processes...

Lennart

--
Lennart Poettering, Red Hat
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel
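The two mount layouts compared in this message can be written down as data. The following is only a sketch for illustration: the function name `cgroup_mount_plan` and its `delegate` parameter are hypothetical (loosely modelled on the --delegate idea floated above, not on an existing nspawn option), and the code only builds a list of paths and modes, it does not mount anything.

```python
from typing import List, Tuple

CGROUP_ROOT = "/sys/fs/cgroup"


def cgroup_mount_plan(scope: str, controllers: List[str],
                      delegate: bool = False) -> List[Tuple[str, str]]:
    """Return (path, mode) pairs describing the cgroup mounts for a container.

    `scope` is the container's cgroup path relative to a hierarchy root,
    e.g. "machine.slice/machine-xxx.scope" (placeholder from the thread).
    `delegate` is a hypothetical flag: when set, the container's subtree
    in every controller hierarchy is writable as well.
    """
    plan = [
        # name=systemd hierarchy: read-only, except the container's subtree
        (f"{CGROUP_ROOT}/systemd/", "ro"),
        (f"{CGROUP_ROOT}/systemd/{scope}/", "rw"),
    ]
    for ctrl in controllers:
        # controller hierarchies are read-only at the top
        plan.append((f"{CGROUP_ROOT}/{ctrl}/", "ro"))
        if delegate:
            # proposed change: writable subtree per controller
            plan.append((f"{CGROUP_ROOT}/{ctrl}/{scope}/", "rw"))
    return plan


scope = "machine.slice/machine-xxx.scope"
current = cgroup_mount_plan(scope, ["cpu,cpuacct", "memory"])
proposed = cgroup_mount_plan(scope, ["cpu,cpuacct", "memory"], delegate=True)

for path, mode in proposed:
    print(f"{path} -- mounted {mode.upper()}")
```

The only difference between `current` and `proposed` is one extra read-write entry per controller, which is exactly the part the reply above calls all-or-nothing: a writable cgroup exposes every attribute of that controller to the container.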
Re: [systemd-devel] [Q] About supporting nested systemd daemon
On Wed, Feb 25, 2015 at 6:48 PM, Lennart Poettering lenn...@poettering.net wrote:

> On Wed, 25.02.15 00:05, Cyrill Gorcunov (gorcu...@gmail.com) wrote:
>
> > Hi all! I would really appreciate it if someone could enlighten me
> > whether there is some simple solution for the problem we met in
> > OpenVZ: modern containers are mostly systemd-based, so once a
> > container is started up the systemd daemon mounts its own instance
> > of the systemd cgroup (if it has not previously been pre-mounted by
> > container startup tools or whatever).
> >
> > To make a strict isolation of the nested systemd cgroup (by nested
> > I mean a systemd cgroup instance mounted inside the container)
> > we've patched the kernel so that the container's systemd obtains
> > its own instance of the cgroup, which does not intersect in any way
> > with the one present on the host system.
> >
> > And we would really love to get rid of this kind of kernel hack and
> > be able to isolate a nested systemd with its own cgroup instance
> > using solely userspace tools. Is there some way to achieve this?
>
> Not really. cgroupfs doesn't really allow that. First of all the
> root cgroup has a different set of attributes than child cgroups,
> hence you cannot mount an arbitrary child to the root cgroup and
> assume it works. But even worse, /proc/$PID/cgroup actually contains
> the full cgroup path, and hence mounting only a subtree would break
> the references from that file.
>
> systemd-nspawn nowadays mounts all hierarchies into the container,
> but mounts all controller hierarchies read-only, and of the
> name=systemd hierarchy mounts everything read-only, except the
> subtree the container is allowed to manage. That way only the cgroup
> tree the container needs access to is writable to it.
>
> That solution however does not hide the cgroup tree. A process
> running inside the container can still go and explore the tree and
> its attributes. However, all other groups will appear empty to it,
> since processes not in the container's PID namespace will be
> suppressed when reading the member process list.

To sum up what systemd-nspawn is currently mounting in the container:

- /sys/fs/cgroup/systemd/ -- mounted RO
- /sys/fs/cgroup/systemd/machine.slice/machine-xxx.scope/ -- mounted RW
- /sys/fs/cgroup/cpu,cpuacct/ -- mounted RO
- etc. for other cgroup hierarchies -- mounted RO

In order to let systemd in the container restrict cpu, memory,
etc. on some of its services (see manpage
systemd.resource-control(5)), rkt would like systemd-nspawn to mount a
subtree of some hierarchy (cpu,cpuacct, memory) in read-write mode.

Are there any issues with changing the systemd-nspawn mounts in the
following way:

- /sys/fs/cgroup/systemd/ -- mounted RO
- /sys/fs/cgroup/systemd/machine.slice/machine-xxx.scope/ -- mounted RW
- /sys/fs/cgroup/cpu,cpuacct/ -- mounted RO
- /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-xxx.scope/ -- mounted RW
- etc. for other cgroup hierarchies.

Iago wrote two experimental patches on systemd-nspawn to try that and
it worked. Delegate=yes was enabled in systemd-nspawn in order to test
this:

https://github.com/endocode/systemd/commits/iaguis/delegate

But I would like to know what is missing to make this safe (or if it
is already safe to do).

> There have been proposals on LKML to add cgroup namespacing, but no
> idea where that went.
>
> LXC created a FUSE emulation of /proc and /sys, called lxcfs, to
> solve this problem. Quite honestly I find this a pretty crazy idea
> however.
>
> > If I understand correctly we can provide a separate slice to the
> > container's systemd, leaving the rest of the host cgroup in ro
> > mode, right?
>
> Yes.
>
> > If so, maybe there is a way to hide the host cgroup completely from
> > the container so it would see only its own cgroup in sysfs?
>
> I don't see how this could work. I mean, you could overmount all
> other cgroup siblings with empty directories in the containers, but
> that is not really scalable nor compatible with cgroups being added
> or removed later on...
>
> Lennart
>
> --
> Lennart Poettering, Red Hat
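The overmounting idea quoted above can be made concrete with a small sketch. Everything here is illustrative: `siblings_to_overmount` is a hypothetical helper, the cgroup names are made-up sample data, and the code only computes which directories would need an empty overmount, it performs no mounts.

```python
from typing import Iterable, List


def siblings_to_overmount(all_cgroups: Iterable[str], own: str) -> List[str]:
    """Return the topmost cgroup directories that lie outside `own`.

    A cgroup is left visible if it is an ancestor of `own`, `own` itself,
    or a descendant of `own`. Everything else is reported once, at the
    first path component where it diverges from `own`.
    """
    own_parts = own.strip("/").split("/")
    targets = set()
    for cgroup in all_cgroups:
        if not cgroup.strip("/"):
            continue  # the hierarchy root itself is never overmounted
        parts = cgroup.strip("/").split("/")
        for depth, part in enumerate(parts):
            if depth >= len(own_parts):
                break  # inside the container's own subtree
            if part != own_parts[depth]:
                targets.add("/" + "/".join(parts[:depth + 1]))
                break
    return sorted(targets)


# Sample snapshot of a host cgroup tree (made-up names).
snapshot = [
    "/system.slice",
    "/system.slice/sshd.service",
    "/user.slice",
    "/machine.slice",
    "/machine.slice/machine-xxx.scope",
    "/machine.slice/machine-xxx.scope/system.slice",
    "/machine.slice/machine-yyy.scope",
]
own_scope = "/machine.slice/machine-xxx.scope"

before = siblings_to_overmount(snapshot, own_scope)
# A container started later adds a sibling the earlier overmounts do not cover.
after = siblings_to_overmount(snapshot + ["/machine.slice/machine-zzz.scope"], own_scope)

print(before)
print(after)
```

The list has to be recomputed, and the mounts redone, every time a cgroup appears or disappears on the host, which is the scalability and compatibility problem raised in the quoted reply.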
Re: [systemd-devel] [Q] About supporting nested systemd daemon
On Wed, 25.02.15 00:05, Cyrill Gorcunov (gorcu...@gmail.com) wrote:

> Hi all! I would really appreciate it if someone could enlighten me
> whether there is some simple solution for the problem we met in
> OpenVZ: modern containers are mostly systemd-based, so once a
> container is started up the systemd daemon mounts its own instance of
> the systemd cgroup (if it has not previously been pre-mounted by
> container startup tools or whatever).
>
> To make a strict isolation of the nested systemd cgroup (by nested I
> mean a systemd cgroup instance mounted inside the container) we've
> patched the kernel so that the container's systemd obtains its own
> instance of the cgroup, which does not intersect in any way with the
> one present on the host system.
>
> And we would really love to get rid of this kind of kernel hack and
> be able to isolate a nested systemd with its own cgroup instance
> using solely userspace tools. Is there some way to achieve this?

Not really. cgroupfs doesn't really allow that. First of all the root
cgroup has a different set of attributes than child cgroups, hence you
cannot mount an arbitrary child to the root cgroup and assume it
works. But even worse, /proc/$PID/cgroup actually contains the full
cgroup path, and hence mounting only a subtree would break the
references from that file.

systemd-nspawn nowadays mounts all hierarchies into the container, but
mounts all controller hierarchies read-only, and of the name=systemd
hierarchy mounts everything read-only, except the subtree the
container is allowed to manage. That way only the cgroup tree the
container needs access to is writable to it.

That solution however does not hide the cgroup tree. A process running
inside the container can still go and explore the tree and its
attributes. However, all other groups will appear empty to it, since
processes not in the container's PID namespace will be suppressed when
reading the member process list.

There have been proposals on LKML to add cgroup namespacing, but no
idea where that went.

LXC created a FUSE emulation of /proc and /sys, called lxcfs, to solve
this problem. Quite honestly I find this a pretty crazy idea however.

> If I understand correctly we can provide a separate slice to the
> container's systemd, leaving the rest of the host cgroup in ro mode,
> right?

Yes.

> If so, maybe there is a way to hide the host cgroup completely from
> the container so it would see only its own cgroup in sysfs?

I don't see how this could work. I mean, you could overmount all other
cgroup siblings with empty directories in the containers, but that is
not really scalable nor compatible with cgroups being added or removed
later on...

Lennart

--
Lennart Poettering, Red Hat
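The point about /proc/$PID/cgroup can be illustrated with a short sketch. Each line of that file has the form hierarchy-ID:controller-list:cgroup-path, with the path given relative to the root of the hierarchy. The sample content, the mount point and the set of visible directories below are made-up values for illustration, and `parse_proc_cgroup` and `lookup_path` are hypothetical helper names.

```python
from typing import Dict

# Made-up example of what /proc/$PID/cgroup might contain for a service
# running inside the container (cgroup v1 line format).
SAMPLE = """\
4:cpu,cpuacct:/machine.slice/machine-xxx.scope
1:name=systemd:/machine.slice/machine-xxx.scope/system.slice/foo.service
"""


def parse_proc_cgroup(text: str) -> Dict[str, str]:
    """Map each controller list to the full cgroup path of the process."""
    paths = {}
    for line in text.splitlines():
        if not line:
            continue
        # Split on the first two colons only, so the path stays intact.
        _hierarchy_id, controllers, path = line.split(":", 2)
        paths[controllers] = path
    return paths


def lookup_path(mount_point: str, cgroup_path: str) -> str:
    """Where a reader of /proc/$PID/cgroup expects to find the cgroup."""
    return mount_point.rstrip("/") + cgroup_path


paths = parse_proc_cgroup(SAMPLE)
expected = lookup_path("/sys/fs/cgroup/systemd", paths["name=systemd"])

# Assumption: only the container's subtree is mounted at the hierarchy
# mount point, so these are the directories that would actually exist.
visible_if_subtree_mounted = {
    "/sys/fs/cgroup/systemd",
    "/sys/fs/cgroup/systemd/system.slice",
    "/sys/fs/cgroup/systemd/system.slice/foo.service",
}

print(expected)
print(expected in visible_if_subtree_mounted)
```

With the full hierarchy mounted, the expected path exists. With only the subtree mounted, the path from /proc/$PID/cgroup still carries the machine.slice/machine-xxx.scope prefix and no longer matches anything under the mount point, which is the broken reference described above.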
Re: [systemd-devel] [Q] About supporting nested systemd daemon
On Wed, Feb 25, 2015 at 06:48:20PM +0100, Lennart Poettering wrote:
> ...
> There have been proposals on LKML to add cgroup namespacing, but no
> idea where that went.

As far as I know they are still being discussed.

Thanks a lot for the reply, Lennart! I need to figure out if we can
use this nspawn facility.

Cyrill