Re: [systemd-devel] [Q] About supporting nested systemd daemon

2015-05-03 Thread Lennart Poettering
On Thu, 30.04.15 15:42, Alban Crequy (al...@endocode.com) wrote:

  systemd-nspawn nowadays mounts all hierarchies into the container, but
  mounts all controller hierarchies read-only, and of the name=systemd
  hierarchy mounts everything read-only, except the subtree the
  container is allowed to manage. That way only the cgroup tree the
  container needs access to is writable to it. That solution however
  does not hide the cgroup tree. A process running inside the container
  can still go and explore the tree and its attributes. However, all
  other groups will appear empty to it, since processes not in the
  container PID namespaces will be suppressed when reading the member
  process list.
 
 To sum up what systemd-nspawn is currently mounting in the container:
 - /sys/fs/cgroup/systemd/  --  mounted RO
 - /sys/fs/cgroup/systemd/machine.slice/machine-xxx.scope/  -- mounted RW
 - /sys/fs/cgroup/cpu,cpuacct/  --  mounted RO
 - etc. for other cgroup hierarchies  --  mounted RO

Correct.

 In order to let systemd in the container restrict cpu, memory, etc. on
 some of its services (see manpage systemd.resource-control(5)), rkt
 would like systemd-nspawn to mount a subtree of some hierarchy
 (cpu,cpuacct, memory) in read-write mode.

That's really not a safe thing to do right now... the kernel isn't
ready for this, as cgroups access is an all-or-nothing thing
currently: if you have access to a cgroup and can create child cgroups
in it you have access to *all* attributes you like, the dangerous ones
as well as the not so dangerous ones.

 Are there any issues with changing the systemd-nspawn mounts in the
 following way:
 - /sys/fs/cgroup/systemd/  --  mounted RO
 - /sys/fs/cgroup/systemd/machine.slice/machine-xxx.scope/  -- mounted RW
 - /sys/fs/cgroup/cpu,cpuacct/  --  mounted RO
 - /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-xxx.scope/  -- mounted RW
 - etc. for other cgroup hierarchies.
 
 Iago wrote two experimental patches on systemd-nspawn to try that and
 it worked. Delegate=yes was enabled in systemd-nspawn in order to test
 this:
 https://github.com/endocode/systemd/commits/iaguis/delegate
 
 But I would like to know what is missing to make this safe (or if it
 is already safe to do).

Well, nspawn actually does not make any guarantees about security
currently. Since we pass CAP_SYS_ADMIN by default to the containers,
people can mount whatever they want and remount things freely from
within. Hence, opening this up would not make things much worse.

That said, I am a bit concerned about opening this up by default. Even
though containers are insecure we should try to be safe wherever we
can if it doesn't affect usability too much. 

Adding a new cmdline switch for all of this does not sound too
attractive, but maybe a --delegate switch would be OK, which would open
up all controllers to the container. It would then have a similar effect
on containers as Delegate=yes has for service processes...

Lennart

-- 
Lennart Poettering, Red Hat
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] [Q] About supporting nested systemd daemon

2015-04-30 Thread Alban Crequy
On Wed, Feb 25, 2015 at 6:48 PM, Lennart Poettering
lenn...@poettering.net wrote:
 On Wed, 25.02.15 00:05, Cyrill Gorcunov (gorcu...@gmail.com) wrote:

 Hi all! I would really appreciate it if someone could enlighten me on
 whether there is some simple solution for the problem we met in OpenVZ:
 modern containers are mostly systemd based, so once a container is
 started up the systemd daemon mounts its own instance of the systemd
 cgroup (if it has not been pre-mounted by container startup tools or
 whatever). To strictly isolate the nested systemd cgroup (by nested I
 mean the systemd cgroup instance mounted inside the container) we've
 patched the kernel so that the container's systemd obtains its own
 instance of the cgroup, which does not intersect in any way with the
 one present on the host system.

 And we would really love to get rid of this kind of kernel hack but
 still be able to isolate the nested systemd with its own cgroup
 instance using solely userspace tools. Is there some way to achieve
 this?

 Not really. cgroupfs doesn't really allow that. First of all the root
 cgroup has a different set of attributes than child cgroups, hence you
 cannot mount an arbitrary child to the root cgroup and assume it
 works. But even worse, /proc/$PID/cgroup actually contains the full
 cgroup path, and hence mounting only a subtree would break the
 references from that file.

 systemd-nspawn nowadays mounts all hierarchies into the container, but
 mounts all controller hierarchies read-only, and of the name=systemd
 hierarchy mounts everything read-only, except the subtree the
 container is allowed to manage. That way only the cgroup tree the
 container needs access to is writable to it. That solution however
 does not hide the cgroup tree. A process running inside the container
 can still go and explore the tree and its attributes. However, all
 other groups will appear empty to it, since processes not in the
 container PID namespaces will be suppressed when reading the member
 process list.

To sum up what systemd-nspawn is currently mounting in the container:
- /sys/fs/cgroup/systemd/  --  mounted RO
- /sys/fs/cgroup/systemd/machine.slice/machine-xxx.scope/  -- mounted RW
- /sys/fs/cgroup/cpu,cpuacct/  --  mounted RO
- etc. for other cgroup hierarchies  --  mounted RO

In order to let systemd in the container restrict cpu, memory, etc. on
some of its services (see manpage systemd.resource-control(5)), rkt
would like systemd-nspawn to mount a subtree of some hierarchy
(cpu,cpuacct, memory) in read-write mode.

Are there any issues with changing the systemd-nspawn mounts in the
following way:
- /sys/fs/cgroup/systemd/  --  mounted RO
- /sys/fs/cgroup/systemd/machine.slice/machine-xxx.scope/  -- mounted RW
- /sys/fs/cgroup/cpu,cpuacct/  --  mounted RO
- /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-xxx.scope/  -- mounted RW
- etc. for other cgroup hierarchies.

Iago wrote two experimental patches on systemd-nspawn to try that and
it worked. Delegate=yes was enabled in systemd-nspawn in order to test
this:
https://github.com/endocode/systemd/commits/iaguis/delegate

But I would like to know what is missing to make this safe (or if it
is already safe to do).

 There have been proposals on LKML to add cgroup namespaces, but I have
 no idea where that went.

 LXC created a FUSE emulation of /proc and /sys, called lxcfs to solve
 this problem. Quite honestly I find this a pretty crazy idea however.

 If I understand correctly we can provide separate slice to container's
 systemd leaving the rest of host cgroup in ro mode, right?

 Yes.

 If so, is there maybe a way to hide the host cgroup completely from
 the container so it would see only its own cgroup in sysfs?

 I don't see how this could work. I mean, you could overmount all other
 cgroup siblings with empty directories in the container, but that is not
 really scalable, nor compatible with cgroups being added or removed
 later on...

 Lennart

 --
 Lennart Poettering, Red Hat


Re: [systemd-devel] [Q] About supporting nested systemd daemon

2015-02-25 Thread Lennart Poettering
On Wed, 25.02.15 00:05, Cyrill Gorcunov (gorcu...@gmail.com) wrote:

 Hi all! I would really appreciate it if someone could enlighten me on
 whether there is some simple solution for the problem we met in OpenVZ:
 modern containers are mostly systemd based, so once a container is
 started up the systemd daemon mounts its own instance of the systemd
 cgroup (if it has not been pre-mounted by container startup tools or
 whatever). To strictly isolate the nested systemd cgroup (by nested I
 mean the systemd cgroup instance mounted inside the container) we've
 patched the kernel so that the container's systemd obtains its own
 instance of the cgroup, which does not intersect in any way with the
 one present on the host system.

 And we would really love to get rid of this kind of kernel hack but
 still be able to isolate the nested systemd with its own cgroup
 instance using solely userspace tools. Is there some way to achieve
 this?

Not really. cgroupfs doesn't really allow that. First of all the root
cgroup has a different set of attributes than child cgroups, hence you
cannot mount an arbitrary child to the root cgroup and assume it
works. But even worse, /proc/$PID/cgroup actually contains the full
cgroup path, and hence mounting only a subtree would break the
references from that file.
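
To illustrate the /proc/$PID/cgroup point, here is a minimal Python
sketch. It assumes the cgroup v1 line layout
"hierarchy-ID:controller-list:cgroup-path". The sample contents, the
hierarchy IDs and the system.slice/foo.service path are made up for
illustration, and the helper names are hypothetical.

```python
import posixpath


def parse_proc_cgroup(text):
    """Parse the contents of a /proc/$PID/cgroup file (cgroup v1 layout).

    Each line has the form "hierarchy-ID:controller-list:cgroup-path".
    Returns a dict mapping the controller list to the cgroup path.
    """
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split at most twice, so any ':' inside the path is preserved.
        _hierarchy_id, controllers, path = line.split(":", 2)
        result[controllers] = path
    return result


def cgroupfs_dir(mount_point, cgroup_path):
    """Directory a reader expects for cgroup_path below mount_point."""
    return posixpath.join(mount_point, cgroup_path.lstrip("/"))


# Hypothetical file contents for a process inside the container.
SAMPLE = (
    "4:cpu,cpuacct:/machine.slice/machine-xxx.scope\n"
    "1:name=systemd:/machine.slice/machine-xxx.scope/system.slice/foo.service\n"
)

paths = parse_proc_cgroup(SAMPLE)
full = cgroupfs_dir("/sys/fs/cgroup/systemd", paths["name=systemd"])
print(full)
```

The path in the file is always relative to the root of the hierarchy.
If only the machine-xxx.scope subtree were mounted at
/sys/fs/cgroup/systemd, the directory computed above would not exist
under that mount, because the machine.slice/machine-xxx.scope prefix
would be missing from the mounted tree.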

systemd-nspawn nowadays mounts all hierarchies into the container, but
mounts all controller hierarchies read-only, and of the name=systemd
hierarchy mounts everything read-only, except the subtree the
container is allowed to manage. That way only the cgroup tree the
container needs access to is writable to it. That solution however
does not hide the cgroup tree. A process running inside the container
can still go and explore the tree and its attributes. However, all
other groups will appear empty to it, since processes not in the
container PID namespaces will be suppressed when reading the member
process list.
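
One way to check that layout from inside a container is to look at
/proc/self/mountinfo. The following Python sketch only parses text in
the mountinfo format. The sample lines are an assumption of what such a
layout might look like (mount IDs, device numbers and option lists are
made up), not output captured from nspawn.

```python
def cgroup_mounts(mountinfo_text):
    """Map each cgroup (v1) mount point to "ro" or "rw".

    Expects text in the /proc/self/mountinfo format, where " - "
    separates the per-mount fields from the filesystem type, the mount
    source and the superblock options.
    """
    result = {}
    for line in mountinfo_text.splitlines():
        if " - " not in line:
            continue
        left, right = line.split(" - ", 1)
        fields = left.split()
        fstype = right.split()[0]
        if fstype != "cgroup":
            continue
        mount_point = fields[4]         # fifth field: mount point
        options = fields[5].split(",")  # sixth field: per-mount options
        result[mount_point] = "ro" if "ro" in options else "rw"
    return result


# Hypothetical mountinfo excerpt mirroring the layout described above.
SAMPLE = """\
100 90 0:20 / /sys/fs/cgroup/systemd ro,nosuid,nodev,noexec - cgroup cgroup rw,name=systemd
101 100 0:20 /machine.slice/machine-xxx.scope /sys/fs/cgroup/systemd/machine.slice/machine-xxx.scope rw,nosuid,nodev,noexec - cgroup cgroup rw,name=systemd
102 90 0:21 / /sys/fs/cgroup/cpu,cpuacct ro,nosuid,nodev,noexec - cgroup cgroup rw,cpu,cpuacct
"""

for mount_point, mode in sorted(cgroup_mounts(SAMPLE).items()):
    print(mode, mount_point)
```

On a real system the same function could be fed the contents of
/proc/self/mountinfo. Note that it reads the per-mount options, not the
superblock options after the separator, since it is the per-mount
option that can make a bind mount read-only.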

There have been proposals on LKML to add cgroup namespaces, but I have
no idea where that went.

LXC created a FUSE emulation of /proc and /sys, called lxcfs to solve
this problem. Quite honestly I find this a pretty crazy idea however.

 If I understand correctly we can provide separate slice to container's
 systemd leaving the rest of host cgroup in ro mode, right?

Yes.

 If so, is there maybe a way to hide the host cgroup completely from
 the container so it would see only its own cgroup in sysfs?

I don't see how this could work. I mean, you could overmount all other
cgroup siblings with empty directories in the container, but that is not
really scalable, nor compatible with cgroups being added or removed
later on...
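
To make the scalability problem concrete, here is a rough Python
sketch that computes which directories would have to be overmounted
for one container. It works on a plain list of cgroup paths. The
function name and the sample tree are hypothetical, and this is not
something nspawn does.

```python
import posixpath


def siblings_to_hide(all_cgroups, own_path):
    """Return the cgroup directories that would need an empty overmount.

    A directory is hidden when its parent lies on the path from the
    root to own_path and it is neither on that path nor inside the
    container's own subtree. Deeper directories are covered by the
    overmount on their hidden ancestor.
    """
    own = posixpath.normpath(own_path)
    # The root, all ancestors of the container's cgroup, and that cgroup.
    keep = set()
    p = own
    while True:
        keep.add(p)
        if p == "/":
            break
        p = posixpath.dirname(p)
    hide = set()
    for cg in all_cgroups:
        cg = posixpath.normpath(cg)
        if cg in keep or cg.startswith(own.rstrip("/") + "/"):
            continue
        if posixpath.dirname(cg) in keep:
            hide.add(cg)
    return sorted(hide)


# Hypothetical snapshot of the host's cgroup tree.
TREE = [
    "/system.slice",
    "/system.slice/foo.service",
    "/user.slice",
    "/machine.slice",
    "/machine.slice/machine-xxx.scope",
    "/machine.slice/machine-yyy.scope",
]

OWN = "/machine.slice/machine-xxx.scope"
print(siblings_to_hide(TREE, OWN))
# A cgroup created later changes the answer, so the overmounts would
# have to be kept in sync for as long as the container runs.
print(siblings_to_hide(TREE + ["/machine.slice/machine-zzz.scope"], OWN))
```

The result is only valid for the snapshot it was computed from, which
is exactly why this approach does not cope with cgroups being added or
removed later on.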

Lennart

-- 
Lennart Poettering, Red Hat


Re: [systemd-devel] [Q] About supporting nested systemd daemon

2015-02-25 Thread Cyrill Gorcunov
On Wed, Feb 25, 2015 at 06:48:20PM +0100, Lennart Poettering wrote:
...
 
 There have been proposals on LKML to add cgroup namespaces, but I have
 no idea where that went.

As far as I know they are still being discussed. Thanks a lot for the
reply, Lennart! We need to figure out if we can use this nspawn facility.

Cyrill