Re: [systemd-devel] Unable to run systemd in an LXC / cgroup container.

2012-11-09 Thread Lennart Poettering
On Tue, 06.11.12 11:07, Michael H. Warfield (m...@wittsend.com) wrote:

 Here's where we've seen some problems in the past.  It's not just mounts
 that are propagated but remounts as well.  The problem arose that some
 of us had our containers on a separate partition.  When we would shut a
 container down, that container tried to remount its file systems ro,
 which then propagated back into the host, causing the host's file system
 to be ro (doesn't happen if you are running on the host's root fs for
 the containers), and from there across into the other containers.
 
 Are you using MS_SHARED or MS_SLAVE for this?  If you are using
 MS_SHARED, you create a potential security problem where actions in
 the container can bleed into the state of the host and into other
 containers.  That's highly undesirable.

The root namespace is MS_SHARED, and nspawn and libvirt-lxc containers
are MS_SLAVE. That ensures mounts from the host propagate to the
containers but not vice versa.
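
In code, that per-container setup boils down to something like the
following sketch (assumes the manager has just forked and unshares its
mount namespace; error handling reduced to return codes):

#define _GNU_SOURCE
#include <sched.h>
#include <sys/mount.h>

/* Sketch: give the container a slave view of the host's mount tree.
 * Host mounts keep propagating in; container mounts do not propagate
 * back out. */
static int make_mounts_slave(void) {
        if (unshare(CLONE_NEWNS) < 0)
                return -1;

        /* NULL source/fstype: this call only changes the propagation
         * mode, recursively from / down. */
        if (mount(NULL, "/", NULL, MS_SLAVE | MS_REC, NULL) < 0)
                return -1;

        return 0;
}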

   as soon as it tries to use pivot_root(), as that is incompatible
   with shared mount propagation.  This needs fixing in LXC: it should
   use MS_MOVE or MS_BIND to place the new root dir in / instead.  A
   short term work-around is to simply remount the root tree to private
   before invoking LXC.
 
  Actually not quite sure how this would work.  It should be possible
  to set up a set of conditions to work around this, but the kernel
  checks at do_pivotroot are pretty harsh - mnt->mnt_parent of both
  the new root and current root have to be not shared.  So perhaps
  we actually first chroot into a dir whose parent is non-shared,
  then pivot_root from there?  :)
  
  (Simple chroot in place of pivot_root still does not suffice, not
  only because of chroot escapes, but also because of different results
  in /proc/pid/mountinfo and friends)
 
 Comments on Serge's points?

Don't use pivot_root. Instead use MS_MOVE to move the container root to
/.

 At this point, we see where this will become problematical in Fedora 18,
 but it appears to already be problematical in NixOS, which another user
 is running and which contains systemd 195 in the host.

There's nothing really problematical with this. LXC should stop using
pivot_root, and use MS_MOVE instead.

 We've had problems with chroot in the past due to chroot escapes and
 other problems years ago as Serge mentioned.

chroot() is not useful for this. You should invoke chroot() once, to
fix up the root directory after adjusting the namespace, but that's not
the call that actually shifts the namespace around. That should be done
with MS_MOVE.

The code should look like this:

http://cgit.freedesktop.org/systemd/systemd/tree/src/nspawn/nspawn.c#n1264
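
Condensed, the sequence boils down to something like this (a sketch of
the approach in the linked file, not a copy of it; error handling
omitted):

#include <sys/mount.h>
#include <unistd.h>

/* Sketch: enter a container root with MS_MOVE instead of
 * pivot_root().  "directory" is the prepared container root. */
static int switch_root(const char *directory) {
        /* Turn the container root into a mount point of its own. */
        if (mount(directory, directory, NULL, MS_BIND | MS_REC, NULL) < 0)
                return -1;

        /* Enter it, then move it over /. */
        if (chdir(directory) < 0)
                return -1;
        if (mount(directory, "/", NULL, MS_MOVE, NULL) < 0)
                return -1;

        /* The single chroot() mentioned above, plus a cwd reset. */
        if (chroot(".") < 0)
                return -1;
        return chdir("/");
}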

Lennart

-- 
Lennart Poettering - Red Hat, Inc.


Re: [systemd-devel] Unable to run systemd in an LXC / cgroup container.

2012-11-06 Thread Michael H. Warfield
On Mon, 2012-10-22 at 16:11 +0200, Lennart Poettering wrote:

 Note that there are reports that LXC has issues with the fact that newer
 systemd enables shared mount propagation for all mounts by default (this
 should actually be beneficial for containers as this ensures that new
 mounts appear in the containers). LXC when run on such a system fails as
 soon as it tries to use pivot_root(), as that is incompatible with
 shared mount propagation. This needs fixing in LXC: it should use MS_MOVE
 or MS_BIND to place the new root dir in / instead. A short term
 work-around is to simply remount the root tree to private before
 invoking LXC.

In another thread, Serge had some heartburn over this shared mount
propagation which then rang a bell in my head about past problems we
have seen.

 On Mon, 2012-11-05 at 08:51 -0600, Serge Hallyn wrote: 
  Quoting Michael H. Warfield (m...@wittsend.com):
  ...
  This was from another thread with the systemd guys.
  
  On Mon, 2012-10-22 at 16:11 +0200, Lennart Poettering wrote:
   Note that there are reports that LXC has issues with the fact that
   newer systemd enables shared mount propagation for all mounts by
   default (this should actually be beneficial for containers as this
   ensures that new mounts appear in the containers).  LXC when run on
   such a system fails
 
 MS_SLAVE does this as well.  MS_SHARED means container mounts also
 propagate into the host, which is less desirable in most cases.

Here's where we've seen some problems in the past.  It's not just mounts
that are propagated but remounts as well.  The problem arose that some
of us had our containers on a separate partition.  When we would shut a
container down, that container tried to remount its file systems ro,
which then propagated back into the host, causing the host's file system
to be ro (doesn't happen if you are running on the host's root fs for
the containers), and from there across into the other containers.

Are you using MS_SHARED or MS_SLAVE for this?  If you are using
MS_SHARED, you create a potential security problem where actions in
the container can bleed into the state of the host and into other
containers.  That's highly undesirable.  If a mount in a container
propagates back into the host and is then reflected into another
container sharing that same mount tree (I have shared partitions
specific to that sort of thing), does that create an information
disclosure situation where one container mounts a new file system and
the other container sees the new mount?  I don't know if the mount
propagation would reflect back up the shared tree or not, but I have
certainly seen remounts do this.  I don't see that as desirable.
Maybe I'm misunderstanding how this is supposed to work, but I intend
to test out those scenarios when I have a chance.  I do know that,
when testing that ro problem, I was able to remount a partition ro in
one container and it would switch in the host and the other container,
and I could then remount it rw in the other container and have it
propagate back.  Not good.

Can you offer any clarity on this?

   as soon as it tries to use pivot_root(), as that is incompatible
   with shared mount propagation.  This needs fixing in LXC: it should
   use MS_MOVE or MS_BIND to place the new root dir in / instead.  A
   short term work-around is to simply remount the root tree to private
   before invoking LXC.

 Actually not quite sure how this would work.  It should be possible
 to set up a set of conditions to work around this, but the kernel
 checks at do_pivotroot are pretty harsh - mnt->mnt_parent of both
 the new root and current root have to be not shared.  So perhaps
 we actually first chroot into a dir whose parent is non-shared,
 then pivot_root from there?  :)
 
 (Simple chroot in place of pivot_root still does not suffice, not
 only because of chroot escapes, but also because of different results
 in /proc/pid/mountinfo and friends)

Comments on Serge's points?

At this point, we see where this will become problematical in Fedora 18,
but it appears to already be problematical in NixOS, which another user
is running and which contains systemd 195 in the host.

We've had problems with chroot in the past due to chroot escapes and
other problems years ago as Serge mentioned.

 Lennart

 -- 
 Lennart Poettering - Red Hat, Inc.

Regards,
Mike
-- 
Michael H. Warfield (AI4NB) | (770) 985-6132 |  m...@wittsend.com
   /\/\|=mhw=|\/\/  | (678) 463-0932 |  http://www.wittsend.com/mhw/
   NIC whois: MHW9  | An optimist believes we live in the best of all
 PGP Key: 0x674627FF| possible worlds.  A pessimist is sure of it!




Re: [systemd-devel] Unable to run systemd in an LXC / cgroup container.

2012-10-27 Thread Colin Guthrie
'Twas brillig, and Michael H. Warfield at 26/10/12 18:18 did gyre and
gimble:
 What the hell is this?  /var/run is symlinked to /run and is mounted
 with a tmpfs.

Yup, that's how /var/run and /run are being handled these days.

It provides a consistent space to pass info from the initrd over to the
main system and has various other uses also.

If you want to ensure files are created in this folder, just drop a
config file into /usr/lib/tmpfiles.d/ in the package in question. See
man systemd-tmpfiles for more info.
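
For example, a hypothetical one-line drop-in (file name, mode and
ownership here are illustrative; see tmpfiles.d(5) for the full syntax)
that would recreate the /var/run/netreport directory mentioned elsewhere
in this thread:

# /usr/lib/tmpfiles.d/netreport.conf (hypothetical example)
# Type  Path            Mode  UID   GID   Age
d       /run/netreport  0775  root  root  -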

Could be some packages are not fully upgraded to this concept in F17. As
a non-fedora user, I can't really comment on that specifically.

Col


-- 

Colin Guthrie
gmane(at)colin.guthr.ie
http://colin.guthr.ie/

Day Job:
  Tribalogic Limited http://www.tribalogic.net/
Open Source:
  Mageia Contributor http://www.mageia.org/
  PulseAudio Hacker http://www.pulseaudio.org/
  Trac Hacker http://trac.edgewall.org/



Re: [systemd-devel] Unable to run systemd in an LXC / cgroup container.

2012-10-27 Thread Michael H. Warfield
On Sat, 2012-10-27 at 19:44 +0100, Colin Guthrie wrote:
 'Twas brillig, and Michael H. Warfield at 26/10/12 18:18 did gyre and
 gimble:
  What the hell is this?  /var/run is symlinked to /run and is mounted
  with a tmpfs.

 Yup, that's how /var/run and /run are being handled these days.

 It provides a consistent space to pass info from the initrd over to the
 main system and has various other uses also.

Interesting.  I hadn't considered that aspect of it before.  Very
interesting.

 If you want to ensure files are created in this folder, just drop a
 config file into /usr/lib/tmpfiles.d/ in the package in question. See
 man systemd-tmpfiles for more info.

NOW THAT is something else I needed to know about!  Thank you very very
much!  Learned something new.  This whole thing has been a massive
learning experience getting this container kick started.

 Could be some packages are not fully upgraded to this concept in F17. As
 a non-fedora user, I can't really comment on that specifically.

As it turns out, the kernel has had some of our patches applied that I
wasn't aware of vis-a-vis reboot/halt and this should no longer be an
issue.  I'm still struggling with the tmpfs on /dev thing and have run
into a catch-22 with regards to that.  I can mount tmpfs on /dev just
fine and can populate it just fine in a post-mount hook but, then, we're
trying to mount a devpts file system on /dev/pts before we've had a
chance to populate it and it's then crashing on the mount.  Sigh...  I
think that's now going to have to wait for Serge or Daniel to comment
on.
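
For reference, a sketch of the mount ordering that avoids that
catch-22 - populate /dev between the tmpfs mount and the devpts mount.
The mount options are illustrative, and "newinstance" assumes a kernel
with private devpts instances:

#include <sys/mount.h>
#include <sys/stat.h>

/* Sketch: tmpfs on /dev first, then populate, then devpts. */
static int setup_container_dev(void) {
        if (mount("tmpfs", "/dev", "tmpfs", MS_NOSUID, "mode=755") < 0)
                return -1;

        /* ...create the basic device nodes here, before anything
         * tries to mount below /dev... */

        if (mkdir("/dev/pts", 0755) < 0)
                return -1;

        return mount("devpts", "/dev/pts", "devpts",
                     MS_NOEXEC | MS_NOSUID,
                     "newinstance,ptmxmode=0666,mode=620,gid=5");
}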

 Col
 
 
 -- 
 
 Colin Guthrie
 gmane(at)colin.guthr.ie
 http://colin.guthr.ie/
 
 Day Job:
   Tribalogic Limited http://www.tribalogic.net/
 Open Source:
   Mageia Contributor http://www.mageia.org/
   PulseAudio Hacker http://www.pulseaudio.org/
   Trac Hacker http://trac.edgewall.org/

Regards,
Mike
-- 
Michael H. Warfield (AI4NB) | (770) 985-6132 |  m...@wittsend.com
   /\/\|=mhw=|\/\/  | (678) 463-0932 |  http://www.wittsend.com/mhw/
   NIC whois: MHW9  | An optimist believes we live in the best of all
 PGP Key: 0x674627FF| possible worlds.  A pessimist is sure of it!




Re: [systemd-devel] Unable to run systemd in an LXC / cgroup container.

2012-10-26 Thread Michael H. Warfield
On Thu, 2012-10-25 at 23:38 +0200, Lennart Poettering wrote:
 On Thu, 25.10.12 11:59, Michael H. Warfield (m...@wittsend.com) wrote:

   http://wiki.1tux.org/wiki/Lxc/Installation#Additional_notes
  
   Unfortunately, in our case, merely getting a mount in there is a
   complication in that it also has to be populated but, at least, we
   understand the problem set now.
  
  Ok...  Serge and I were corresponding on the lxc-users list and he had a
  suggestion that worked but I consider to be a bit of a sub-optimal
  workaround.  Ironically, it was to mount devtmpfs on /dev.  We don't
  (currently) have a method to auto-populate a tmpfs mount with the needed
  devices and this provided it.  It does have a problem that makes me
  uncomfortable in that the container now has visibility into the
  host's /dev system.  I'm a security expert and I'm not comfortable with
  that solution even with the controls we have.  We can control access
  but still, not happy with that.

 That's a pretty bad idea, access control to the device nodes in devtmpfs
 is controlled by the host's udev instance. That means if your group/user
 lists in the container and the host differ you already lost. Also access
 control in udev is dynamic, due to stuff like uaccess and similar. You
 really don't want to have that in the container, i.e. where devices
 change ownership all the time with UIDs/GIDs that make no sense at all
 in the container.

Concur.

 In general I think it's a good idea not to expose any real devices to
 the container, but only the virtual ones that are programming
 APIs. That means: no /dev/sda, or /dev/ttyS0, but /dev/null, /dev/zero,
 /dev/random, /dev/urandom. And creating the latter in a tmpfs is quite
 simple.

  If I run lxc-console (which attaches to one of the vtys) it gives me
  nothing.  Under sysvinit and upstart I get vty login prompts because
  they have started getty on those vtys.  This is important in case
  network access has not started for one reason or another and the
  container was started detached in the background.

 The getty behaviour of systemd in containers is documented here:

 http://www.freedesktop.org/wiki/Software/systemd/ContainerInterface

Sorry.  This is unacceptable.  We need some way that these will be
active, so that you will be consistent with other containers.

 If LXC mounts ptys on top of the VT devices that's a really bad idea
 too, since /dev/tty1 and friends expose a number of APIs beyond the mere
 tty device that you cannot emulate with that. It includes files in /sys,
 as well as /dev/vcs and /dev/vcsa, various ioctls, and so on. Heck, even
 the most superficial of things, the $TERM variable will be
 incorrect. LXC shouldn't do that.

REGARDLESS.  I'm in this situation now testing what I thought was a hang
condition (which is proving to be something else).  I started a
container detached redirecting the console to a file (a parameter I was
missing) and the log to another file (which I had been doing).  But, for
some reason, sshd is not starting up.  I have no way to attach to the
bloody console of the container, I have no gettys on a vty I can
attach to using lxc-console, and I can't remotely access a container
which, for all other intents and purposes, appears to be running fine.
Parameterize this bloody thing so we can have control over it.

 LXC really shouldn't pretend a pty was a VT tty, it's not. From the
 libvirt guys it has been proposed that we introduce a new env var to
 pass to PID 1 of the container, that simply lists ptys to start gettys
 on. That way we don't pretend anything about ttys that the ttys can't
 hold and have a clean setup.
 
  I SUSPECT the hang condition is something to do with systemd trying to
  start an interactive console on /dev/console, which sysvinit and
  upstart do not do.
 
 Yes, this is documented, please see the link I already posted, and which
 I linked above a second time.
 
  I've got some more problems relating to shutting down containers, some
  of which may be related to mounting tmpfs on /run, to which /var/run is
  symlinked.  We're doing halt / restart detection by monitoring utmp
  in that directory but it looks like utmp isn't even in that directory
  anymore and mounting tmpfs on it was always problematical.  We may have
  to have a more generic method to detect when a container has shut down
  or is restarting in that case.
 
 I can't parse this. The system call reboot() is virtualized for
 containers just fine and the container manager (i.e. LXC) can check for
 that easily.
 
 Lennart
 
 -- 
 Lennart Poettering - Red Hat, Inc.
 

-- 
Michael H. Warfield (AI4NB) | (770) 985-6132 |  m...@wittsend.com
   /\/\|=mhw=|\/\/  | (678) 463-0932 |  http://www.wittsend.com/mhw/
   NIC whois: MHW9  | An optimist believes we live in the best of all
 PGP Key: 0x674627FF| possible worlds.  A pessimist is sure of it!



Re: [systemd-devel] Unable to run systemd in an LXC / cgroup container.

2012-10-26 Thread Michael H. Warfield
On Thu, 2012-10-25 at 23:38 +0200, Lennart Poettering wrote:
 On Thu, 25.10.12 11:59, Michael H. Warfield (m...@wittsend.com) wrote:

  I've got some more problems relating to shutting down containers, some
  of which may be related to mounting tmpfs on /run, to which /var/run is
  symlinked.  We're doing halt / restart detection by monitoring utmp
  in that directory but it looks like utmp isn't even in that directory
  anymore and mounting tmpfs on it was always problematical.  We may have
  to have a more generic method to detect when a container has shut down
  or is restarting in that case.

 I can't parse this. The system call reboot() is virtualized for
 containers just fine and the container manager (i.e. LXC) can check for
 that easily.

I strongly suspect that the condition I'm dealing with (not being able
to restart the container) is an artifact of the devtmpfs kludge.  I'm
seeing some errors relating to /dev/loop* busy that seem to be related
to the hung resources resulting in the inability to remove the zombie
container.  Disregard until I can get further information following a
switch to a template based setup.

 Lennart

Regards,
Mike
-- 
Michael H. Warfield (AI4NB) | (770) 985-6132 |  m...@wittsend.com
   /\/\|=mhw=|\/\/  | (678) 463-0932 |  http://www.wittsend.com/mhw/
   NIC whois: MHW9  | An optimist believes we live in the best of all
 PGP Key: 0x674627FF| possible worlds.  A pessimist is sure of it!




Re: [systemd-devel] Unable to run systemd in an LXC / cgroup container.

2012-10-26 Thread Michael H. Warfield
On Fri, 2012-10-26 at 11:58 -0400, Michael H. Warfield wrote:
 On Thu, 2012-10-25 at 23:38 +0200, Lennart Poettering wrote:
  On Thu, 25.10.12 11:59, Michael H. Warfield (m...@wittsend.com) wrote:
 
http://wiki.1tux.org/wiki/Lxc/Installation#Additional_notes
   
Unfortunately, in our case, merely getting a mount in there is a
complication in that it also has to be populated but, at least, we
understand the problem set now.
   
   Ok...  Serge and I were corresponding on the lxc-users list and he had a
   suggestion that worked but I consider to be a bit of a sub-optimal
   workaround.  Ironically, it was to mount devtmpfs on /dev.  We don't
   (currently) have a method to auto-populate a tmpfs mount with the needed
   devices and this provided it.  It does have a problem that makes me
   uncomfortable in that the container now has visibility into the
   host's /dev system.  I'm a security expert and I'm not comfortable with
   that solution even with the controls we have.  We can control access
   but still, not happy with that.
 
  That's a pretty bad idea, access control to the device nodes in devtmpfs
  is controlled by the host's udev instance. That means if your group/user
  lists in the container and the host differ you already lost. Also access
  control in udev is dynamic, due to stuff like uaccess and similar. You
  really don't want to have that in the container, i.e. where devices
  change ownership all the time with UIDs/GIDs that make no sense at all
  in the container.
 
 Concur.
 
  In general I think it's a good idea not to expose any real devices to
  the container, but only the virtual ones that are programming
  APIs. That means: no /dev/sda, or /dev/ttyS0, but /dev/null, /dev/zero,
  /dev/random, /dev/urandom. And creating the latter in a tmpfs is quite
  simple.
 
   If I run lxc-console (which attaches to one of the vtys) it gives me
   nothing.  Under sysvinit and upstart I get vty login prompts because
   they have started getty on those vtys.  This is important in case
   network access has not started for one reason or another and the
   container was started detached in the background.
 
  The getty behaviour of systemd in containers is documented here:
 
  http://www.freedesktop.org/wiki/Software/systemd/ContainerInterface
 
 Sorry.  This is unacceptable.  We need some way that these will be
 active, so that you will be consistent with other containers.
 
  If LXC mounts ptys on top of the VT devices that's a really bad idea
  too, since /dev/tty1 and friends expose a number of APIs beyond the mere
  tty device that you cannot emulate with that. It includes files in /sys,
  as well as /dev/vcs and /dev/vcsa, various ioctls, and so on. Heck, even
  the most superficial of things, the $TERM variable will be
  incorrect. LXC shouldn't do that.

 REGARDLESS.  I'm in this situation now testing what I thought was a hang
 condition (which is proving to be something else).  I started a
 container detached redirecting the console to a file (a parameter I was
 missing) and the log to another file (which I had been doing).  But, for
 some reason, sshd is not starting up.  I have no way to attach to the
 bloody console of the container, I have no gettys on a vty I can
 attach to using lxc-console, and I can't remotely access a container
 which, for all other intents and purposes, appears to be running fine.
 Parameterize this bloody thing so we can have control over it.

Here's another weirdism that's in your camp...

The reason that sshd did not start was because the network did not start
(IPv6 was up but IPv4 was not and the startup of several services failed
as a consequence).  Trying to restart the network manually resulted in
this:

[root@alcove mhw]# ifdown eth0
./network-functions: line 237: cd: /var/run/netreport: No such file or directory
[root@alcove mhw]# ifup eth0
./network-functions: line 237: cd: /var/run/netreport: No such file or directory
[root@alcove mhw]# ls /var/run/
dbus  messagebus.pid  rpcbind.sock  systemd  user
log   mount   syslogd.pid   udev

What the hell is this?  /var/run is symlinked to /run and is mounted
with a tmpfs.

So I created that directory and could ifup the network and start
sshd.  So I did a little check on the run levels...  Hmmm...  F17
container (Alcove) in an F17 host (Forest).  WHAT is going ON here?  Is
this why the network didn't start?

[root@forest mhw]# runlevel
N 5

[root@alcove mhw]# runlevel
unknown

[root@alcove mhw]# chkconfig 

Note: This output shows SysV services only and does not include native
  systemd services. SysV configuration data might be overridden by native
  systemd configuration.

modules_dep 0:off   1:off   2:on3:on4:on5:on6:off
netconsole  0:off   1:off   2:off   3:off   4:off   5:off   6:off
network 0:off   1:off   2:off   3:on4:off   5:off   6:off


Mike
-- 
Michael H. Warfield (AI4NB) | (770) 985-6132 |  m...@wittsend.com
   /\/\|=mhw=|\/\/

Re: [systemd-devel] Unable to run systemd in an LXC / cgroup container.

2012-10-26 Thread Michael H. Warfield
On Fri, 2012-10-26 at 12:11 -0400, Michael H. Warfield wrote:
 On Thu, 2012-10-25 at 23:38 +0200, Lennart Poettering wrote:
  On Thu, 25.10.12 11:59, Michael H. Warfield (m...@wittsend.com) wrote:
 
   I SUSPECT the hang condition is something to do with systemd trying to
   start an interactive console on /dev/console, which sysvinit and
   upstart do not do.
 
  Yes, this is documented, please see the link I already posted, and which
  I linked above a second time.

 This may have been my fault.  I was using the -o option to lxc-start
 (output logfile) and failed to specify the -c (console output redirect)
 option.  It seems to fire up nicely (albeit with other problems) with
 that additional option.  Continuing my research.

Confirming.  Using the -c option for the console file works.
Unfortunately, thanks to no gettys on the ttys (so lxc-console does not
work), no way to connect to that console redirect, and the failure of
the network to start, I'm still trying to figure out just what is face
planting in a container I cannot access.  :-/=/  Punching out the punch
list one PUNCH at a time here.

   I've got some more problems relating to shutting down containers, some
   of which may be related to mounting tmpfs on /run, to which /var/run is
   symlinked.  We're doing halt / restart detection by monitoring utmp
   in that directory but it looks like utmp isn't even in that directory
   anymore and mounting tmpfs on it was always problematical.  We may have
   to have a more generic method to detect when a container has shut down
   or is restarting in that case.
 
  I can't parse this. The system call reboot() is virtualized for
  containers just fine and the container manager (i.e. LXC) can check for
  that easily.
 
 Apparently, in recent kernels, we can.  Unfortunately, I'm still finding
 that I cannot restart a container I have previously halted.  I have no
 problem with sysvinit and upstart systems on this host, so it is a
 container problem peculiar to systemd containers.  Continuing to
 research that problem.
 
  Lennart
 
  -- 
  Lennart Poettering - Red Hat, Inc.
 
 Regards,
 Mike

-- 
Michael H. Warfield (AI4NB) | (770) 985-6132 |  m...@wittsend.com
   /\/\|=mhw=|\/\/  | (678) 463-0932 |  http://www.wittsend.com/mhw/
   NIC whois: MHW9  | An optimist believes we live in the best of all
 PGP Key: 0x674627FF| possible worlds.  A pessimist is sure of it!




Re: [systemd-devel] Unable to run systemd in an LXC / cgroup container.

2012-10-25 Thread Lennart Poettering
On Thu, 25.10.12 11:59, Michael H. Warfield (m...@wittsend.com) wrote:

  http://wiki.1tux.org/wiki/Lxc/Installation#Additional_notes
 
  Unfortunately, in our case, merely getting a mount in there is a
  complication in that it also has to be populated but, at least, we
  understand the problem set now.
 
 Ok...  Serge and I were corresponding on the lxc-users list and he had a
 suggestion that worked but I consider to be a bit of a sub-optimal
 workaround.  Ironically, it was to mount devtmpfs on /dev.  We don't
 (currently) have a method to auto-populate a tmpfs mount with the needed
 devices and this provided it.  It does have a problem that makes me
 uncomfortable in that the container now has visibility into the
 host's /dev system.  I'm a security expert and I'm not comfortable with
 that solution even with the controls we have.  We can control access
 but still, not happy with that.

That's a pretty bad idea, access control to the device nodes in devtmpfs
is controlled by the host's udev instance. That means if your group/user
lists in the container and the host differ you already lost. Also access
control in udev is dynamic, due to stuff like uaccess and similar. You
really don't want to have that in the container, i.e. where devices
change ownership all the time with UIDs/GIDs that make no sense at all
in the container.

In general I think it's a good idea not to expose any real devices to
the container, but only the virtual ones that are programming
APIs. That means: no /dev/sda, or /dev/ttyS0, but /dev/null, /dev/zero,
/dev/random, /dev/urandom. And creating the latter in a tmpfs is quite
simple.
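
Sketched, for a freshly mounted tmpfs (the major/minor numbers are the
standard Linux assignments for these nodes; error handling collapsed
into one return):

#include <sys/stat.h>
#include <sys/sysmacros.h>

/* Sketch: create the API device nodes in a fresh tmpfs /dev. */
static int populate_dev(void) {
        if (mknod("/dev/null",    S_IFCHR | 0666, makedev(1, 3)) < 0 ||
            mknod("/dev/zero",    S_IFCHR | 0666, makedev(1, 5)) < 0 ||
            mknod("/dev/full",    S_IFCHR | 0666, makedev(1, 7)) < 0 ||
            mknod("/dev/random",  S_IFCHR | 0666, makedev(1, 8)) < 0 ||
            mknod("/dev/urandom", S_IFCHR | 0666, makedev(1, 9)) < 0 ||
            mknod("/dev/tty",     S_IFCHR | 0666, makedev(5, 0)) < 0)
                return -1;

        return 0;
}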

 If I run lxc-console (which attaches to one of the vtys) it gives me
 nothing.  Under sysvinit and upstart I get vty login prompts because
 they have started getty on those vtys.  This is important in case
 network access has not started for one reason or another and the
 container was started detached in the background.

The getty behaviour of systemd in containers is documented here:

http://www.freedesktop.org/wiki/Software/systemd/ContainerInterface

If LXC mounts ptys on top of the VT devices that's a really bad idea
too, since /dev/tty1 and friends expose a number of APIs beyond the mere
tty device that you cannot emulate with that. It includes files in /sys,
as well as /dev/vcs and /dev/vcsa, various ioctls, and so on. Heck, even
the most superficial of things, the $TERM variable will be
incorrect. LXC shouldn't do that.

LXC really shouldn't pretend a pty was a VT tty, it's not. From the
libvirt guys it has been proposed that we introduce a new env var to
pass to PID 1 of the container, that simply lists ptys to start gettys
on. That way we don't pretend anything about ttys that the ttys can't
hold and have a clean setup.
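
A hypothetical sketch of what that could look like from PID 1's side
(the variable name and the getty invocation are invented here purely
for illustration; this is the proposal, not an existing interface):

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical: spawn a getty on each pty listed in an env var. */
static void spawn_gettys_from_env(void) {
        char *list, *tty, *saveptr = NULL;

        list = getenv("CONTAINER_TTYS");        /* invented name */
        if (!list || !(list = strdup(list)))
                return;

        for (tty = strtok_r(list, " ", &saveptr); tty;
             tty = strtok_r(NULL, " ", &saveptr))
                if (fork() == 0) {
                        execlp("agetty", "agetty", tty, (char *) NULL);
                        _exit(1);
                }

        free(list);
}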

 I SUSPECT the hang condition is something to do with systemd trying to
 start an interactive console on /dev/console, which sysvinit and
 upstart do not do.

Yes, this is documented, please see the link I already posted, and which
I linked above a second time.

 I've got some more problems relating to shutting down containers, some
 of which may be related to mounting tmpfs on /run, to which /var/run is
 symlinked.  We're doing halt / restart detection by monitoring utmp
 in that directory but it looks like utmp isn't even in that directory
 anymore and mounting tmpfs on it was always problematical.  We may have
 to have a more generic method to detect when a container has shut down
 or is restarting in that case.

I can't parse this. The system call reboot() is virtualized for
containers just fine and the container manager (i.e. LXC) can check for
that easily.
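
Concretely: on recent kernels (3.4+), reboot() from inside a PID
namespace just terminates the namespace's init, and the parent can tell
restart from halt by the signal waitpid() reports - SIGHUP for restart,
SIGINT for halt/power-off. A sketch:

#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Sketch: tell a container reboot apart from a halt. */
static int handle_container_exit(pid_t init_pid) {
        int status;

        if (waitpid(init_pid, &status, 0) < 0)
                return -1;

        if (WIFSIGNALED(status) && WTERMSIG(status) == SIGHUP)
                return 1;   /* reboot requested: restart the container */
        if (WIFSIGNALED(status) && WTERMSIG(status) == SIGINT)
                return 0;   /* halt/poweroff: tear the container down */

        return 0;           /* init exited on its own */
}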

Lennart

-- 
Lennart Poettering - Red Hat, Inc.


Re: [systemd-devel] Unable to run systemd in an LXC / cgroup container.

2012-10-25 Thread Michael H. Warfield
On Thu, 2012-10-25 at 23:38 +0200, Lennart Poettering wrote:
 On Thu, 25.10.12 11:59, Michael H. Warfield (m...@wittsend.com) wrote:

  I've got some more problems relating to shutting down containers, some
  of which may be related to mounting tmpfs on /run, to which /var/run is
  symlinked.  We're doing halt / restart detection by monitoring utmp
  in that directory but it looks like utmp isn't even in that directory
  anymore and mounting tmpfs on it was always problematical.  We may have
  to have a more generic method to detect when a container has shut down
  or is restarting in that case.

 I can't parse this. The system call reboot() is virtualized for
 containers just fine and the container manager (i.e. LXC) can check for
 that easily.

The problem we have had was with differentiating between reboot and halt,
to either shut the container down cold or restart it.  You say
"easily" and yet we never came up with an easy solution and monitored
utmp instead for the next runlevel change.  What is your easy solution
for that problem?

 Lennart

 -- 
 Lennart Poettering - Red Hat, Inc.

Regards,
Mike
-- 
Michael H. Warfield (AI4NB) | (770) 985-6132 |  m...@wittsend.com
   /\/\|=mhw=|\/\/  | (678) 463-0932 |  http://www.wittsend.com/mhw/
   NIC whois: MHW9  | An optimist believes we live in the best of all
 PGP Key: 0x674627FF| possible worlds.  A pessimist is sure of it!




Re: [systemd-devel] Unable to run systemd in an LXC / cgroup container.

2012-10-22 Thread Lennart Poettering
On Sun, 21.10.12 17:25, Michael H. Warfield (m...@wittsend.com) wrote:

 Hello,
 
 This is being directed to the systemd-devel community but I'm cc'ing the
 lxc-users community and the Fedora community on this for their input as
 well.  I know it's not always good to cross post between multiple lists
 but this is of interest to all three communities who may have valuable
 input.
 
 I'm new to this particular list, just having joined after tracking a
 problem down to some systemd internals...
 
 Several people over the last year or two on the lxc-users list have been
 discussing trying to run certain distros (notably Fedora 16 and above,
 recent Arch Linux and possibly others) in LXC containers, virtualizing
 entire servers this way.  This is very similar to Virtuozzo / OpenVZ,
 only it's using the native Linux cgroups for the containers (primary
 reason I dumped OpenVZ was to avoid their custom patched kernels).
 These recent distros have switched to systemd for the main init process
 and this has proven to be disastrous for those of us using LXC and
 trying to install or update our containers.

Note that it is explicitly our intention to make running systemd inside
of containers as smooth as possible. The notes Kay linked summarize what
the container manager needs to do for best integration.

 To summarize the problem...  The LXC startup binary sets up various
 things for /dev and /dev/pts for the container to run properly and this
 works perfectly fine for SystemV start-up scripts and/or Upstart.
 Unfortunately, systemd has mounts of devtmpfs on /dev and devpts
 on /dev/pts which then break things horribly.  This is because the
 kernel currently lacks namespaces for devices and won't for some time to
 come (in design).  When devtmpfs gets mounted over top of /dev in the
 container, it then hijacks the host's console tty and several other
 devices which had been set up through bind mounts by LXC and should have
 been LEFT ALONE.

Please initialize a minimal tmpfs on /dev. systemd will then work fine.

 Yes!  I recognize that this problem with devtmpfs and lack of namespaces
 is a potential security problem anyways that could (and does) cause
 serious container-to-host problems.  We're just not going to get that
 fixed right away in the linux cgroups and namespaces.

No, devtmpfs really doesn't need updating, containers simply shouldn't
use it.

 How do we work around this problem in systemd where it has hard coded
 mounts in the binary that we can't override or configure?  Or is it
 there and I'm just missing it trying to examine the sources?  That's how
 I found where the problem lay.

systemd will make use of pre-existing mounts if they exist, and only
mount something new if they don't exist.

Note that there are reports that LXC has issues with the fact that newer
systemd enables shared mount propagation for all mounts by default (this
should actually be beneficial for containers as this ensures that new
mounts appear in the containers). LXC when run on such a system fails as
soon as it tries to use pivot_root(), as that is incompatible with
shared mount propagation. This needs fixing in LXC: it should use MS_MOVE
or MS_BIND to place the new root dir in / instead. A short term
work-around is to simply remount the root tree to private before
invoking LXC.
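
That work-around is a single mount(2) call (the shell equivalent is
"mount --make-rprivate /"); sketched:

#include <sys/mount.h>

/* Sketch: host-side work-around - make the whole mount tree private
 * again before invoking LXC, so its pivot_root() stops failing. */
static int make_root_private(void) {
        return mount(NULL, "/", NULL, MS_PRIVATE | MS_REC, NULL);
}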

Lennart

-- 
Lennart Poettering - Red Hat, Inc.


Re: [systemd-devel] Unable to run systemd in an LXC / cgroup container.

2012-10-22 Thread Michael H. Warfield
On Mon, 2012-10-22 at 16:11 +0200, Lennart Poettering wrote:
 On Sun, 21.10.12 17:25, Michael H. Warfield (m...@wittsend.com) wrote:
 
  Hello,
  
  This is being directed to the systemd-devel community but I'm cc'ing the
  lxc-users community and the Fedora community on this for their input as
  well.  I know it's not always good to cross post between multiple lists
  but this is of interest to all three communities who may have valuable
  input.
  
  I'm new to this particular list, just having joined after tracking a
  problem down to some systemd internals...
  
  Several people over the last year or two on the lxc-users list have been
  discussing trying to run certain distros (notably Fedora 16 and above,
  recent Arch Linux and possibly others) in LXC containers, virtualizing
  entire servers this way.  This is very similar to Virtuozzo / OpenVZ,
  only it's using the native Linux cgroups for the containers (primary
  reason I dumped OpenVZ was to avoid their custom patched kernels).
  These recent distros have switched to systemd for the main init process
  and this has proven to be disastrous for those of us using LXC and
  trying to install or update our containers.

 Note that it is explicitly our intention to make running systemd inside
 of containers as smooth as possible. The notes Kay linked summarize what
 the container manager needs to do for best integration.

  To summarize the problem...  The LXC startup binary sets up various
  things for /dev and /dev/pts for the container to run properly and this
  works perfectly fine for SystemV start-up scripts and/or Upstart.
  Unfortunately, systemd has mounts of devtmpfs on /dev and devpts
  on /dev/pts which then break things horribly.  This is because the
  kernel currently lacks namespaces for devices and won't for some time to
  come (in design).  When devtmpfs gets mounted over top of /dev in the
  container, it then hijacks the host's console tty and several other
  devices which had been set up through bind mounts by LXC and should have
  been LEFT ALONE.

 Please initialize a minimal tmpfs on /dev. systemd will then work fine.

My containers have a reasonable /dev that works with Upstart just fine
but they are not on tmpfs.  Is mounting tmpfs on /dev and recreating
that minimal /dev required?

  Yes!  I recognize that this problem with devtmpfs and lack of namespaces
  is a potential security problem anyways that could (and does) cause
  serious container-to-host problems.  We're just not going to get that
  fixed right away in the linux cgroups and namespaces.

 No, devtmpfs really doesn't need updating, containers simply shouldn't
 use it.

Ok, yeah.  That seems to be at the heart of the problem we're trying to
solve.

  How do we work around this problem in systemd where it has hard coded
  mounts in the binary that we can't override or configure?  Or is it
  there and I'm just missing it trying to examine the sources?  That's how
  I found where the problem lay.

 systemd will make use of pre-existing mounts if they exist, and only
 mount something new if they don't exist.

So you're saying that, if we have something mounted on /dev, that's what
prevents systemd from mounting devtmpfs on /dev?  That could be
problematical.  Tested out a couple of options there that didn't work.
That's going to take some effort.

 Note that there are reports that LXC has issues with the fact that newer
 systemd enables shared mount propagation for all mounts by default (this
 should actually be beneficial for containers as this ensures that new
 mounts appear in the containers). LXC when run on such a system fails as
 soon as it tries to use pivot_root(), as that is incompatible with
 shared mount propagation. This needs fixing in LXC: it should use MS_MOVE
 or MS_BIND to place the new root dir in / instead. A short term
 work-around is to simply remount the root tree to private before
 invoking LXC.

But, I have systemd running on my host system (F17) and containers with
sysvinit or upstart inits are all starting just fine.  That sounds like
it should impact all containers as pivot_root() is issued before systemd
in the container is started.  Or am I missing something here?  That
sounds like a problem for Serge and others to investigate further.  I'll
see about trying that workaround though.

 Lennart

 -- 
 Lennart Poettering - Red Hat, Inc.

Regards,
Mike
-- 
Michael H. Warfield (AI4NB) | (770) 985-6132 |  m...@wittsend.com
   /\/\|=mhw=|\/\/  | (678) 463-0932 |  http://www.wittsend.com/mhw/
   NIC whois: MHW9  | An optimist believes we live in the best of all
 PGP Key: 0x674627FF| possible worlds.  A pessimist is sure of it!




Re: [systemd-devel] Unable to run systemd in an LXC / cgroup container.

2012-10-22 Thread Lennart Poettering
On Mon, 22.10.12 11:48, Michael H. Warfield (m...@wittsend.com) wrote:

   To summarize the problem...  The LXC startup binary sets up various
   things for /dev and /dev/pts for the container to run properly and this
   works perfectly fine for SystemV start-up scripts and/or Upstart.
   Unfortunately, systemd has mounts of devtmpfs on /dev and devpts
   on /dev/pts which then break things horribly.  This is because the
   kernel currently lacks namespaces for devices and won't for some time to
   come (in design).  When devtmpfs gets mounted over top of /dev in the
   container, it then hijacks the host's console tty and several other
   devices which had been set up through bind mounts by LXC and should have
   been LEFT ALONE.
 
  Please initialize a minimal tmpfs on /dev. systemd will then work fine.
 
 My containers have a reasonable /dev that work with Upstart just fine
 but they are not on tmpfs.  Is mounting tmpfs on /dev and recreating
 that minimal /dev required?

Well, it can be any kind of mount really. Just needs to be a mount. And
the idea is to use tmpfs for this.

What /dev are you currently using? It's probably not a good idea to
reuse the host's /dev, since it contains so many device nodes that
should not be accessible/visible to the container.

  systemd will make use of pre-existing mounts if they exist, and only
  mount something new if they don't exist.
 
 So you're saying that, if we have something mounted on /dev, that's what
 prevents systemd from mounting devtmpfs on /dev?  

Yes.
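
(A sketch of one way a container manager can verify that such a mount
is in place - a path is its own mount point if it lives on a different
device than its parent. This is the generic stat() trick, not
necessarily systemd's exact test:)

#include <stdbool.h>
#include <sys/stat.h>

/* Sketch: is /dev a mount point of its own? */
static bool dev_is_mounted(void) {
        struct stat dev, root;

        if (stat("/dev", &dev) < 0 || stat("/", &root) < 0)
                return false;

        /* A different st_dev means a different filesystem is mounted
         * there.  (Bind mounts from the same fs would need a smarter
         * check.) */
        return dev.st_dev != root.st_dev;
}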

 But, I have systemd running on my host system (F17) and containers with
 sysvinit or upstart inits are all starting just fine.  That sounds like
 it should impact all containers as pivot_root() is issued before systemd
 in the container is started.  Or am I missing something here?  That
 sounds like a problem for Serge and others to investigate further.  I'll
 see about trying that workaround though.

The shared mount issue is an F18 thing, and it's about running LXC on a
systemd system, not about running systemd inside of LXC.

Lennart

-- 
Lennart Poettering - Red Hat, Inc.


Re: [systemd-devel] Unable to run systemd in an LXC / cgroup container.

2012-10-22 Thread Michael H. Warfield
On Mon, 2012-10-22 at 22:50 +0200, Lennart Poettering wrote:
 On Mon, 22.10.12 11:48, Michael H. Warfield (m...@wittsend.com) wrote:
 
To summarize the problem...  The LXC startup binary sets up various
things for /dev and /dev/pts for the container to run properly and this
works perfectly fine for SystemV start-up scripts and/or Upstart.
Unfortunately, systemd has mounts of devtmpfs on /dev and devpts
on /dev/pts which then break things horribly.  This is because the
kernel currently lacks namespaces for devices and won't for some time to
come (in design).  When devtmpfs gets mounted over top of /dev in the
container, it then hijacks the host's console tty and several other
devices which had been set up through bind mounts by LXC and should have
been LEFT ALONE.
  
   Please initialize a minimal tmpfs on /dev. systemd will then work fine.
  
  My containers have a reasonable /dev that work with Upstart just fine
  but they are not on tmpfs.  Is mounting tmpfs on /dev and recreating
  that minimal /dev required?

 Well, it can be any kind of mount really. Just needs to be a mount. And
 the idea is to use tmpfs for this.

 What /dev are you currently using? It's probably not a good idea to
 reuse the host's /dev, since it contains so many device nodes that
 should not be accessible/visible to the container.

Got it.  And that explains the problems we're seeing but also what I'm
seeing in some libvirt-lxc related pages, which is a separate and
distinct project in spite of the similarities in the name...

http://wiki.1tux.org/wiki/Lxc/Installation#Additional_notes

Unfortunately, in our case, merely getting a mount in there is a
complication in that it also has to be populated but, at least, we
understand the problem set now.

   systemd will make use of pre-existing mounts if they exist, and only
   mount something new if they don't exist.
  
  So you're saying that, if we have something mounted on /dev, that's what
  prevents systemd from mounting devtmpfs on /dev?  

 Yes.

  But, I have systemd running on my host system (F17) and containers with
  sysvinit or upstart inits are all starting just fine.  That sounds like
  it should impact all containers as pivot_root() is issued before systemd
  in the container is started.  Or am I missing something here?  That
  sounds like a problem for Serge and others to investigate further.  I'll
  see about trying that workaround though.

 The shared mount issue is an F18 thing, and it's about running LXC on a
 systemd system, not about running systemd inside of LXC.

Whew!  I'll deal with F18 when I need to deal with F18.  That explains
why my F17 hosts are running and gives Serge and others a chance to
address this, forewarned.  Thanks for that info.

 Lennart

 -- 
 Lennart Poettering - Red Hat, Inc.

Regards,
Mike
-- 
Michael H. Warfield (AI4NB) | (770) 985-6132 |  m...@wittsend.com
   /\/\|=mhw=|\/\/  | (678) 463-0932 |  http://www.wittsend.com/mhw/
   NIC whois: MHW9  | An optimist believes we live in the best of all
 PGP Key: 0x674627FF| possible worlds.  A pessimist is sure of it!




[systemd-devel] Unable to run systemd in an LXC / cgroup container.

2012-10-21 Thread Michael H. Warfield
Hello,

This is being directed to the systemd-devel community but I'm cc'ing the
lxc-users community and the Fedora community on this for their input as
well.  I know it's not always good to cross post between multiple lists
but this is of interest to all three communities who may have valuable
input.

I'm new to this particular list, just having joined after tracking a
problem down to some systemd internals...

Several people over the last year or two on the lxc-users list have been
discussing trying to run certain distros (notably Fedora 16 and above,
recent Arch Linux and possibly others) in LXC containers, virtualizing
entire servers this way.  This is very similar to Virtuozzo / OpenVZ,
only it's using the native Linux cgroups for the containers (primary
reason I dumped OpenVZ was to avoid their custom patched kernels).
These recent distros have switched to systemd for the main init process
and this has proven to be disastrous for those of us using LXC and
trying to install or update our containers.

To put it bluntly, it doesn't work and causes all sorts of problems on
the host.

To summarize the problem...  The LXC startup binary sets up various
things for /dev and /dev/pts for the container to run properly and this
works perfectly fine for SystemV start-up scripts and/or Upstart.
Unfortunately, systemd has mounts of devtmpfs on /dev and devpts
on /dev/pts which then break things horribly.  This is because the
kernel currently lacks namespaces for devices and won't for some time to
come (in design).  When devtmpfs gets mounted over top of /dev in the
container, it then hijacks the host's console tty and several other
devices which had been set up through bind mounts by LXC and should have
been LEFT ALONE.

Yes!  I recognize that this problem with devtmpfs and lack of namespaces
is a potential security problem anyways that could (and does) cause
serious container-to-host problems.  We're just not going to get that
fixed right away in the linux cgroups and namespaces.

How do we work around this problem in systemd where it has hard coded
mounts in the binary that we can't override or configure?  Or is it
there and I'm just missing it trying to examine the sources?  That's how
I found where the problem lay.

Regards,
Mike
-- 
Michael H. Warfield (AI4NB) | (770) 985-6132 |  m...@wittsend.com
   /\/\|=mhw=|\/\/  | (678) 463-0932 |  http://www.wittsend.com/mhw/
   NIC whois: MHW9  | An optimist believes we live in the best of all
 PGP Key: 0x674627FF| possible worlds.  A pessimist is sure of it!




Re: [systemd-devel] Unable to run systemd in an LXC / cgroup container.

2012-10-21 Thread Kay Sievers
On Sun, Oct 21, 2012 at 11:25 PM, Michael H. Warfield m...@wittsend.com wrote:
 This is being directed to the systemd-devel community but I'm cc'ing the
 lxc-users community and the Fedora community on this for their input as
 well.  I know it's not always good to cross post between multiple lists
 but this is of interest to all three communities who may have valuable
 input.

 I'm new to this particular list, just having joined after tracking a
 problem down to some systemd internals...

 Several people over the last year or two on the lxc-users list have been
 discussing trying to run certain distros (notably Fedora 16 and above,
 recent Arch Linux and possibly others) in LXC containers, virtualizing
 entire servers this way.  This is very similar to Virtuozzo / OpenVZ,
 only it's using the native Linux cgroups for the containers (primary
 reason I dumped OpenVZ was to avoid their custom patched kernels).
 These recent distros have switched to systemd for the main init process
 and this has proven to be disastrous for those of us using LXC and
 trying to install or update our containers.

 To put it bluntly, it doesn't work and causes all sorts of problems on
 the host.

 To summarize the problem...  The LXC startup binary sets up various
 things for /dev and /dev/pts for the container to run properly and this
 works perfectly fine for SystemV start-up scripts and/or Upstart.
 Unfortunately, systemd has mounts of devtmpfs on /dev and devpts
 on /dev/pts which then break things horribly.  This is because the
 kernel currently lacks namespaces for devices and won't for some time to
 come (in design).  When devtmpfs gets mounted over top of /dev in the
 container, it then hijacks the host's console tty and several other
 devices which had been set up through bind mounts by LXC and should have
 been LEFT ALONE.

 Yes!  I recognize that this problem with devtmpfs and lack of namespaces
 is a potential security problem anyways that could (and does) cause
 serious container-to-host problems.  We're just not going to get that
 fixed right away in the linux cgroups and namespaces.

 How do we work around this problem in systemd where it has hard coded
 mounts in the binary that we can't override or configure?  Or is it
 there and I'm just missing it trying to examine the sources?  That's how
 I found where the problem lay.

As a first step, this probably explains most of it:
  http://www.freedesktop.org/wiki/Software/systemd/ContainerInterface

Kay