Re: [systemd-devel] Unable to run systemd in an LXC / cgroup container.
On Tue, 06.11.12 11:07, Michael H. Warfield (m...@wittsend.com) wrote:

> Here's where we've seen some problems in the past. It's not just mounts that are propagated but remounts as well. The problem arose that some of us had our containers on a separate partition. When we would shut a container down, that container tried to remount its file systems ro, which then propagated back into the host, causing the host's file system to become ro (this doesn't happen if you are running the containers on the host's root fs), and from there across into the other containers.
>
> Are you using MS_SHARED or MS_SLAVE for this? If you are using MS_SHARED, do you create a potential security problem where actions in the container can bleed into the state of the host and into other containers? That's highly undesirable.

The root namespace is MS_SHARED, and nspawn and libvirt-lxc containers are MS_SLAVE. That ensures mounts from the host propagate to the containers but not vice versa.

> > > as soon as it tries to use pivot_root(), as that is incompatible with shared mount propagation. This needs fixing in LXC: it should use MS_MOVE or MS_BIND to place the new root dir in / instead. A short term
> >
> > Actually not quite sure how this would work. It should be possible to set up a set of conditions to work around this, but the kernel checks at do_pivotroot are pretty harsh - mnt->mnt_parent of both the new root and the current root have to be not shared.
> >
> > So perhaps we actually first chroot into a dir whose parent is non-shared, then pivot_root from there? :) (Simple chroot in place of pivot_root still does not suffice, not only because of chroot escapes, but also because of different results in /proc/pid/mountinfo and friends.)
>
> Comments on Serge's points?

Don't use pivot_root. Instead use MS_MOVE to move the container root to /.

> At this point, we see where this will become problematical in Fedora 18, but it appears to already be problematical in NixOS, which another user is running and which runs systemd 195 in the host.

There's nothing really problematical with this. LXC should stop using pivot_root, and use MS_MOVE instead.

> We've had problems with chroot in the past due to chroot escapes and other problems years ago, as Serge mentioned.

chroot() is not useful for this. You should invoke chroot() once, to fix up the root after adjusting the namespace, but that's not the call that actually shifts the namespace around. That should be done with MS_MOVE. The code should look like this:

http://cgit.freedesktop.org/systemd/systemd/tree/src/nspawn/nspawn.c#n1264

Lennart

--
Lennart Poettering - Red Hat, Inc.
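To make the MS_MOVE suggestion concrete, here is a minimal sketch of that kind of root switch, following the general shape of the nspawn code linked above; the function name, the placement of the MS_SLAVE remount and the omitted error reporting are illustrative assumptions, not code copied from nspawn.

    #include <sys/mount.h>
    #include <unistd.h>

    /* Called inside the container's new mount namespace, with the container
     * root already prepared at `directory` (for example a bind mount of the
     * rootfs). Error reporting is omitted for brevity. */
    static int switch_root(const char *directory)
    {
        /* Keep receiving host mounts, but never propagate ours back. */
        if (mount(NULL, "/", NULL, MS_SLAVE | MS_REC, NULL) < 0)
            return -1;

        if (chdir(directory) < 0)
            return -1;

        /* Move the prepared root on top of / ... */
        if (mount(directory, "/", NULL, MS_MOVE, NULL) < 0)
            return -1;

        /* ... then fix up the process' root and cwd exactly once. */
        if (chroot(".") < 0)
            return -1;
        return chdir("/");
    }

The single chroot(".") afterwards only realigns the process' idea of the root with what MS_MOVE has already rearranged, which is exactly the "invoke chroot() once" point made above.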
Re: [systemd-devel] Unable to run systemd in an LXC / cgroup container.
On Mon, 2012-10-22 at 16:11 +0200, Lennart Poettering wrote:

> Note that there are reports that LXC has issues with the fact that newer systemd enables shared mount propagation for all mounts by default (this should actually be beneficial for containers as this ensures that new mounts appear in the containers). LXC when run on such a system fails as soon as it tries to use pivot_root(), as that is incompatible with shared mount propagation. This needs fixing in LXC: it should use MS_MOVE or MS_BIND to place the new root dir in / instead. A short-term work-around is to simply remount the root tree to private before invoking LXC.

In another thread, Serge had some heartburn over this shared mount propagation, which then rang a bell in my head about past problems we have seen.

On Mon, 2012-11-05 at 08:51 -0600, Serge Hallyn wrote:

> Quoting Michael H. Warfield (m...@wittsend.com):
> > ... This was from another thread with the systemd guys.
> >
> > On Mon, 2012-10-22 at 16:11 +0200, Lennart Poettering wrote:
> > > Note that there are reports that LXC has issues with the fact that newer systemd enables shared mount propagation for all mounts by default (this should actually be beneficial for containers as this ensures that new mounts appear in the containers). LXC when run on such a system fails
>
> MS_SLAVE does this as well. MS_SHARED means container mounts also propagate into the host, which is less desirable in most cases.

Here's where we've seen some problems in the past. It's not just mounts that are propagated but remounts as well. The problem arose that some of us had our containers on a separate partition. When we would shut a container down, that container tried to remount its file systems ro, which then propagated back into the host, causing the host's file system to become ro (this doesn't happen if you are running the containers on the host's root fs), and from there across into the other containers.

Are you using MS_SHARED or MS_SLAVE for this? If you are using MS_SHARED, do you create a potential security problem where actions in the container can bleed into the state of the host and into other containers? That's highly undesirable. If a mount in a container propagates back into the host and is then reflected to another container sharing that same mount tree (I have shared partitions specific to that sort of thing), does that create an information disclosure situation where one container mounts a new file system and the other container sees the new mount? I don't know if the mount propagation would reflect back up the shared tree or not, but I have certainly seen remounts do this. I don't see that as desirable. Maybe I'm misunderstanding how this is supposed to work, but I intend to test out those scenarios when I have a chance. I do know that, when testing that ro problem, I was able to remount a partition ro in one container and it would switch in the host and the other container, and I could then remount it rw in the other container and have it propagate back. Not good. Can you offer any clarity on this?

> > > as soon as it tries to use pivot_root(), as that is incompatible with shared mount propagation. This needs fixing in LXC: it should use MS_MOVE or MS_BIND to place the new root dir in / instead. A short term
> >
> > Actually not quite sure how this would work. It should be possible to set up a set of conditions to work around this, but the kernel checks at do_pivotroot are pretty harsh - mnt->mnt_parent of both the new root and the current root have to be not shared.
> >
> > So perhaps we actually first chroot into a dir whose parent is non-shared, then pivot_root from there? :) (Simple chroot in place of pivot_root still does not suffice, not only because of chroot escapes, but also because of different results in /proc/pid/mountinfo and friends.)

Comments on Serge's points?

At this point, we see where this will become problematical in Fedora 18, but it appears to already be problematical in NixOS, which another user is running and which runs systemd 195 in the host. We've had problems with chroot in the past due to chroot escapes and other problems years ago, as Serge mentioned.

Regards,
Mike

--
Michael H. Warfield (AI4NB) | m...@wittsend.com | http://www.wittsend.com/mhw/
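A side note on Serge's do_pivotroot concern: the kernel only refuses pivot_root() while the mounts involved are marked shared, so one possible arrangement (a sketch under that assumption, not necessarily what LXC ended up shipping) is to mark the tree slave or private inside the container's freshly cloned mount namespace before pivoting. The host-side equivalent, done once before starting containers, is the `mount --make-rprivate /` work-around Lennart describes above.

    #include <sys/mount.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* In the child after clone(CLONE_NEWNS): stop our copies of the mounts
     * from being marked shared, so the do_pivotroot checks pass again. */
    static int isolate_propagation(void)
    {
        /* MS_REC|MS_SLAVE keeps receiving host mounts but never sends ours
         * back; MS_REC|MS_PRIVATE would cut propagation in both directions. */
        return mount(NULL, "/", NULL, MS_REC | MS_SLAVE, NULL);
    }

    /* pivot_root(2) has no glibc wrapper, so call it via syscall(). */
    static int pivot_root(const char *new_root, const char *put_old)
    {
        return syscall(SYS_pivot_root, new_root, put_old);
    }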
Re: [systemd-devel] Unable to run systemd in an LXC / cgroup container.
'Twas brillig, and Michael H. Warfield at 26/10/12 18:18 did gyre and gimble:

> What the hell is this? /var/run is symlinked to /run and is mounted with a tmpfs.

Yup, that's how /var/run and /run are being handled these days. It provides a consistent space to pass info from the initrd over to the main system and has various other uses also.

If you want to ensure files are created in this folder, just drop a config file into /usr/lib/tmpfiles.d/ in the package in question. See man systemd-tmpfiles for more info.

Could be some packages are not fully upgraded to this concept in F17. As a non-Fedora user, I can't really comment on that specifically.

Col

--
Colin Guthrie  gmane(at)colin.guthr.ie  http://colin.guthr.ie/
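As an illustration of the tmpfiles.d mechanism, a package that needs a directory under /run at boot would ship a one-line drop-in like the following. The file name, mode and ownership here are made-up examples (the /var/run/netreport case from elsewhere in this thread is used just to make it concrete), not the actual Fedora packaging.

    # /usr/lib/tmpfiles.d/netreport.conf  (illustrative example)
    # Type  Path            Mode  UID   GID   Age
    d       /run/netreport  0775  root  root  -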
Re: [systemd-devel] Unable to run systemd in an LXC / cgroup container.
On Sat, 2012-10-27 at 19:44 +0100, Colin Guthrie wrote:

> 'Twas brillig, and Michael H. Warfield at 26/10/12 18:18 did gyre and gimble:
> > What the hell is this? /var/run is symlinked to /run and is mounted with a tmpfs.
>
> Yup, that's how /var/run and /run are being handled these days. It provides a consistent space to pass info from the initrd over to the main system and has various other uses also.

Interesting. I hadn't considered that aspect of it before. Very interesting.

> If you want to ensure files are created in this folder, just drop a config file into /usr/lib/tmpfiles.d/ in the package in question. See man systemd-tmpfiles for more info.

NOW THAT is something else I needed to know about! Thank you very very much! Learned something new. This whole thing has been a massive learning experience getting this container kick-started.

> Could be some packages are not fully upgraded to this concept in F17. As a non-Fedora user, I can't really comment on that specifically.

As it turns out, the kernel has had some of our patches applied that I wasn't aware of vis-a-vis reboot/halt, so that should no longer be an issue. I'm still struggling with the tmpfs on /dev thing and have run into a catch-22 with regard to that. I can mount tmpfs on /dev just fine and can populate it just fine in a post-mount hook, but then we're trying to mount a devpts file system on /dev/pts before we've had a chance to populate it, and it's then crashing on the mount. Sigh... I think that's going to have to wait for Serge or Daniel to comment on.

Regards,
Mike
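One way around the catch-22 just described, sketched here only to show the ordering rather than LXC's actual hook mechanism, is to have whatever populates the tmpfs /dev also create the pts mount point before devpts gets mounted. The rootfs path and mount options below are illustrative assumptions.

    #include <stdio.h>
    #include <sys/mount.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    /* Sketch of the ordering: mount the tmpfs, create the pts mount point
     * while populating /dev, and only then mount devpts on it. */
    static int setup_container_dev(const char *rootfs)  /* e.g. "/srv/lxc/alcove/rootfs" */
    {
        char dev[512], pts[512];

        snprintf(dev, sizeof(dev), "%s/dev", rootfs);
        if (mount("tmpfs", dev, "tmpfs", MS_NOSUID, "mode=755") < 0)
            return -1;

        /* ... create the device nodes here (see the mknod sketch further down) ... */

        snprintf(pts, sizeof(pts), "%s/dev/pts", rootfs);
        if (mkdir(pts, 0755) < 0)
            return -1;

        return mount("devpts", pts, "devpts", MS_NOSUID | MS_NOEXEC,
                     "newinstance,ptmxmode=0666,mode=620,gid=5");
    }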
Re: [systemd-devel] Unable to run systemd in an LXC / cgroup container.
On Thu, 2012-10-25 at 23:38 +0200, Lennart Poettering wrote:

> On Thu, 25.10.12 11:59, Michael H. Warfield (m...@wittsend.com) wrote:
> > http://wiki.1tux.org/wiki/Lxc/Installation#Additional_notes
> >
> > Unfortunately, in our case, merely getting a mount in there is a complication in that it also has to be populated but, at least, we understand the problem set now.
> >
> > Ok... Serge and I were corresponding on the lxc-users list and he had a suggestion that worked, but that I consider to be a bit of a sub-optimal workaround. Ironically, it was to mount devtmpfs on /dev. We don't (currently) have a method to auto-populate a tmpfs mount with the needed devices and this provided it. It does have a problem that makes me uncomfortable in that the container now has visibility into the host's /dev system. I'm a security expert and I'm not comfortable with that solution even with the controls we have. We can control access but, still, not happy with that.
>
> That's a pretty bad idea: access control to the device nodes in devtmpfs is controlled by the host's udev instance. That means if your group/user lists in the container and the host differ you already lost. Also, access control in udev is dynamic, due to stuff like uaccess and similar. You really don't want to have that in the container, i.e. where devices change ownership all the time with UIDs/GIDs that make no sense at all in the container.

Concur.

> In general I think it's a good idea not to expose any real devices to the container, but only the virtual ones that are programming APIs. That means: no /dev/sda or /dev/ttyS0, but /dev/null, /dev/zero, /dev/random, /dev/urandom. And creating the latter in a tmpfs is quite simple.
>
> > If I run lxc-console (which attaches to one of the vtys) it gives me nothing. Under sysvinit and upstart I get vty login prompts because they have started getty on those vtys. This is important in case network access has not started for one reason or another and the container was started detached in the background.
>
> The getty behaviour of systemd in containers is documented here:
>
> http://www.freedesktop.org/wiki/Software/systemd/ContainerInterface

Sorry. This is unacceptable. We need some way that these will be active and that you will be consistent with other containers.

> If LXC mounts ptys on top of the VT devices that's a really bad idea too, since /dev/tty1 and friends expose a number of APIs beyond the mere tty device that you cannot emulate with that. It includes files in /sys, as well as /dev/vcs and /dev/vcsa, various ioctls, and so on. Heck, even for the most superficial of things, the $TERM variable will be incorrect. LXC shouldn't do that.

REGARDLESS. I'm in this situation now testing what I thought was a hang condition (which is proving to be something else). I started a container detached, redirecting the console to a file (a parameter I was missing) and the log to another file (which I had been doing). But, for some reason, sshd is not starting up. I have no way to attach to the bloody console of the container, I have no gettys on a vty I can attach to using lxc-console, and I can't remote-access a container which, for all other intents and purposes, appears to be running fine. Parameterize this bloody thing so we can have control over it.

> LXC really shouldn't pretend a pty is a VT tty; it's not. From the libvirt guys it has been proposed that we introduce a new env var to pass to PID 1 of the container, that simply lists ptys to start gettys on. That way we don't pretend anything about ttys that the ttys can't hold, and have a clean setup.
>
> > I SUSPECT the hang condition is something to do with systemd trying to start an interactive console on /dev/console, which sysvinit and upstart do not do.
>
> Yes, this is documented, please see the link I already posted, and which I linked above a second time.
>
> > I've got some more problems relating to shutting down containers, some of which may be related to mounting tmpfs on /run, to which /var/run is symlinked. We're doing halt / restart detection by monitoring utmp in that directory, but it looks like utmp isn't even in that directory anymore, and mounting tmpfs on it was always problematical. We may have to have a more generic method to detect when a container has shut down or is restarting in that case.
>
> I can't parse this. The system call reboot() is virtualized for containers just fine and the container manager (i.e. LXC) can check for that easily.

--
Michael H. Warfield (AI4NB) | m...@wittsend.com
Re: [systemd-devel] Unable to run systemd in an LXC / cgroup container.
On Thu, 2012-10-25 at 23:38 +0200, Lennart Poettering wrote:

> On Thu, 25.10.12 11:59, Michael H. Warfield (m...@wittsend.com) wrote:
> > I've got some more problems relating to shutting down containers, some of which may be related to mounting tmpfs on /run, to which /var/run is symlinked. We're doing halt / restart detection by monitoring utmp in that directory, but it looks like utmp isn't even in that directory anymore, and mounting tmpfs on it was always problematical. We may have to have a more generic method to detect when a container has shut down or is restarting in that case.
>
> I can't parse this. The system call reboot() is virtualized for containers just fine and the container manager (i.e. LXC) can check for that easily.

I strongly suspect that the condition I'm dealing with (not being able to restart the container) is an artifact of the devtmpfs kludge. I'm seeing some errors relating to /dev/loop* being busy that seem to be related to hung resources, resulting in the inability to remove the zombie container. Disregard until I can get further information following a switch to a template-based setup.

Regards,
Mike
Re: [systemd-devel] Unable to run systemd in an LXC / cgroup container.
On Fri, 2012-10-26 at 11:58 -0400, Michael H. Warfield wrote:

> [...]
>
> REGARDLESS. I'm in this situation now testing what I thought was a hang condition (which is proving to be something else). I started a container detached, redirecting the console to a file (a parameter I was missing) and the log to another file (which I had been doing). But, for some reason, sshd is not starting up. I have no way to attach to the bloody console of the container, I have no gettys on a vty I can attach to using lxc-console, and I can't remote-access a container which, for all other intents and purposes, appears to be running fine. Parameterize this bloody thing so we can have control over it.

Here's another weirdism that's in your camp... The reason that sshd did not start was because the network did not start (IPv6 was up but IPv4 was not, and the startup of several services failed as a consequence).

Trying to restart the network manually resulted in this:

    [root@alcove mhw]# ifdown eth0
    ./network-functions: line 237: cd: /var/run/netreport: No such file or directory
    [root@alcove mhw]# ifup eth0
    ./network-functions: line 237: cd: /var/run/netreport: No such file or directory
    [root@alcove mhw]# ls /var/run/
    dbus  messagebus.pid  rpcbind.sock  systemd  user  log  mount  syslogd.pid  udev

What the hell is this? /var/run is symlinked to /run and is mounted with a tmpfs. So I created that directory and could then ifup the network and start sshd.

So I did a little check on the run levels... Hmmm... F17 container (Alcove) in an F17 host (Forest). WHAT is going ON here? Is this why the network didn't start?

    [root@forest mhw]# runlevel
    N 5
    [root@alcove mhw]# runlevel
    unknown
    [root@alcove mhw]# chkconfig
    Note: This output shows SysV services only and does not include native systemd services. SysV configuration data might be overridden by native systemd configuration.
    modules_dep     0:off  1:off  2:on   3:on   4:on   5:on   6:off
    netconsole      0:off  1:off  2:off  3:off  4:off  5:off  6:off
    network         0:off  1:off  2:off  3:on   4:off  5:off  6:off

Mike

--
Michael H. Warfield (AI4NB) | m...@wittsend.com
Re: [systemd-devel] Unable to run systemd in an LXC / cgroup container.
On Fri, 2012-10-26 at 12:11 -0400, Michael H. Warfield wrote:

> On Thu, 2012-10-25 at 23:38 +0200, Lennart Poettering wrote:
> > On Thu, 25.10.12 11:59, Michael H. Warfield (m...@wittsend.com) wrote:
> > > I SUSPECT the hang condition is something to do with systemd trying to start an interactive console on /dev/console, which sysvinit and upstart do not do.
> >
> > Yes, this is documented, please see the link I already posted, and which I linked above a second time.
>
> This may have been my fault. I was using the -o option to lxc-start (output logfile) and failed to specify the -c (console output redirect) option. It seems to fire up nicely (albeit with other problems) with that additional option. Continuing my research.

Confirming: using the -c option for the console file works. Unfortunately, thanks to there being no gettys on the ttys (so lxc-console does not work), no way to connect to that console redirect, and the failure of the network to start, I'm still trying to figure out just what is face-planting in a container I cannot access. :-/=/ Punching out the punch list one PUNCH at a time here.

> > > I've got some more problems relating to shutting down containers, some of which may be related to mounting tmpfs on /run, to which /var/run is symlinked. We're doing halt / restart detection by monitoring utmp in that directory, but it looks like utmp isn't even in that directory anymore, and mounting tmpfs on it was always problematical. We may have to have a more generic method to detect when a container has shut down or is restarting in that case.
> >
> > I can't parse this. The system call reboot() is virtualized for containers just fine and the container manager (i.e. LXC) can check for that easily.

Apparently, in recent kernels, we can. Unfortunately, I'm still finding that I cannot restart a container I have previously halted. I have no problem with sysvinit and upstart systems on this host, so it is a container problem peculiar to systemd containers. Continuing to research that problem.

Regards,
Mike
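For anyone retracing this, the combination being described (detached start, log written to one file, console redirected to another) would be invoked roughly as below; the container name and paths are invented for the example, and the flag meanings are taken only from the description in this thread.

    lxc-start -n alcove -d -o /var/log/lxc/alcove.log -c /var/log/lxc/alcove.console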
Re: [systemd-devel] Unable to run systemd in an LXC / cgroup container.
On Thu, 25.10.12 11:59, Michael H. Warfield (m...@wittsend.com) wrote:

> http://wiki.1tux.org/wiki/Lxc/Installation#Additional_notes
>
> Unfortunately, in our case, merely getting a mount in there is a complication in that it also has to be populated but, at least, we understand the problem set now.
>
> Ok... Serge and I were corresponding on the lxc-users list and he had a suggestion that worked, but that I consider to be a bit of a sub-optimal workaround. Ironically, it was to mount devtmpfs on /dev. We don't (currently) have a method to auto-populate a tmpfs mount with the needed devices and this provided it. It does have a problem that makes me uncomfortable in that the container now has visibility into the host's /dev system. I'm a security expert and I'm not comfortable with that solution even with the controls we have. We can control access but, still, not happy with that.

That's a pretty bad idea: access control to the device nodes in devtmpfs is controlled by the host's udev instance. That means if your group/user lists in the container and the host differ you already lost. Also, access control in udev is dynamic, due to stuff like uaccess and similar. You really don't want to have that in the container, i.e. where devices change ownership all the time with UIDs/GIDs that make no sense at all in the container.

In general I think it's a good idea not to expose any real devices to the container, but only the virtual ones that are programming APIs. That means: no /dev/sda or /dev/ttyS0, but /dev/null, /dev/zero, /dev/random, /dev/urandom. And creating the latter in a tmpfs is quite simple.

> If I run lxc-console (which attaches to one of the vtys) it gives me nothing. Under sysvinit and upstart I get vty login prompts because they have started getty on those vtys. This is important in case network access has not started for one reason or another and the container was started detached in the background.

The getty behaviour of systemd in containers is documented here:

http://www.freedesktop.org/wiki/Software/systemd/ContainerInterface

If LXC mounts ptys on top of the VT devices that's a really bad idea too, since /dev/tty1 and friends expose a number of APIs beyond the mere tty device that you cannot emulate with that. It includes files in /sys, as well as /dev/vcs and /dev/vcsa, various ioctls, and so on. Heck, even for the most superficial of things, the $TERM variable will be incorrect. LXC shouldn't do that.

LXC really shouldn't pretend a pty is a VT tty; it's not. From the libvirt guys it has been proposed that we introduce a new env var to pass to PID 1 of the container, that simply lists ptys to start gettys on. That way we don't pretend anything about ttys that the ttys can't hold, and have a clean setup.

> I SUSPECT the hang condition is something to do with systemd trying to start an interactive console on /dev/console, which sysvinit and upstart do not do.

Yes, this is documented, please see the link I already posted, and which I linked above a second time.

> I've got some more problems relating to shutting down containers, some of which may be related to mounting tmpfs on /run, to which /var/run is symlinked. We're doing halt / restart detection by monitoring utmp in that directory, but it looks like utmp isn't even in that directory anymore, and mounting tmpfs on it was always problematical. We may have to have a more generic method to detect when a container has shut down or is restarting in that case.

I can't parse this. The system call reboot() is virtualized for containers just fine and the container manager (i.e. LXC) can check for that easily.

Lennart

--
Lennart Poettering - Red Hat, Inc.
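To illustrate the "creating the latter in a tmpfs is quite simple" remark, a container manager could populate a minimal, API-only /dev roughly as follows. This is a sketch using the conventional device numbers; it is not code from any particular LXC or systemd version, and the rootfs path handling is illustrative.

    #include <stdio.h>
    #include <sys/mount.h>
    #include <sys/stat.h>
    #include <sys/sysmacros.h>
    #include <sys/types.h>

    /* Populate a minimal, API-only /dev for a container rooted at `root`.
     * Requires CAP_MKNOD; error reporting is trimmed for brevity. */
    static int populate_minimal_dev(const char *root)
    {
        static const struct { const char *name; unsigned maj, min; } nodes[] = {
            { "null",    1, 3 }, { "zero",    1, 5 }, { "full", 1, 7 },
            { "random",  1, 8 }, { "urandom", 1, 9 }, { "tty",  5, 0 },
        };
        char path[512];
        size_t i;

        snprintf(path, sizeof(path), "%s/dev", root);
        if (mount("tmpfs", path, "tmpfs", MS_NOSUID, "mode=755") < 0)
            return -1;

        for (i = 0; i < sizeof(nodes) / sizeof(nodes[0]); i++) {
            snprintf(path, sizeof(path), "%s/dev/%s", root, nodes[i].name);
            if (mknod(path, S_IFCHR | 0666, makedev(nodes[i].maj, nodes[i].min)) < 0)
                return -1;
        }
        return 0;
    }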
Re: [systemd-devel] Unable to run systemd in an LXC / cgroup container.
On Thu, 2012-10-25 at 23:38 +0200, Lennart Poettering wrote:

> On Thu, 25.10.12 11:59, Michael H. Warfield (m...@wittsend.com) wrote:
> > I've got some more problems relating to shutting down containers, some of which may be related to mounting tmpfs on /run, to which /var/run is symlinked. We're doing halt / restart detection by monitoring utmp in that directory, but it looks like utmp isn't even in that directory anymore, and mounting tmpfs on it was always problematical. We may have to have a more generic method to detect when a container has shut down or is restarting in that case.
>
> I can't parse this. The system call reboot() is virtualized for containers just fine and the container manager (i.e. LXC) can check for that easily.

The problem we have had was with differentiating between a reboot and a halt, to either shut the container down cold or restart it. You say "easily", and yet we never came up with an easy solution and monitored utmp instead for the next runlevel change. What is your easy solution for that problem?

Regards,
Mike
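For context on the "check for that easily" claim: with the pid-namespace reboot handling that went into kernels around 3.4, my understanding is that the container's init is terminated with SIGHUP when it requests a reboot and with SIGINT for halt or power-off, so the distinction Mike asks about can be read from the init's wait status. A sketch of a container manager's check, under that assumption:

    #include <signal.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    /* pid is the container's init (PID 1 of its pid namespace).
     * Returns 1 if the container asked to reboot, 0 if it halted or powered
     * off, -1 on anything unexpected. Assumes the pidns reboot() virtualization. */
    static int wait_for_container(pid_t pid)
    {
        int status;

        if (waitpid(pid, &status, 0) < 0)
            return -1;

        if (WIFSIGNALED(status)) {
            if (WTERMSIG(status) == SIGHUP)
                return 1;   /* reboot(RESTART): restart the container */
            if (WTERMSIG(status) == SIGINT)
                return 0;   /* reboot(HALT/POWER_OFF): tear it down */
        }
        return WIFEXITED(status) ? 0 : -1;
    }

This would be the "recent kernels" behaviour Mike refers to a few messages up; on older kernels without that virtualization, something like the utmp monitoring he describes would still be needed.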
Re: [systemd-devel] Unable to run systemd in an LXC / cgroup container.
On Sun, 21.10.12 17:25, Michael H. Warfield (m...@wittsend.com) wrote:

> Hello,
>
> This is being directed to the systemd-devel community, but I'm cc'ing the lxc-users community and the Fedora community on this for their input as well. I know it's not always good to cross-post between multiple lists, but this is of interest to all three communities, who may have valuable input. I'm new to this particular list, having just joined after tracking a problem down to some systemd internals...
>
> Several people over the last year or two on the lxc-users list have had discussions about trying to run certain distros (notably Fedora 16 and above, recent Arch Linux, and possibly others) in LXC containers, virtualizing entire servers this way. This is very similar to Virtuozzo / OpenVZ, only it's using the native Linux cgroups for the containers (the primary reason I dumped OpenVZ was to avoid their custom patched kernels). These recent distros have switched to systemd for the main init process, and this has proven to be disastrous for those of us using LXC and trying to install or update our containers.

Note that it is explicitly our intention to make running systemd inside of containers as smooth as possible. The notes Kay linked summarize what the container manager needs to do for best integration.

> To summarize the problem... The LXC startup binary sets up various things for /dev and /dev/pts for the container to run properly, and this works perfectly fine for SystemV start-up scripts and/or Upstart. Unfortunately, systemd has mounts of devtmpfs on /dev and devpts on /dev/pts which then break things horribly. This is because the kernel currently lacks namespaces for devices and won't have them for some time to come (they're in design). When devtmpfs gets mounted over top of /dev in the container, it then hijacks the host's console tty and several other devices which had been set up through bind mounts by LXC and should have been LEFT ALONE.

Please initialize a minimal tmpfs on /dev. systemd will then work fine.

> Yes! I recognize that this problem with devtmpfs and lack of namespaces is a potential security problem anyway that could (and does) cause serious container-to-host problems. We're just not going to get that fixed right away in the Linux cgroups and namespaces.

No, devtmpfs really doesn't need updating, containers simply shouldn't use it.

> How do we work around this problem in systemd where it has hard-coded mounts in the binary that we can't override or configure? Or is it there and I'm just missing it trying to examine the sources? That's how I found where the problem lay.

systemd will make use of pre-existing mounts if they exist, and only mount something new if they don't exist.

Note that there are reports that LXC has issues with the fact that newer systemd enables shared mount propagation for all mounts by default (this should actually be beneficial for containers as this ensures that new mounts appear in the containers). LXC when run on such a system fails as soon as it tries to use pivot_root(), as that is incompatible with shared mount propagation. This needs fixing in LXC: it should use MS_MOVE or MS_BIND to place the new root dir in / instead. A short-term work-around is to simply remount the root tree to private before invoking LXC.

Lennart

--
Lennart Poettering - Red Hat, Inc.
Re: [systemd-devel] Unable to run systemd in an LXC / cgroup container.
On Mon, 2012-10-22 at 16:11 +0200, Lennart Poettering wrote:

> On Sun, 21.10.12 17:25, Michael H. Warfield (m...@wittsend.com) wrote:
> > [...]
> >
> > To summarize the problem... The LXC startup binary sets up various things for /dev and /dev/pts for the container to run properly, and this works perfectly fine for SystemV start-up scripts and/or Upstart. Unfortunately, systemd has mounts of devtmpfs on /dev and devpts on /dev/pts which then break things horribly. [...]
>
> Please initialize a minimal tmpfs on /dev. systemd will then work fine.

My containers have a reasonable /dev that works with Upstart just fine, but they are not on tmpfs. Is mounting tmpfs on /dev and recreating that minimal /dev required?

> > Yes! I recognize that this problem with devtmpfs and lack of namespaces is a potential security problem anyway that could (and does) cause serious container-to-host problems. We're just not going to get that fixed right away in the Linux cgroups and namespaces.
>
> No, devtmpfs really doesn't need updating, containers simply shouldn't use it.

Ok, yeah. That seems to be at the heart of the problem we're trying to solve.

> > How do we work around this problem in systemd where it has hard-coded mounts in the binary that we can't override or configure? Or is it there and I'm just missing it trying to examine the sources? That's how I found where the problem lay.
>
> systemd will make use of pre-existing mounts if they exist, and only mount something new if they don't exist.

So you're saying that, if we have something mounted on /dev, that's what prevents systemd from mounting devtmpfs on /dev? That could be problematical. I tested out a couple of options there that didn't work. That's going to take some effort.

> Note that there are reports that LXC has issues with the fact that newer systemd enables shared mount propagation for all mounts by default (this should actually be beneficial for containers as this ensures that new mounts appear in the containers). LXC when run on such a system fails as soon as it tries to use pivot_root(), as that is incompatible with shared mount propagation. This needs fixing in LXC: it should use MS_MOVE or MS_BIND to place the new root dir in / instead. A short-term work-around is to simply remount the root tree to private before invoking LXC.

But I have systemd running on my host system (F17), and containers with sysvinit or upstart inits are all starting just fine. That sounds like it should impact all containers, as pivot_root() is issued before systemd in the container is started. Or am I missing something here? That sounds like a problem for Serge and others to investigate further. I'll see about trying that workaround, though.

Regards,
Mike
Re: [systemd-devel] Unable to run systemd in an LXC / cgroup container.
On Mon, 22.10.12 11:48, Michael H. Warfield (m...@wittsend.com) wrote:

> > > To summarize the problem... The LXC startup binary sets up various things for /dev and /dev/pts for the container to run properly, and this works perfectly fine for SystemV start-up scripts and/or Upstart. Unfortunately, systemd has mounts of devtmpfs on /dev and devpts on /dev/pts which then break things horribly. [...]
> >
> > Please initialize a minimal tmpfs on /dev. systemd will then work fine.
>
> My containers have a reasonable /dev that works with Upstart just fine, but they are not on tmpfs. Is mounting tmpfs on /dev and recreating that minimal /dev required?

Well, it can be any kind of mount really. It just needs to be a mount. And the idea is to use tmpfs for this. What /dev are you currently using? It's probably not a good idea to reuse the host's /dev, since it contains so many device nodes that should not be accessible/visible to the container.

> > systemd will make use of pre-existing mounts if they exist, and only mount something new if they don't exist.
>
> So you're saying that, if we have something mounted on /dev, that's what prevents systemd from mounting devtmpfs on /dev?

Yes.

> But I have systemd running on my host system (F17), and containers with sysvinit or upstart inits are all starting just fine. That sounds like it should impact all containers, as pivot_root() is issued before systemd in the container is started. Or am I missing something here? That sounds like a problem for Serge and others to investigate further. I'll see about trying that workaround, though.

The shared issue is F18, and it's about running LXC on a systemd system, not about running systemd inside of LXC.

Lennart

--
Lennart Poettering - Red Hat, Inc.
Re: [systemd-devel] Unable to run systemd in an LXC / cgroup container.
On Mon, 2012-10-22 at 22:50 +0200, Lennart Poettering wrote:

> On Mon, 22.10.12 11:48, Michael H. Warfield (m...@wittsend.com) wrote:
> > My containers have a reasonable /dev that works with Upstart just fine, but they are not on tmpfs. Is mounting tmpfs on /dev and recreating that minimal /dev required?
>
> Well, it can be any kind of mount really. It just needs to be a mount. And the idea is to use tmpfs for this. What /dev are you currently using? It's probably not a good idea to reuse the host's /dev, since it contains so many device nodes that should not be accessible/visible to the container.

Got it. And that explains the problems we're seeing, but also what I'm seeing in some libvirt-lxc-related pages (a separate and distinct project in spite of the similarities in the name)...

http://wiki.1tux.org/wiki/Lxc/Installation#Additional_notes

Unfortunately, in our case, merely getting a mount in there is a complication in that it also has to be populated but, at least, we understand the problem set now.

> > So you're saying that, if we have something mounted on /dev, that's what prevents systemd from mounting devtmpfs on /dev?
>
> Yes.
>
> > But I have systemd running on my host system (F17), and containers with sysvinit or upstart inits are all starting just fine. That sounds like it should impact all containers, as pivot_root() is issued before systemd in the container is started. Or am I missing something here? That sounds like a problem for Serge and others to investigate further. I'll see about trying that workaround, though.
>
> The shared issue is F18, and it's about running LXC on a systemd system, not about running systemd inside of LXC.

Whew! I'll deal with F18 when I need to deal with F18. That explains why my F17 hosts are running, and it gives Serge and others a chance to address this, forewarned. Thanks for that info.

Regards,
Mike
[systemd-devel] Unable to run systemd in an LXC / cgroup container.
Hello,

This is being directed to the systemd-devel community, but I'm cc'ing the lxc-users community and the Fedora community on this for their input as well. I know it's not always good to cross-post between multiple lists, but this is of interest to all three communities, who may have valuable input. I'm new to this particular list, having just joined after tracking a problem down to some systemd internals...

Several people over the last year or two on the lxc-users list have had discussions about trying to run certain distros (notably Fedora 16 and above, recent Arch Linux, and possibly others) in LXC containers, virtualizing entire servers this way. This is very similar to Virtuozzo / OpenVZ, only it's using the native Linux cgroups for the containers (the primary reason I dumped OpenVZ was to avoid their custom patched kernels). These recent distros have switched to systemd for the main init process, and this has proven to be disastrous for those of us using LXC and trying to install or update our containers. To put it bluntly, it doesn't work and causes all sorts of problems on the host.

To summarize the problem... The LXC startup binary sets up various things for /dev and /dev/pts for the container to run properly, and this works perfectly fine for SystemV start-up scripts and/or Upstart. Unfortunately, systemd has mounts of devtmpfs on /dev and devpts on /dev/pts which then break things horribly. This is because the kernel currently lacks namespaces for devices and won't have them for some time to come (they're in design). When devtmpfs gets mounted over top of /dev in the container, it then hijacks the host's console tty and several other devices which had been set up through bind mounts by LXC and should have been LEFT ALONE.

Yes! I recognize that this problem with devtmpfs and lack of namespaces is a potential security problem anyway that could (and does) cause serious container-to-host problems. We're just not going to get that fixed right away in the Linux cgroups and namespaces.

How do we work around this problem in systemd where it has hard-coded mounts in the binary that we can't override or configure? Or is it there and I'm just missing it trying to examine the sources? That's how I found where the problem lay.

Regards,
Mike

--
Michael H. Warfield (AI4NB) | m...@wittsend.com | http://www.wittsend.com/mhw/
Re: [systemd-devel] Unable to run systemd in an LXC / cgroup container.
On Sun, Oct 21, 2012 at 11:25 PM, Michael H. Warfield <m...@wittsend.com> wrote:

> This is being directed to the systemd-devel community, but I'm cc'ing the lxc-users community and the Fedora community on this for their input as well. [...]
>
> How do we work around this problem in systemd where it has hard-coded mounts in the binary that we can't override or configure? Or is it there and I'm just missing it trying to examine the sources? That's how I found where the problem lay.

As a first step, this probably explains most of it:

http://www.freedesktop.org/wiki/Software/systemd/ContainerInterface

Kay