Snippet of a conversation on linux-kernel relevant to implementing container support in toybox. Since Greg KH refuses to namespace devtmpfs (which would let each container mount its own instance and see just that container's devices), the suggestion is to make a /dev/container directory and create in it one subdirectory per container, which the container can bind mount on /dev; the host tool can then manage the devices for the container.

This is part of the reason I've been holding off on an mdev rewrite: still not sure what exactly it should _do_.

On 09/30/2013 10:36:50 AM, Michael H. Warfield wrote:
On Sun, 2013-09-29 at 13:06 -0700, Greg Kroah-Hartman wrote:
> On Sun, Sep 29, 2013 at 10:28:55PM +0300, Amir Goldstein wrote:
[snip]
You're right about the user space problem.  Something needs to manage
the devices in a coherent manner as devices come and go and as
containers come and go in asynchronous manner.  In my mind, the only
place for that is in the host.  "Non trivial" is a jaw dropping
understatement and I can see where you feel it would be impossible to
manage in applying namespaces to devtmpfs.  That leaves the user space
in the host.  I can see where it would be intractable in the kernel.

I may get beat mercilessly for suggesting this but, just as with
cgroups, if we create a subdirectory in devtmpfs for subsystem (LXC) and
container, we can then bind mount that subtree off of devtmpfs to the
container and then the host can map and manipulate the device subtree
into the container (even if the container is denied mknod capability).
That leaves the host to manage all the devices, which actually makes a
LOT of sense (to me) since it should be responsible for the devices and
the overall kernel operations.  That would be no different than needing
to configure device passthroughs for KVM / VirtualBox / VMware
hypervisors.

Example...  In the host I would have something like this...

/dev/lxc/
        romulus
        remus
        gemini
        janus

And then bind mount each of those subdirectories
to /var/lib/lxc/${Container}/rootfs/dev directory. Then map the devices
from the host /dev to the container /dev with mknod in the host and
relative symlinks.
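
The layout described above can be sketched roughly as follows. This is a minimal illustration, not anything LXC or toybox actually ships: the helper names (container_paths, relative_link, bind_subtree, map_device) are hypothetical, the paths are the ones given in the message, and the functions that touch the filesystem assume they run as root on the host.

```python
import os
import posixpath
import stat
import subprocess

HOST_DEV = "/dev"
LXC_DEV = "/dev/lxc"        # per-container subtrees, as proposed in the message
LXC_ROOT = "/var/lib/lxc"   # container root filesystems, as in the message


def container_paths(name):
    """Return (host-side device subtree, container /dev mount point)."""
    subtree = posixpath.join(LXC_DEV, name)
    target = posixpath.join(LXC_ROOT, name, "rootfs", "dev")
    return subtree, target


def relative_link(link_path, target_path):
    """Relative symlink target, so the link still resolves once the
    subtree is bind mounted somewhere else."""
    return posixpath.relpath(target_path, posixpath.dirname(link_path))


def bind_subtree(name):
    """Bind mount the container's subtree onto its /dev (needs root)."""
    subtree, target = container_paths(name)
    os.makedirs(subtree, exist_ok=True)
    subprocess.run(["mount", "--bind", subtree, target], check=True)


def map_device(name, host_node, links=()):
    """Duplicate a host device node inside the container's subtree.
    The host does the mknod, so the container needs no mknod capability."""
    subtree, _ = container_paths(name)
    st = os.stat(host_node)
    if not (stat.S_ISCHR(st.st_mode) or stat.S_ISBLK(st.st_mode)):
        raise ValueError(f"{host_node} is not a device node")
    node = posixpath.join(subtree, posixpath.relpath(host_node, HOST_DEV))
    os.makedirs(posixpath.dirname(node), exist_ok=True)
    # st_mode carries the file type (char/block) plus permission bits;
    # the permissions actually applied are still subject to the umask.
    os.mknod(node, st.st_mode, st.st_rdev)
    for link in links:
        link_path = posixpath.join(subtree, link)
        os.makedirs(posixpath.dirname(link_path), exist_ok=True)
        os.symlink(relative_link(link_path, node), link_path)
    return node
```

For example, map_device("romulus", "/dev/null") would create /dev/lxc/romulus/null on the host, which the container then sees as /dev/null once bind_subtree("romulus") has been run.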

That also (I think) helps me deal with some of the (mis)behavior of
systemd where it contains unconfigurable behavior (mounting devtmpfs)
controlled by "magic cookies" (/dev mounted on another major/minor
from / to disable it mounting devtmpfs).  I initially recoiled in horror
at the thought of overloading the devtmpfs subtree with container-based
subdirectories, devices, and symlinks, but the idea grew on me that this
might be better than what we're dealing with now of mounting tmpfs on
the /dev mount point in all these containers and then having to
populate them just to prevent systemd from creating collisions with
devtmpfs and the resulting violation of the container isolation.
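
The "magic cookie" mentioned above amounts to comparing the device that / lives on with the device that /dev lives on. A minimal sketch of that comparison follows; it only illustrates the check as the message describes it and is not taken from systemd's source. The function name dev_is_separate_mount is hypothetical.

```python
import os


def dev_is_separate_mount(root="/"):
    """True when <root>/dev is on a different device than <root>,
    i.e. something (tmpfs, devtmpfs, a bind mount of a devtmpfs
    subdirectory) is already mounted there."""
    root_dev = os.stat(root).st_dev
    dev_dev = os.stat(os.path.join(root, "dev")).st_dev
    return root_dev != dev_dev
```

A bind mount of a devtmpfs subdirectory reports the st_dev of devtmpfs rather than that of the container's root filesystem, so under this reading it would satisfy the check in the same way the current tmpfs workaround does.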

It DOES still leave the problem of dealing with udev rules in the
container and subsidiary device symlinks in the container which may not
correspond to the rules in the host.  That's still a problem in my mind
(but already present and minuscule compared to what we would be
solving).  I could pattern match everything coming out of udev in a
trigger and map devices and symlinks into the new subtree in the host
but I have no way to manage propagating the rules in the container down
into the processor in the host or a way to trigger those udev rules in
the containers.  Suggestions there might be nice (as well as the cat
calls).  I'm not sure I have it clear in my head yet how I would deal
with bringing up a container and then mapping all the required existing
devices over to it.
That's your user space problem in a nutshell.  That's easy to handle
with udev as things come and go but, when the user space comes after and
udev isn't processing triggers, how do I handle the mappings?  That's
also non-trivial in my mind.
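
One possible shape for the "container comes up after the devices already exist" case is to walk the host /dev once at container start and pattern match the existing nodes against a per-container list. The sketch below assumes such a list exists; the function names and the glob-style pattern format are hypothetical and are not udev rule syntax.

```python
import fnmatch
import os
import stat


def select_devices(names, patterns):
    """Keep the device names that match any of the container's patterns."""
    return [n for n in names
            if any(fnmatch.fnmatchcase(n, p) for p in patterns)]


def existing_devices(dev_root="/dev", skip=("lxc",)):
    """Yield paths (relative to dev_root) of device nodes that already
    exist, skipping the per-container subtrees themselves."""
    for dirpath, dirnames, filenames in os.walk(dev_root):
        if dirpath == dev_root:
            dirnames[:] = [d for d in dirnames if d not in skip]
        for fn in filenames:
            path = os.path.join(dirpath, fn)
            st = os.lstat(path)  # lstat: symlinks are not device nodes
            if stat.S_ISCHR(st.st_mode) or stat.S_ISBLK(st.st_mode):
                yield os.path.relpath(path, dev_root)
```

The same select_devices filter could be applied to names arriving from a udev trigger, so the hotplug path and the start-up path would share one set of patterns.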

Device creation would seem to be pretty trivial. Device removal, not so
much.  If I create another node on devtmpfs and that major/minor gets
removed, will it also get removed? I also have to remove the symlinks.
The removal process just feels more complicated in my mind.
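
If the host tool has to clean up after itself (an assumption here, since the message leaves open whether the kernel removes the extra nodes), removal could be sketched as two passes over a container's subtree: unlink the nodes with the departed major/minor, then unlink any symlinks left dangling. Function names are hypothetical.

```python
import os
import stat


def remove_nodes(subtree, kind, rdev):
    """Unlink device nodes of the given kind (stat.S_IFCHR or
    stat.S_IFBLK) whose major/minor matches rdev (see os.makedev)."""
    removed = []
    for dirpath, _dirs, files in os.walk(subtree):
        for fn in files:
            path = os.path.join(dirpath, fn)
            st = os.lstat(path)
            if stat.S_IFMT(st.st_mode) == kind and st.st_rdev == rdev:
                os.unlink(path)
                removed.append(path)
    return removed


def prune_dangling_symlinks(subtree):
    """Unlink symlinks whose target no longer exists."""
    removed = []
    for dirpath, dirs, files in os.walk(subtree):
        # Symlinks to directories are listed in dirs, the rest in files.
        for fn in dirs + files:
            path = os.path.join(dirpath, fn)
            if os.path.islink(path) and not os.path.exists(path):
                os.unlink(path)
                removed.append(path)
    return removed
```

Running remove_nodes first and prune_dangling_symlinks second means the symlink pass needs no knowledge of which links pointed at which device.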

Greg, I think you are absolutely right, this needs to be managed in user
space and not in kernel space and we do have the tools to do it.  I
think I can do some of it in a way that will suck less compared to how
we're (LXC is) doing it now. I'm just not so sure how comprehensive the
solution will be or how well it will work.

I've still got several other takeaways from that session to put a bow on
before really testing this idea further.  I really have not fully
fleshed this idea out and it's going to take me some time.  There may
also be some other corner cases I haven't considered. And then there's
Android.  Sigh...

And maybe I'm just totally off base and crazy.  Wouldn't be the first
time, won't be the last time.

> greg k-h

Regards,
Mike
--
Michael H. Warfield (AI4NB) | (770) 985-6132 |  [email protected]
/\/\|=mhw=|\/\/ | (678) 463-0932 | http://www.wittsend.com/mhw/
NIC whois: MHW9 | An optimist believes we live in the best of all
PGP Key: 0x674627FF | possible worlds. A pessimist is sure of it!

_______________________________________________
Toybox mailing list
[email protected]
http://lists.landley.net/listinfo.cgi/toybox-landley.net
