Re: [Toybox] [lxc-devel] Device Namespaces

Rob Landley Sun, 13 Oct 2013 18:10:41 -0700

Snippet of a conversation on linux-kernel relevant to implementing container support in toybox. Since Greg KH refuses to namespace devtmpfs (so each container can mount its own and see just the container's devices), this suggestion is to make a /dev/container directory within which you create a subdirectory each container can bind mount on /dev, and then the host tool can manage the devices for the container.

This is part of the reason I've been holding off on an mdev rewrite: still not sure what exactly it should _do_.


On 09/30/2013 10:36:50 AM, Michael H. Warfield wrote:

On Sun, 2013-09-29 at 13:06 -0700, Greg Kroah-Hartman wrote:
> On Sun, Sep 29, 2013 at 10:28:55PM +0300, Amir Goldstein wrote:

[snip]

You're right about the user space problem.  Something needs to manage
the devices in a coherent manner as devices come and go and as
containers come and go in an asynchronous manner.  In my mind, the only
place for that is in the host.  "Non trivial" is a jaw dropping
understatement and I can see where you feel it would be impossible to
manage in applying namespaces to devtmpfs.  That leaves the user space
in the host.  I can see where it would be intractable in the kernel.

I may get beat mercilessly for suggesting this but, just as with
cgroups, if we create a subdirectory in devtmpfs for subsystem (LXC) and
container, we can then bind mount that subtree off of devtmpfs to the
container and then the host can map and manipulate the device subtree
into the container (even if the container is denied mknod capability).
That leaves the host to manage all the devices, which actually makes a
LOT of sense (to me) since it should be responsible for the devices and
the overall kernel operations.  That would be no different than needing
to configure device passthroughs for KVM / VirtualBox / VMware
hypervisors.

Example...  In the host I would have something like this...

/dev/lxc/
        romulus
        remus
        gemini
        janus

And then bind mount each of those subdirectories
to /var/lib/lxc/${Container}/rootfs/dev directory.  Then map the devices
from the host /dev to the container /dev with mknod in the host and
relative symlinks.
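
A minimal sketch of the layout described above, in Python for brevity.  The
function names and the example link name are hypothetical, and the code only
computes the paths and the arguments that would be handed to mount, mknod and
symlink; the privileged calls themselves are left to the host tool.

```python
import os
import posixpath
import stat

LXC_DEV = "/dev/lxc"       # per-subsystem subtree on the host devtmpfs
LXC_ROOT = "/var/lib/lxc"  # container rootfs location used in the example


def container_dev_dir(container):
    """Host-side directory that becomes the container's /dev."""
    return posixpath.join(LXC_DEV, container)


def bind_mount_command(container):
    """Command the host would run to expose the subtree as the container's /dev."""
    target = posixpath.join(LXC_ROOT, container, "rootfs", "dev")
    return ["mount", "--bind", container_dev_dir(container), target]


def plan_node(container, name, kind, major, minor, mode=0o660):
    """Arguments for os.mknod() that would create one device in the subtree."""
    type_bits = stat.S_IFCHR if kind == "char" else stat.S_IFBLK
    path = posixpath.join(container_dev_dir(container), name)
    return path, type_bits | mode, os.makedev(major, minor)


def relative_link(container, link_name, node_name):
    """Relative symlink target, so the link still resolves after the bind mount."""
    base = container_dev_dir(container)
    link = posixpath.join(base, link_name)
    node = posixpath.join(base, node_name)
    return link, posixpath.relpath(node, posixpath.dirname(link))
```

The links are kept relative because an absolute target such as
/dev/lxc/romulus/sda would not exist inside the container, where the same
subtree appears as /dev.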

That also (I think) helps me deal with some of the (mis)behavior of
systemd where it contains unconfigurable behavior (mounting devtmpfs)
controlled by "magic cookies" (/dev mounted on another major/minor
from / to disable it mounting devtmpfs).  I initially recoiled in horror
at the thought of overloading the devtmpfs subtree with container-based
subdirectories, devices, and symlinks but the idea grew on me that this
might be better than what we're dealing with now of mounting tmpfs on
the /dev mount point in all these containers and then having to
populate them just to prevent systemd from creating collisions with
devtmpfs and the resulting violation of the container isolation.

It DOES still leave the problem of dealing with udev rules in the
container and subsidiary device symlinks in the container which may not
correspond to the rules in the host.  That's still a problem in my mind
(but already present and minuscule compared to what we would be solving).
I could pattern match everything coming out of udev in a trigger and map
devices and symlinks into the new subtree in the host but I have no way
to manage propagating the rules in the container down into the processor
in the host or a way to trigger those udev rules in the containers.
Suggestions there might be nice (as well as the cat calls).  I'm not
sure I have it clear in my head yet how I would deal with bringing up a
container and then mapping all the required existing devices over to it.
That's your user space problem in a nutshell.  That's easy to handle
with udev as things come and go but, when the user space comes after and
udev isn't processing triggers, how do I handle the mappings?  That's
also non-trivial in my mind.
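
One possible way to approach the "container comes up after udev is done"
case, sketched in Python: take a snapshot of the device nodes that already
exist under the host /dev and replay them into the new subtree.  This is an
assumption about how a host tool might do it, not something proposed in the
thread, and the function name and the skipped "lxc" directory are
hypothetical.

```python
import os
import stat


def snapshot_devices(dev_root, skip=("lxc",)):
    """Return (relative path, kind, major, minor) for each device node under dev_root."""
    found = []
    for dirpath, dirnames, filenames in os.walk(dev_root):
        if dirpath == dev_root:
            # Do not descend into the per-container subtrees themselves.
            dirnames[:] = [d for d in dirnames if d not in skip]
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat: symlinks are not followed
            if stat.S_ISCHR(st.st_mode):
                kind = "char"
            elif stat.S_ISBLK(st.st_mode):
                kind = "block"
            else:
                continue  # regular files, symlinks, sockets, FIFOs
            rel = os.path.relpath(path, dev_root)
            found.append((rel, kind, os.major(st.st_rdev), os.minor(st.st_rdev)))
    return sorted(found)
```

The snapshot only covers the nodes; deciding which of them a given container
is allowed to see, and recreating the symlinks, is still up to the host tool.
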

Device creation would seem to be pretty trivial.  Device removal, not so
much.  If I create another node on devtmpfs and that major/minor gets
removed, will it also get removed?  I also have to remove the symlinks.
The removal process just feels more complicated in my mind.
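
A rough sketch of what the removal side could look like, in Python.  It
assumes the answer to the question above is "no", i.e. that a node made with
mknod is an ordinary inode which nothing cleans up when the device goes away,
so the host tool has to do it.  The function name and arguments are
hypothetical.

```python
import os
import stat


def remove_device(subtree, kind, major, minor):
    """Unlink nodes for kind major:minor under subtree, then dangling symlinks."""
    is_kind = stat.S_ISCHR if kind == "char" else stat.S_ISBLK
    gone = os.makedev(major, minor)
    removed = []
    entries = []
    for dirpath, _dirs, files in os.walk(subtree):
        entries.extend(os.path.join(dirpath, f) for f in files)
    # Pass 1: device nodes matching the departed major/minor.
    for path in entries:
        st = os.lstat(path)
        if is_kind(st.st_mode) and st.st_rdev == gone:
            os.unlink(path)
            removed.append(path)
    # Pass 2: symlinks whose target no longer exists.  This also sweeps up
    # links that were already dangling before this call.
    for path in entries:
        if os.path.islink(path) and not os.path.exists(path):
            os.unlink(path)
            removed.append(path)
    return removed
```

Running the symlink pass after the node pass is what catches the links that
pointed at the node just removed.
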

Greg, I think you are absolutely right, this needs to be managed in user
space and not in kernel space and we do have the tools to do it.  I
think I can do some of it in a way that will suck less compared to how
we're (LXC is) doing it now.  I'm just not so sure how comprehensive the
solution will be or how well it will work.

I've still got several other takeaways from that session to put a bow on
before really testing this idea further.  I really have not fully
fleshed this idea out and it's going to take me some time.  There may
also be some other corner cases I haven't considered.  And then there's
Android.  Sigh...

And maybe I'm just totally off base and crazy.  Wouldn't be the first
time, won't be the last time.

> greg k-h

Regards,
Mike
--
Michael H. Warfield (AI4NB) | (770) 985-6132 |  [email protected]
/\/\|=mhw=|\/\/ | (678) 463-0932 |  http://www.wittsend.com/mhw/
NIC whois: MHW9 | An optimist believes we live in the best of all
PGP Key: 0x674627FF | possible worlds.  A pessimist is sure of it!

_______________________________________________
Toybox mailing list
[email protected]
http://lists.landley.net/listinfo.cgi/toybox-landley.net
