On Tue, 03.02.15 16:34, Serge Hallyn (serge.hal...@ubuntu.com) wrote:

> > > the UID/GID on entire filesystem sub-trees given to containers with
> > > userns is a real unpleasant thing to have to deal with. I'd not want
>
> Of course you would *not* want to take a stock rootfs where uid == 0
> and shift that into the container, as that would give root in the
> container a chance to write root-owned files on the host to leverage
> later in a convoluted attack :)
Is this really a problem? I mean, the only way this could be exploited
is if people make the container hierarchy accessible to other users,
but that is easy to prohibit by making the container's parent directory
0700, which we already do for nspawn's containers in
/var/lib/machines...

The only other risk I can see here is that if people use traditional
ext4 quota, the container's disk usage will be added to the host's
usage. But that's easy to avoid, by simply never placing container
images and the host on the same quota device...

Also, in the case of systemd-nspawn we strongly emphasize usage with
loopback devices. In that case there's no vulnerability at all, since
the device is completely separate from the host fs, and it will only be
mounted in the container, not on the host...

> We might want to come up with a containers consensus that container
> rootfs's are always shipped with uid range 0-65535 -> 100000-165535.
> That still leaves a chance for container A (mapped to 200000-265535)
> to write a valid setuid-root binary for container B (mapped to
> 300000-365535), which isn't possible otherwise. But that's better
> than doing so for host-root.

Well, ultimately I'd recommend an automatism like this for container
managers:

a) if not otherwise configured, let's give each container its own 16bit
   of UIDs. This means each 32bit UID can be neatly split into the
   upper 16bit, which become a "container" ID, plus the lower 16bit for
   the actual "virtual" UID.

b) we never set up UID ranges orthogonal to GID ranges.

c) when a container image is started, the container manager first
   checks the UID/GID owner of the root of the root file system. It
   masks the lower 16bit away, and only looks at the upper 16bit.

d) it then looks for an unused container ID (which means an unused
   range of 64K UIDs), and shifts the offset it identified in c) to
   this new container ID.
With that in place it doesn't really matter which base people use in
their containers; the container manager would do the right thing, and
shift everything into the right place. Paranoid people could ship their
container images shifted to some ID of their choice, and lazy folks
could just ship their container images with base 0, but then must make
sure they don't give anybody else access to the hierarchy, and don't
confuse quota...

> > > [1] Using a separate disk image per container means a container can't
> > > DOS other containers by exhausting inodes for example with $millions
> > > of small files.
> >
> > Indeed. I'd claim that without such a concept of mount point uid
> > shifting the whole userns story is not very useful IRL...
>
> I had always thought this would eventually be done using a stackable
> filesystem, but doing it at bind mount time would be neat too, and
> less objectionable to the kernel folks. (Though overlayfs is in now,
> so <shrug>)
>
> I'm actually quite surprised no one has sat down and written a
> stackable uid-shifting fs yet.

Whether it's done as part of bind mounts, as an extension of overlayfs,
or in a completely new fs doesn't really matter to me. I'd certainly
welcome a solution based on any of these options!

Lennart

--
Lennart Poettering, Red Hat
_______________________________________________
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel