Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
On Wed, Oct 08, 2014 at 12:23:52PM -0700, Eric W. Biederman wrote: > Andy Lutomirski writes: > > > On Wed, Oct 8, 2014 at 4:08 AM, Andrew Vagin wrote: > >> On Tue, Oct 07, 2014 at 01:45:22PM -0700, Eric W. Biederman wrote: > >>> Andrey Vagin writes: > >>> > >>> > From: Andrey Vagin > >>> > > >>> > Currently when we create a new container with a separate root, > >>> > we need to clone the current mount namespace with all mounts and then > >>> > clean up it by using pivot_root(). A big part of mountpoints are cloned > >>> > only to be umounted. > >>> > >>> Is the motivation performance? Because if that is the motivation we > >>> need numbers. > >> > >> The major motivation to create a clean mount namespace which contains > >> only required mounts. > >> > >> Now you want to convince us that there is nothing wrong if we use > >> userns, because all inherited mounts are locked. My point is that all > >> useless mounts should be umounted. If the current root isn't on rootfs, > >> pivot_root() allows us to umount all useless points. But pivot_root() > >> doesn't work, if the current root is on rootfs. How can we umount > >> useless points in this case? > > One of your justifications for a new system call was so you could do > less. Doing less to get to where you want to go is only justified when > your doing less to get better performance. > > >> Maybe we want to say that rootfs should not be used if we are going to > >> create containers... > > Today it is an assumption of the vfs that rootfs is mounted. With > rootfs mounted and pivot_root at the base of the mount stack you can > make as minimal of a set of mounts as the vfs allows. You have misunderstood me. For most system /proc/self/mountinfo looks like this: [root@dhcp-10-30-23-214 ~]# cat /proc/self/mountinfo 17 22 0:3 / /proc rw,relatime - proc proc rw 18 22 0:0 / /sys rw,relatime - sysfs sysfs rw 19 22 0:5 / /dev rw,relatime - devtmpfs devtmpfs rw,size=502324k,nr_inodes=125581,mode=755 20 19 0:11 / /dev/pts rw,relatime - devpts devpts rw,gid=5,mode=620,ptmxmode=000 21 19 0:17 / /dev/shm rw,nosuid,nodev,noexec,relatime - tmpfs tmpfs rw 22 1 253:2 / / rw,relatime - ext4 /dev/vda2 rw,barrier=1,data=ordered 24 22 253:1 / /boot rw,relatime - ext3 /dev/vda1 rw,errors=continue,user_xattr,acl,barrier=1,data=ordered / isn't a rootfs mount here and pivot_root() works fine in this case. Here is no problem for such system. Now look at the second case: hell@android:/ $ cat /proc/self/mountinfo 1 1 0:1 / / ro,relatime - rootfs rootfs ro 11 1 0:11 / /dev rw,nosuid,relatime - tmpfs tmpfs rw,mode=755 12 11 0:9 / /dev/pts rw,relatime - devpts devpts rw,mode=600 13 1 0:3 / /proc rw,relatime - proc proc rw 14 1 0:12 / /sys rw,relatime - sysfs sysfs rw Now / is an rootfs mount. pivot_root() doesn't work in this case and we need to do some tricks to get a minimal set of mounts. Thanks, Andrew > > Removing rootfs from the vfs requires an audit of everything that > manipulates mounts. It is not remotely a local excercise. > > One of the things that needs to be considered is that if you really want > to audit mounts is the code that needs manipulates them needs to be > audited every bit as much as the mounts themselves. > > > Could we have an extra rootfs-like fs that is always completely empty, > > doesn't allow any writes, and can sit at the bottom of container > > namespace hierarchies? If so, and if we add a new syscall that's like > > pivot_root (or unshare) but prunes the hierarchy, then we could switch > > to that rootfs then. > > Or equally have something that guarantees that rootfs is empty and > read-only at the time the normal root filesystem is mounted. That is > certainly a much more localized change if we want to go there. > > I am half tempted to suggest that mount --move /some/path / be updated > to make the old / just go away (perhaps to be replaced with a read-only > empty rootfs). That gets us into figuring out if we break userspace > which is a big challenge. > > Eric -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
On Wed, Oct 08, 2014 at 12:23:52PM -0700, Eric W. Biederman wrote: Andy Lutomirski l...@amacapital.net writes: On Wed, Oct 8, 2014 at 4:08 AM, Andrew Vagin ava...@parallels.com wrote: On Tue, Oct 07, 2014 at 01:45:22PM -0700, Eric W. Biederman wrote: Andrey Vagin ava...@openvz.org writes: From: Andrey Vagin ava...@gmail.com Currently when we create a new container with a separate root, we need to clone the current mount namespace with all mounts and then clean up it by using pivot_root(). A big part of mountpoints are cloned only to be umounted. Is the motivation performance? Because if that is the motivation we need numbers. The major motivation to create a clean mount namespace which contains only required mounts. Now you want to convince us that there is nothing wrong if we use userns, because all inherited mounts are locked. My point is that all useless mounts should be umounted. If the current root isn't on rootfs, pivot_root() allows us to umount all useless points. But pivot_root() doesn't work, if the current root is on rootfs. How can we umount useless points in this case? One of your justifications for a new system call was so you could do less. Doing less to get to where you want to go is only justified when your doing less to get better performance. Maybe we want to say that rootfs should not be used if we are going to create containers... Today it is an assumption of the vfs that rootfs is mounted. With rootfs mounted and pivot_root at the base of the mount stack you can make as minimal of a set of mounts as the vfs allows. You have misunderstood me. For most system /proc/self/mountinfo looks like this: [root@dhcp-10-30-23-214 ~]# cat /proc/self/mountinfo 17 22 0:3 / /proc rw,relatime - proc proc rw 18 22 0:0 / /sys rw,relatime - sysfs sysfs rw 19 22 0:5 / /dev rw,relatime - devtmpfs devtmpfs rw,size=502324k,nr_inodes=125581,mode=755 20 19 0:11 / /dev/pts rw,relatime - devpts devpts rw,gid=5,mode=620,ptmxmode=000 21 19 0:17 / /dev/shm rw,nosuid,nodev,noexec,relatime - tmpfs tmpfs rw 22 1 253:2 / / rw,relatime - ext4 /dev/vda2 rw,barrier=1,data=ordered 24 22 253:1 / /boot rw,relatime - ext3 /dev/vda1 rw,errors=continue,user_xattr,acl,barrier=1,data=ordered / isn't a rootfs mount here and pivot_root() works fine in this case. Here is no problem for such system. Now look at the second case: hell@android:/ $ cat /proc/self/mountinfo 1 1 0:1 / / ro,relatime - rootfs rootfs ro 11 1 0:11 / /dev rw,nosuid,relatime - tmpfs tmpfs rw,mode=755 12 11 0:9 / /dev/pts rw,relatime - devpts devpts rw,mode=600 13 1 0:3 / /proc rw,relatime - proc proc rw 14 1 0:12 / /sys rw,relatime - sysfs sysfs rw Now / is an rootfs mount. pivot_root() doesn't work in this case and we need to do some tricks to get a minimal set of mounts. Thanks, Andrew Removing rootfs from the vfs requires an audit of everything that manipulates mounts. It is not remotely a local excercise. One of the things that needs to be considered is that if you really want to audit mounts is the code that needs manipulates them needs to be audited every bit as much as the mounts themselves. Could we have an extra rootfs-like fs that is always completely empty, doesn't allow any writes, and can sit at the bottom of container namespace hierarchies? If so, and if we add a new syscall that's like pivot_root (or unshare) but prunes the hierarchy, then we could switch to that rootfs then. Or equally have something that guarantees that rootfs is empty and read-only at the time the normal root filesystem is mounted. That is certainly a much more localized change if we want to go there. I am half tempted to suggest that mount --move /some/path / be updated to make the old / just go away (perhaps to be replaced with a read-only empty rootfs). That gets us into figuring out if we break userspace which is a big challenge. Eric -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
On Wed, Oct 8, 2014 at 4:38 PM, Serge Hallyn wrote: > Quoting Andy Lutomirski (l...@amacapital.net): >> On Wed, Oct 8, 2014 at 2:36 PM, Rob Landley wrote: >> > On 10/08/14 14:31, Andy Lutomirski wrote: >> >> On Wed, Oct 8, 2014 at 12:23 PM, Eric W. Biederman >> >> wrote: >> >>> Andy Lutomirski writes: >> > Maybe we want to say that rootfs should not be used if we are going to >> > create containers... >> >>> >> >>> Today it is an assumption of the vfs that rootfs is mounted. With >> >>> rootfs mounted and pivot_root at the base of the mount stack you can >> >>> make as minimal of a set of mounts as the vfs allows. >> >>> >> >>> Removing rootfs from the vfs requires an audit of everything that >> >>> manipulates mounts. It is not remotely a local excercise. >> >> >> >> Would it be a less invasive audit to allow different mount namespaces >> >> to have different rootfses? >> > >> > I.E. The same way different namespaces have different init tasks? >> > >> > The abstraction containers has implemented here should be logically >> > consistent. >> > >> Could we have an extra rootfs-like fs that is always completely empty, >> doesn't allow any writes, and can sit at the bottom of container >> namespace hierarchies? If so, and if we add a new syscall that's like >> pivot_root (or unshare) but prunes the hierarchy, then we could switch >> to that rootfs then. >> >>> >> >>> Or equally have something that guarantees that rootfs is empty and >> >>> read-only at the time the normal root filesystem is mounted. That is >> >>> certainly a much more localized change if we want to go there. >> >>> >> >>> I am half tempted to suggest that mount --move /some/path / be updated >> >>> to make the old / just go away (perhaps to be replaced with a read-only >> >>> empty rootfs). That gets us into figuring out if we break userspace >> >>> which is a big challenge. >> >> >> >> Hence my argument for a new syscall or entirely new operation. >> > >> > I'm still waiting for somebody to explain to my why chroot() shouldn't >> > be changed to do this instead of adding a new syscall. (At least when >> > mount namespace support is enabled.) >> >> Because chroot has no effect on the namespace at all. If you fork and >> the child chroots, the parent isn't chrooted. And, more importantly >> for my example, is a process has it's cwd as /foo, and then it forks >> and the child chroots, then parent's ".." isn't changed as a result of >> the chroot. >> >> > >> >> mount(2) and friends are way too multiplexed right now. I just found >> >> yet another security bug due to the insanely complicated semantics of >> >> the vfs syscalls. (Yes, a different one from the one yesterday.) >> > >> > As the guy who rewrote busybox mount 3 times, and who just implemented a >> > brand new one (toybox) from scratch: >> > >> > It's a bit fiddly, yes. >> > >> >> A new operation kills several birds with one stone. It could look like: >> >> >> >> int mntns_change_root(int dfd, const char *path, int flags); >> >> >> >> return -EPERM if chrooted. >> > >> > Really? >> >> Now that CVE-2014-7970 is public: what the heck is pivot_root supposed >> to do if the caller is chrooted? The current behavior is obviously >> incorrect (it leaks memory), but it's not entirely clear to me what >> should happen. I think it should either be disallowed or should have >> well-defined semantics. >> >> For simplicity, if a new syscall for this is added, then I think that >> the caller-is-chrooted case should be disallowed. If someone needs it >> and can articulate what the semantics should be, then I have no >> problem with allowing it going forward. > > It's not that I'd have a need for that, but rather if for some > reason I started out chrooted due to some bogus initramfs, I'd > prefer to not have to feel like a criminial and escape the chroot > first. You already can't create a userns if you're chrooted (even if you have global privilege). --Andy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
Quoting Andy Lutomirski (l...@amacapital.net): > On Wed, Oct 8, 2014 at 2:36 PM, Rob Landley wrote: > > On 10/08/14 14:31, Andy Lutomirski wrote: > >> On Wed, Oct 8, 2014 at 12:23 PM, Eric W. Biederman > >> wrote: > >>> Andy Lutomirski writes: > > Maybe we want to say that rootfs should not be used if we are going to > > create containers... > >>> > >>> Today it is an assumption of the vfs that rootfs is mounted. With > >>> rootfs mounted and pivot_root at the base of the mount stack you can > >>> make as minimal of a set of mounts as the vfs allows. > >>> > >>> Removing rootfs from the vfs requires an audit of everything that > >>> manipulates mounts. It is not remotely a local excercise. > >> > >> Would it be a less invasive audit to allow different mount namespaces > >> to have different rootfses? > > > > I.E. The same way different namespaces have different init tasks? > > > > The abstraction containers has implemented here should be logically > > consistent. > > > Could we have an extra rootfs-like fs that is always completely empty, > doesn't allow any writes, and can sit at the bottom of container > namespace hierarchies? If so, and if we add a new syscall that's like > pivot_root (or unshare) but prunes the hierarchy, then we could switch > to that rootfs then. > >>> > >>> Or equally have something that guarantees that rootfs is empty and > >>> read-only at the time the normal root filesystem is mounted. That is > >>> certainly a much more localized change if we want to go there. > >>> > >>> I am half tempted to suggest that mount --move /some/path / be updated > >>> to make the old / just go away (perhaps to be replaced with a read-only > >>> empty rootfs). That gets us into figuring out if we break userspace > >>> which is a big challenge. > >> > >> Hence my argument for a new syscall or entirely new operation. > > > > I'm still waiting for somebody to explain to my why chroot() shouldn't > > be changed to do this instead of adding a new syscall. (At least when > > mount namespace support is enabled.) > > Because chroot has no effect on the namespace at all. If you fork and > the child chroots, the parent isn't chrooted. And, more importantly > for my example, is a process has it's cwd as /foo, and then it forks > and the child chroots, then parent's ".." isn't changed as a result of > the chroot. > > > > >> mount(2) and friends are way too multiplexed right now. I just found > >> yet another security bug due to the insanely complicated semantics of > >> the vfs syscalls. (Yes, a different one from the one yesterday.) > > > > As the guy who rewrote busybox mount 3 times, and who just implemented a > > brand new one (toybox) from scratch: > > > > It's a bit fiddly, yes. > > > >> A new operation kills several birds with one stone. It could look like: > >> > >> int mntns_change_root(int dfd, const char *path, int flags); > >> > >> return -EPERM if chrooted. > > > > Really? > > Now that CVE-2014-7970 is public: what the heck is pivot_root supposed > to do if the caller is chrooted? The current behavior is obviously > incorrect (it leaks memory), but it's not entirely clear to me what > should happen. I think it should either be disallowed or should have > well-defined semantics. > > For simplicity, if a new syscall for this is added, then I think that > the caller-is-chrooted case should be disallowed. If someone needs it > and can articulate what the semantics should be, then I have no > problem with allowing it going forward. It's not that I'd have a need for that, but rather if for some reason I started out chrooted due to some bogus initramfs, I'd prefer to not have to feel like a criminial and escape the chroot first. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
On 10/08/14 14:31, Andy Lutomirski wrote: > On Wed, Oct 8, 2014 at 12:23 PM, Eric W. Biederman > wrote: >> Andy Lutomirski writes: Maybe we want to say that rootfs should not be used if we are going to create containers... >> >> Today it is an assumption of the vfs that rootfs is mounted. With >> rootfs mounted and pivot_root at the base of the mount stack you can >> make as minimal of a set of mounts as the vfs allows. >> >> Removing rootfs from the vfs requires an audit of everything that >> manipulates mounts. It is not remotely a local excercise. > > Would it be a less invasive audit to allow different mount namespaces > to have different rootfses? I.E. The same way different namespaces have different init tasks? The abstraction containers has implemented here should be logically consistent. >>> Could we have an extra rootfs-like fs that is always completely empty, >>> doesn't allow any writes, and can sit at the bottom of container >>> namespace hierarchies? If so, and if we add a new syscall that's like >>> pivot_root (or unshare) but prunes the hierarchy, then we could switch >>> to that rootfs then. >> >> Or equally have something that guarantees that rootfs is empty and >> read-only at the time the normal root filesystem is mounted. That is >> certainly a much more localized change if we want to go there. >> >> I am half tempted to suggest that mount --move /some/path / be updated >> to make the old / just go away (perhaps to be replaced with a read-only >> empty rootfs). That gets us into figuring out if we break userspace >> which is a big challenge. > > Hence my argument for a new syscall or entirely new operation. I'm still waiting for somebody to explain to my why chroot() shouldn't be changed to do this instead of adding a new syscall. (At least when mount namespace support is enabled.) > mount(2) and friends are way too multiplexed right now. I just found > yet another security bug due to the insanely complicated semantics of > the vfs syscalls. (Yes, a different one from the one yesterday.) As the guy who rewrote busybox mount 3 times, and who just implemented a brand new one (toybox) from scratch: It's a bit fiddly, yes. > A new operation kills several birds with one stone. It could look like: > > int mntns_change_root(int dfd, const char *path, int flags); > > return -EPERM if chrooted. Really? > Returns -EINVAL if path (relative to dfd) isn't a mountmount. Requiring that chroot() only be called on mountpoints would break existing semantics, which gets us back to new systemcall instead of changing behavior of existing one. If I recall, the first line of pushback against merging the openvz code as is was "buckets of new syscalls". Pushback against adding a new system call is understandable. Why can't we fix chroot() now that we have the tools to do so? > Otherwise it disconnects path from the existing > hierarchy, attaches a permanently-empty read-only rootfs under it, > makes it the root of the mntns, and does the root refs fixup. The old > hierarchy gets thrown out. We have a chroot() syscall. We don't use it for containers because it doesn't do what we want. Does it currently do what _anybody_ wants? > Systemd could use this, too. While that's a strong argument against it, I'm willing to overlook it. Rob -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
On Wed, Oct 8, 2014 at 2:36 PM, Rob Landley wrote: > On 10/08/14 14:31, Andy Lutomirski wrote: >> On Wed, Oct 8, 2014 at 12:23 PM, Eric W. Biederman >> wrote: >>> Andy Lutomirski writes: > Maybe we want to say that rootfs should not be used if we are going to > create containers... >>> >>> Today it is an assumption of the vfs that rootfs is mounted. With >>> rootfs mounted and pivot_root at the base of the mount stack you can >>> make as minimal of a set of mounts as the vfs allows. >>> >>> Removing rootfs from the vfs requires an audit of everything that >>> manipulates mounts. It is not remotely a local excercise. >> >> Would it be a less invasive audit to allow different mount namespaces >> to have different rootfses? > > I.E. The same way different namespaces have different init tasks? > > The abstraction containers has implemented here should be logically > consistent. > Could we have an extra rootfs-like fs that is always completely empty, doesn't allow any writes, and can sit at the bottom of container namespace hierarchies? If so, and if we add a new syscall that's like pivot_root (or unshare) but prunes the hierarchy, then we could switch to that rootfs then. >>> >>> Or equally have something that guarantees that rootfs is empty and >>> read-only at the time the normal root filesystem is mounted. That is >>> certainly a much more localized change if we want to go there. >>> >>> I am half tempted to suggest that mount --move /some/path / be updated >>> to make the old / just go away (perhaps to be replaced with a read-only >>> empty rootfs). That gets us into figuring out if we break userspace >>> which is a big challenge. >> >> Hence my argument for a new syscall or entirely new operation. > > I'm still waiting for somebody to explain to my why chroot() shouldn't > be changed to do this instead of adding a new syscall. (At least when > mount namespace support is enabled.) Because chroot has no effect on the namespace at all. If you fork and the child chroots, the parent isn't chrooted. And, more importantly for my example, is a process has it's cwd as /foo, and then it forks and the child chroots, then parent's ".." isn't changed as a result of the chroot. > >> mount(2) and friends are way too multiplexed right now. I just found >> yet another security bug due to the insanely complicated semantics of >> the vfs syscalls. (Yes, a different one from the one yesterday.) > > As the guy who rewrote busybox mount 3 times, and who just implemented a > brand new one (toybox) from scratch: > > It's a bit fiddly, yes. > >> A new operation kills several birds with one stone. It could look like: >> >> int mntns_change_root(int dfd, const char *path, int flags); >> >> return -EPERM if chrooted. > > Really? Now that CVE-2014-7970 is public: what the heck is pivot_root supposed to do if the caller is chrooted? The current behavior is obviously incorrect (it leaks memory), but it's not entirely clear to me what should happen. I think it should either be disallowed or should have well-defined semantics. For simplicity, if a new syscall for this is added, then I think that the caller-is-chrooted case should be disallowed. If someone needs it and can articulate what the semantics should be, then I have no problem with allowing it going forward. > >> Returns -EINVAL if path (relative to dfd) isn't a mountmount. > > Requiring that chroot() only be called on mountpoints would break > existing semantics, which gets us back to new systemcall instead of > changing behavior of existing one. But we're talking about a pivot_root replacement. You can always create a bindmount. Alternatively, the syscall could create a fresh bind-mount and reattach all children to it if needed. > > If I recall, the first line of pushback against merging the openvz code > as is was "buckets of new syscalls". Pushback against adding a new > system call is understandable. Why can't we fix chroot() now that we > have the tools to do so? Because chroot already does something else. In particularly, the new "root"'s parents are *still there*, which is why it's so easy to escape from a chroot. Sensible container systems (and initramfs code!) wants to clean up all the mounts that should be unreachable. > >> Otherwise it disconnects path from the existing >> hierarchy, attaches a permanently-empty read-only rootfs under it, >> makes it the root of the mntns, and does the root refs fixup. The old >> hierarchy gets thrown out. > > We have a chroot() syscall. We don't use it for containers because it > doesn't do what we want. Does it currently do what _anybody_ wants? Irrelevant. It's POSIX and we'll break all kinds of things if we change it. > >> Systemd could use this, too. > > While that's a strong argument against it, I'm willing to overlook it. :) Feel free to read that as "an initramfs /init that prepares to hand off to /sbin/init could use this." busybox's switch_root could do
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
On 10/08/14 14:23, Eric W. Biederman wrote: >> Could we have an extra rootfs-like fs that is always completely empty, >> doesn't allow any writes, and can sit at the bottom of container >> namespace hierarchies? If so, and if we add a new syscall that's like >> pivot_root (or unshare) but prunes the hierarchy, then we could switch >> to that rootfs then. > > Or equally have something that guarantees that rootfs is empty and > read-only at the time the normal root filesystem is mounted. That is > certainly a much more localized change if we want to go there. What do you mean "normal" root filesystem? It is entirely possible (and in fact common in the embedded world) to run from rootfs. I pushed my old inittmpfs patches (at the request of cray) last year because being able to take down the system with "cat /dev/zero > /blah" (as rootfs allows and tmpfs doesn't) is a bad thing. Rootfs is about as special as PID 1 is. We don't filter out PID 1 from "ps" to avoid confusing people, but for some reason whoever did /proc/$PID/mountinfo decided that rootfs shouldn't show up because magic magic specialness. We show /run, which is a tmpfs instance. If I mount two different filesystems on top of each other on /mnt, it shows both. (Overmounts were not invented by rootfs.) But no, mountinfo filters out rootfs because magic magic specialness. It makes me sad that this kind of special-case thinking is allowed in the kernel. > I am half tempted to suggest that mount --move /some/path / be updated > to make the old / just go away (perhaps to be replaced with a read-only > empty rootfs). That gets us into figuring out if we break userspace > which is a big challenge. My concern was that chroot() moving a magic "/" pointer that you can trivially escape from with x=open("."); chroot("sub"); fdchdir("."); chdir("../../../../../../../../.."); is having extra code in the kernel to do it _wrong_. We have per-process namespaces now. We can actually adjust the mount tree (inserting a new bind mount if the directory we're changing to is not already a mount point). If a per-process namespace needs to be anchored by a tmpfs, fine. But requiring that to be teh SAME instance globally for the entire system is not what containers is _about_. It's not true for PID 1 and it shouldn't be true for rootfs. By all means, if a filesystem is no longer accessable in a namespace, decrement its reference count. (Keeping in mind that a bind mount should count as a reference, and rootfs should always have a nonzero reference count.) But "/" is not special in this regard. If you want to make all overmounts vanish (which seems like a really bad idea and breaks 40 years of unix semantics), argue for that. Please stop treating rootfs like it isn't potentialy usable as a full-fledged filesystem. (Pet peeve of mine.) > Eric Rob -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
On Wed, Oct 8, 2014 at 12:23 PM, Eric W. Biederman wrote: > Andy Lutomirski writes: > >> On Wed, Oct 8, 2014 at 4:08 AM, Andrew Vagin wrote: >>> On Tue, Oct 07, 2014 at 01:45:22PM -0700, Eric W. Biederman wrote: Andrey Vagin writes: > From: Andrey Vagin > > Currently when we create a new container with a separate root, > we need to clone the current mount namespace with all mounts and then > clean up it by using pivot_root(). A big part of mountpoints are cloned > only to be umounted. Is the motivation performance? Because if that is the motivation we need numbers. >>> >>> The major motivation to create a clean mount namespace which contains >>> only required mounts. >>> >>> Now you want to convince us that there is nothing wrong if we use >>> userns, because all inherited mounts are locked. My point is that all >>> useless mounts should be umounted. If the current root isn't on rootfs, >>> pivot_root() allows us to umount all useless points. But pivot_root() >>> doesn't work, if the current root is on rootfs. How can we umount >>> useless points in this case? > > One of your justifications for a new system call was so you could do > less. Doing less to get to where you want to go is only justified when > your doing less to get better performance. > > It sounds like your actual concern is about sandboxing and security > audits. That is a very legitimate concern. That isn't however the core > concern of containers, so it was not clear that is what you meant. > >>> Maybe we want to say that rootfs should not be used if we are going to >>> create containers... > > Today it is an assumption of the vfs that rootfs is mounted. With > rootfs mounted and pivot_root at the base of the mount stack you can > make as minimal of a set of mounts as the vfs allows. > > Removing rootfs from the vfs requires an audit of everything that > manipulates mounts. It is not remotely a local excercise. Would it be a less invasive audit to allow different mount namespaces to have different rootfses? > > One of the things that needs to be considered is that if you really want > to audit mounts is the code that needs manipulates them needs to be > audited every bit as much as the mounts themselves. > >> Could we have an extra rootfs-like fs that is always completely empty, >> doesn't allow any writes, and can sit at the bottom of container >> namespace hierarchies? If so, and if we add a new syscall that's like >> pivot_root (or unshare) but prunes the hierarchy, then we could switch >> to that rootfs then. > > Or equally have something that guarantees that rootfs is empty and > read-only at the time the normal root filesystem is mounted. That is > certainly a much more localized change if we want to go there. > > I am half tempted to suggest that mount --move /some/path / be updated > to make the old / just go away (perhaps to be replaced with a read-only > empty rootfs). That gets us into figuring out if we break userspace > which is a big challenge. Hence my argument for a new syscall or entirely new operation. mount(2) and friends are way too multiplexed right now. I just found yet another security bug due to the insanely complicated semantics of the vfs syscalls. (Yes, a different one from the one yesterday.) A new operation kills several birds with one stone. It could look like: int mntns_change_root(int dfd, const char *path, int flags); return -EPERM if chrooted. Returns -EINVAL if path (relative to dfd) isn't a mountmount. Otherwise it disconnects path from the existing hierarchy, attaches a permanently-empty read-only rootfs under it, makes it the root of the mntns, and does the root refs fixup. The old hierarchy gets thrown out. Benefits: - Plausibly slightly faster. (Especially if we add the trivial optimization that, if the caller holds the only reference to the mntns, then changing root refs can avoid the big loop, but maybe pivot_root could do that, too.) - Sidesteps the whole rootfs mess. - Much easier to use than pivot_root. - No userspace breakage. Heck, it could be even simpler: it could just unconditionally skip the fs struct walk. Just require callers to make sure that everyone else chroots into the new root. Systemd could use this, too. If it wants to keep a reference to the initramfs, it can use a file descriptor. --Andy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
Andy Lutomirski writes: > On Wed, Oct 8, 2014 at 4:08 AM, Andrew Vagin wrote: >> On Tue, Oct 07, 2014 at 01:45:22PM -0700, Eric W. Biederman wrote: >>> Andrey Vagin writes: >>> >>> > From: Andrey Vagin >>> > >>> > Currently when we create a new container with a separate root, >>> > we need to clone the current mount namespace with all mounts and then >>> > clean up it by using pivot_root(). A big part of mountpoints are cloned >>> > only to be umounted. >>> >>> Is the motivation performance? Because if that is the motivation we >>> need numbers. >> >> The major motivation to create a clean mount namespace which contains >> only required mounts. >> >> Now you want to convince us that there is nothing wrong if we use >> userns, because all inherited mounts are locked. My point is that all >> useless mounts should be umounted. If the current root isn't on rootfs, >> pivot_root() allows us to umount all useless points. But pivot_root() >> doesn't work, if the current root is on rootfs. How can we umount >> useless points in this case? One of your justifications for a new system call was so you could do less. Doing less to get to where you want to go is only justified when your doing less to get better performance. It sounds like your actual concern is about sandboxing and security audits. That is a very legitimate concern. That isn't however the core concern of containers, so it was not clear that is what you meant. >> Maybe we want to say that rootfs should not be used if we are going to >> create containers... Today it is an assumption of the vfs that rootfs is mounted. With rootfs mounted and pivot_root at the base of the mount stack you can make as minimal of a set of mounts as the vfs allows. Removing rootfs from the vfs requires an audit of everything that manipulates mounts. It is not remotely a local excercise. One of the things that needs to be considered is that if you really want to audit mounts is the code that needs manipulates them needs to be audited every bit as much as the mounts themselves. > Could we have an extra rootfs-like fs that is always completely empty, > doesn't allow any writes, and can sit at the bottom of container > namespace hierarchies? If so, and if we add a new syscall that's like > pivot_root (or unshare) but prunes the hierarchy, then we could switch > to that rootfs then. Or equally have something that guarantees that rootfs is empty and read-only at the time the normal root filesystem is mounted. That is certainly a much more localized change if we want to go there. I am half tempted to suggest that mount --move /some/path / be updated to make the old / just go away (perhaps to be replaced with a read-only empty rootfs). That gets us into figuring out if we break userspace which is a big challenge. Eric -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
On Wed, Oct 8, 2014 at 4:08 AM, Andrew Vagin wrote: > On Tue, Oct 07, 2014 at 01:45:22PM -0700, Eric W. Biederman wrote: >> Andrey Vagin writes: >> >> > From: Andrey Vagin >> > >> > Currently when we create a new container with a separate root, >> > we need to clone the current mount namespace with all mounts and then >> > clean up it by using pivot_root(). A big part of mountpoints are cloned >> > only to be umounted. >> >> Is the motivation performance? Because if that is the motivation we >> need numbers. > > The major motivation to create a clean mount namespace which contains > only required mounts. > > Now you want to convince us that there is nothing wrong if we use > userns, because all inherited mounts are locked. My point is that all > useless mounts should be umounted. If the current root isn't on rootfs, > pivot_root() allows us to umount all useless points. But pivot_root() > doesn't work, if the current root is on rootfs. How can we umount > useless points in this case? > > Maybe we want to say that rootfs should not be used if we are going to > create containers... > Could we have an extra rootfs-like fs that is always completely empty, doesn't allow any writes, and can sit at the bottom of container namespace hierarchies? If so, and if we add a new syscall that's like pivot_root (or unshare) but prunes the hierarchy, then we could switch to that rootfs then. > Thanks, > Andrew > > -- > To unsubscribe from this list: send the line "unsubscribe linux-api" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Andy Lutomirski AMA Capital Management, LLC -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
On Tue, Oct 07, 2014 at 01:45:22PM -0700, Eric W. Biederman wrote: > Andrey Vagin writes: > > > From: Andrey Vagin > > > > Currently when we create a new container with a separate root, > > we need to clone the current mount namespace with all mounts and then > > clean up it by using pivot_root(). A big part of mountpoints are cloned > > only to be umounted. > > Is the motivation performance? Because if that is the motivation we > need numbers. The major motivation to create a clean mount namespace which contains only required mounts. Now you want to convince us that there is nothing wrong if we use userns, because all inherited mounts are locked. My point is that all useless mounts should be umounted. If the current root isn't on rootfs, pivot_root() allows us to umount all useless points. But pivot_root() doesn't work, if the current root is on rootfs. How can we umount useless points in this case? Maybe we want to say that rootfs should not be used if we are going to create containers... Thanks, Andrew -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
On Tue, Oct 07, 2014 at 01:45:22PM -0700, Eric W. Biederman wrote: Andrey Vagin ava...@openvz.org writes: From: Andrey Vagin ava...@gmail.com Currently when we create a new container with a separate root, we need to clone the current mount namespace with all mounts and then clean up it by using pivot_root(). A big part of mountpoints are cloned only to be umounted. Is the motivation performance? Because if that is the motivation we need numbers. The major motivation to create a clean mount namespace which contains only required mounts. Now you want to convince us that there is nothing wrong if we use userns, because all inherited mounts are locked. My point is that all useless mounts should be umounted. If the current root isn't on rootfs, pivot_root() allows us to umount all useless points. But pivot_root() doesn't work, if the current root is on rootfs. How can we umount useless points in this case? Maybe we want to say that rootfs should not be used if we are going to create containers... Thanks, Andrew -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
On Wed, Oct 8, 2014 at 4:08 AM, Andrew Vagin ava...@parallels.com wrote: On Tue, Oct 07, 2014 at 01:45:22PM -0700, Eric W. Biederman wrote: Andrey Vagin ava...@openvz.org writes: From: Andrey Vagin ava...@gmail.com Currently when we create a new container with a separate root, we need to clone the current mount namespace with all mounts and then clean up it by using pivot_root(). A big part of mountpoints are cloned only to be umounted. Is the motivation performance? Because if that is the motivation we need numbers. The major motivation to create a clean mount namespace which contains only required mounts. Now you want to convince us that there is nothing wrong if we use userns, because all inherited mounts are locked. My point is that all useless mounts should be umounted. If the current root isn't on rootfs, pivot_root() allows us to umount all useless points. But pivot_root() doesn't work, if the current root is on rootfs. How can we umount useless points in this case? Maybe we want to say that rootfs should not be used if we are going to create containers... Could we have an extra rootfs-like fs that is always completely empty, doesn't allow any writes, and can sit at the bottom of container namespace hierarchies? If so, and if we add a new syscall that's like pivot_root (or unshare) but prunes the hierarchy, then we could switch to that rootfs then. Thanks, Andrew -- To unsubscribe from this list: send the line unsubscribe linux-api in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Andy Lutomirski AMA Capital Management, LLC -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
Andy Lutomirski l...@amacapital.net writes: On Wed, Oct 8, 2014 at 4:08 AM, Andrew Vagin ava...@parallels.com wrote: On Tue, Oct 07, 2014 at 01:45:22PM -0700, Eric W. Biederman wrote: Andrey Vagin ava...@openvz.org writes: From: Andrey Vagin ava...@gmail.com Currently when we create a new container with a separate root, we need to clone the current mount namespace with all mounts and then clean up it by using pivot_root(). A big part of mountpoints are cloned only to be umounted. Is the motivation performance? Because if that is the motivation we need numbers. The major motivation to create a clean mount namespace which contains only required mounts. Now you want to convince us that there is nothing wrong if we use userns, because all inherited mounts are locked. My point is that all useless mounts should be umounted. If the current root isn't on rootfs, pivot_root() allows us to umount all useless points. But pivot_root() doesn't work, if the current root is on rootfs. How can we umount useless points in this case? One of your justifications for a new system call was so you could do less. Doing less to get to where you want to go is only justified when your doing less to get better performance. It sounds like your actual concern is about sandboxing and security audits. That is a very legitimate concern. That isn't however the core concern of containers, so it was not clear that is what you meant. Maybe we want to say that rootfs should not be used if we are going to create containers... Today it is an assumption of the vfs that rootfs is mounted. With rootfs mounted and pivot_root at the base of the mount stack you can make as minimal of a set of mounts as the vfs allows. Removing rootfs from the vfs requires an audit of everything that manipulates mounts. It is not remotely a local excercise. One of the things that needs to be considered is that if you really want to audit mounts is the code that needs manipulates them needs to be audited every bit as much as the mounts themselves. Could we have an extra rootfs-like fs that is always completely empty, doesn't allow any writes, and can sit at the bottom of container namespace hierarchies? If so, and if we add a new syscall that's like pivot_root (or unshare) but prunes the hierarchy, then we could switch to that rootfs then. Or equally have something that guarantees that rootfs is empty and read-only at the time the normal root filesystem is mounted. That is certainly a much more localized change if we want to go there. I am half tempted to suggest that mount --move /some/path / be updated to make the old / just go away (perhaps to be replaced with a read-only empty rootfs). That gets us into figuring out if we break userspace which is a big challenge. Eric -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
On Wed, Oct 8, 2014 at 12:23 PM, Eric W. Biederman ebied...@xmission.com wrote: Andy Lutomirski l...@amacapital.net writes: On Wed, Oct 8, 2014 at 4:08 AM, Andrew Vagin ava...@parallels.com wrote: On Tue, Oct 07, 2014 at 01:45:22PM -0700, Eric W. Biederman wrote: Andrey Vagin ava...@openvz.org writes: From: Andrey Vagin ava...@gmail.com Currently when we create a new container with a separate root, we need to clone the current mount namespace with all mounts and then clean up it by using pivot_root(). A big part of mountpoints are cloned only to be umounted. Is the motivation performance? Because if that is the motivation we need numbers. The major motivation to create a clean mount namespace which contains only required mounts. Now you want to convince us that there is nothing wrong if we use userns, because all inherited mounts are locked. My point is that all useless mounts should be umounted. If the current root isn't on rootfs, pivot_root() allows us to umount all useless points. But pivot_root() doesn't work, if the current root is on rootfs. How can we umount useless points in this case? One of your justifications for a new system call was so you could do less. Doing less to get to where you want to go is only justified when your doing less to get better performance. It sounds like your actual concern is about sandboxing and security audits. That is a very legitimate concern. That isn't however the core concern of containers, so it was not clear that is what you meant. Maybe we want to say that rootfs should not be used if we are going to create containers... Today it is an assumption of the vfs that rootfs is mounted. With rootfs mounted and pivot_root at the base of the mount stack you can make as minimal of a set of mounts as the vfs allows. Removing rootfs from the vfs requires an audit of everything that manipulates mounts. It is not remotely a local excercise. Would it be a less invasive audit to allow different mount namespaces to have different rootfses? One of the things that needs to be considered is that if you really want to audit mounts is the code that needs manipulates them needs to be audited every bit as much as the mounts themselves. Could we have an extra rootfs-like fs that is always completely empty, doesn't allow any writes, and can sit at the bottom of container namespace hierarchies? If so, and if we add a new syscall that's like pivot_root (or unshare) but prunes the hierarchy, then we could switch to that rootfs then. Or equally have something that guarantees that rootfs is empty and read-only at the time the normal root filesystem is mounted. That is certainly a much more localized change if we want to go there. I am half tempted to suggest that mount --move /some/path / be updated to make the old / just go away (perhaps to be replaced with a read-only empty rootfs). That gets us into figuring out if we break userspace which is a big challenge. Hence my argument for a new syscall or entirely new operation. mount(2) and friends are way too multiplexed right now. I just found yet another security bug due to the insanely complicated semantics of the vfs syscalls. (Yes, a different one from the one yesterday.) A new operation kills several birds with one stone. It could look like: int mntns_change_root(int dfd, const char *path, int flags); return -EPERM if chrooted. Returns -EINVAL if path (relative to dfd) isn't a mountmount. Otherwise it disconnects path from the existing hierarchy, attaches a permanently-empty read-only rootfs under it, makes it the root of the mntns, and does the root refs fixup. The old hierarchy gets thrown out. Benefits: - Plausibly slightly faster. (Especially if we add the trivial optimization that, if the caller holds the only reference to the mntns, then changing root refs can avoid the big loop, but maybe pivot_root could do that, too.) - Sidesteps the whole rootfs mess. - Much easier to use than pivot_root. - No userspace breakage. Heck, it could be even simpler: it could just unconditionally skip the fs struct walk. Just require callers to make sure that everyone else chroots into the new root. Systemd could use this, too. If it wants to keep a reference to the initramfs, it can use a file descriptor. --Andy -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
On 10/08/14 14:23, Eric W. Biederman wrote: Could we have an extra rootfs-like fs that is always completely empty, doesn't allow any writes, and can sit at the bottom of container namespace hierarchies? If so, and if we add a new syscall that's like pivot_root (or unshare) but prunes the hierarchy, then we could switch to that rootfs then. Or equally have something that guarantees that rootfs is empty and read-only at the time the normal root filesystem is mounted. That is certainly a much more localized change if we want to go there. What do you mean normal root filesystem? It is entirely possible (and in fact common in the embedded world) to run from rootfs. I pushed my old inittmpfs patches (at the request of cray) last year because being able to take down the system with cat /dev/zero /blah (as rootfs allows and tmpfs doesn't) is a bad thing. Rootfs is about as special as PID 1 is. We don't filter out PID 1 from ps to avoid confusing people, but for some reason whoever did /proc/$PID/mountinfo decided that rootfs shouldn't show up because magic magic specialness. We show /run, which is a tmpfs instance. If I mount two different filesystems on top of each other on /mnt, it shows both. (Overmounts were not invented by rootfs.) But no, mountinfo filters out rootfs because magic magic specialness. It makes me sad that this kind of special-case thinking is allowed in the kernel. I am half tempted to suggest that mount --move /some/path / be updated to make the old / just go away (perhaps to be replaced with a read-only empty rootfs). That gets us into figuring out if we break userspace which is a big challenge. My concern was that chroot() moving a magic / pointer that you can trivially escape from with x=open(.); chroot(sub); fdchdir(.); chdir(../../../../../../../../..); is having extra code in the kernel to do it _wrong_. We have per-process namespaces now. We can actually adjust the mount tree (inserting a new bind mount if the directory we're changing to is not already a mount point). If a per-process namespace needs to be anchored by a tmpfs, fine. But requiring that to be teh SAME instance globally for the entire system is not what containers is _about_. It's not true for PID 1 and it shouldn't be true for rootfs. By all means, if a filesystem is no longer accessable in a namespace, decrement its reference count. (Keeping in mind that a bind mount should count as a reference, and rootfs should always have a nonzero reference count.) But / is not special in this regard. If you want to make all overmounts vanish (which seems like a really bad idea and breaks 40 years of unix semantics), argue for that. Please stop treating rootfs like it isn't potentialy usable as a full-fledged filesystem. (Pet peeve of mine.) Eric Rob -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
On Wed, Oct 8, 2014 at 2:36 PM, Rob Landley r...@landley.net wrote: On 10/08/14 14:31, Andy Lutomirski wrote: On Wed, Oct 8, 2014 at 12:23 PM, Eric W. Biederman ebied...@xmission.com wrote: Andy Lutomirski l...@amacapital.net writes: Maybe we want to say that rootfs should not be used if we are going to create containers... Today it is an assumption of the vfs that rootfs is mounted. With rootfs mounted and pivot_root at the base of the mount stack you can make as minimal of a set of mounts as the vfs allows. Removing rootfs from the vfs requires an audit of everything that manipulates mounts. It is not remotely a local excercise. Would it be a less invasive audit to allow different mount namespaces to have different rootfses? I.E. The same way different namespaces have different init tasks? The abstraction containers has implemented here should be logically consistent. Could we have an extra rootfs-like fs that is always completely empty, doesn't allow any writes, and can sit at the bottom of container namespace hierarchies? If so, and if we add a new syscall that's like pivot_root (or unshare) but prunes the hierarchy, then we could switch to that rootfs then. Or equally have something that guarantees that rootfs is empty and read-only at the time the normal root filesystem is mounted. That is certainly a much more localized change if we want to go there. I am half tempted to suggest that mount --move /some/path / be updated to make the old / just go away (perhaps to be replaced with a read-only empty rootfs). That gets us into figuring out if we break userspace which is a big challenge. Hence my argument for a new syscall or entirely new operation. I'm still waiting for somebody to explain to my why chroot() shouldn't be changed to do this instead of adding a new syscall. (At least when mount namespace support is enabled.) Because chroot has no effect on the namespace at all. If you fork and the child chroots, the parent isn't chrooted. And, more importantly for my example, is a process has it's cwd as /foo, and then it forks and the child chroots, then parent's .. isn't changed as a result of the chroot. mount(2) and friends are way too multiplexed right now. I just found yet another security bug due to the insanely complicated semantics of the vfs syscalls. (Yes, a different one from the one yesterday.) As the guy who rewrote busybox mount 3 times, and who just implemented a brand new one (toybox) from scratch: It's a bit fiddly, yes. A new operation kills several birds with one stone. It could look like: int mntns_change_root(int dfd, const char *path, int flags); return -EPERM if chrooted. Really? Now that CVE-2014-7970 is public: what the heck is pivot_root supposed to do if the caller is chrooted? The current behavior is obviously incorrect (it leaks memory), but it's not entirely clear to me what should happen. I think it should either be disallowed or should have well-defined semantics. For simplicity, if a new syscall for this is added, then I think that the caller-is-chrooted case should be disallowed. If someone needs it and can articulate what the semantics should be, then I have no problem with allowing it going forward. Returns -EINVAL if path (relative to dfd) isn't a mountmount. Requiring that chroot() only be called on mountpoints would break existing semantics, which gets us back to new systemcall instead of changing behavior of existing one. But we're talking about a pivot_root replacement. You can always create a bindmount. Alternatively, the syscall could create a fresh bind-mount and reattach all children to it if needed. If I recall, the first line of pushback against merging the openvz code as is was buckets of new syscalls. Pushback against adding a new system call is understandable. Why can't we fix chroot() now that we have the tools to do so? Because chroot already does something else. In particularly, the new root's parents are *still there*, which is why it's so easy to escape from a chroot. Sensible container systems (and initramfs code!) wants to clean up all the mounts that should be unreachable. Otherwise it disconnects path from the existing hierarchy, attaches a permanently-empty read-only rootfs under it, makes it the root of the mntns, and does the root refs fixup. The old hierarchy gets thrown out. We have a chroot() syscall. We don't use it for containers because it doesn't do what we want. Does it currently do what _anybody_ wants? Irrelevant. It's POSIX and we'll break all kinds of things if we change it. Systemd could use this, too. While that's a strong argument against it, I'm willing to overlook it. :) Feel free to read that as an initramfs /init that prepares to hand off to /sbin/init could use this. busybox's switch_root could do this, too, and even my virtme tool would indirectly benefit slightly. (virtme uses an init system that is a few
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
On 10/08/14 14:31, Andy Lutomirski wrote: On Wed, Oct 8, 2014 at 12:23 PM, Eric W. Biederman ebied...@xmission.com wrote: Andy Lutomirski l...@amacapital.net writes: Maybe we want to say that rootfs should not be used if we are going to create containers... Today it is an assumption of the vfs that rootfs is mounted. With rootfs mounted and pivot_root at the base of the mount stack you can make as minimal of a set of mounts as the vfs allows. Removing rootfs from the vfs requires an audit of everything that manipulates mounts. It is not remotely a local excercise. Would it be a less invasive audit to allow different mount namespaces to have different rootfses? I.E. The same way different namespaces have different init tasks? The abstraction containers has implemented here should be logically consistent. Could we have an extra rootfs-like fs that is always completely empty, doesn't allow any writes, and can sit at the bottom of container namespace hierarchies? If so, and if we add a new syscall that's like pivot_root (or unshare) but prunes the hierarchy, then we could switch to that rootfs then. Or equally have something that guarantees that rootfs is empty and read-only at the time the normal root filesystem is mounted. That is certainly a much more localized change if we want to go there. I am half tempted to suggest that mount --move /some/path / be updated to make the old / just go away (perhaps to be replaced with a read-only empty rootfs). That gets us into figuring out if we break userspace which is a big challenge. Hence my argument for a new syscall or entirely new operation. I'm still waiting for somebody to explain to my why chroot() shouldn't be changed to do this instead of adding a new syscall. (At least when mount namespace support is enabled.) mount(2) and friends are way too multiplexed right now. I just found yet another security bug due to the insanely complicated semantics of the vfs syscalls. (Yes, a different one from the one yesterday.) As the guy who rewrote busybox mount 3 times, and who just implemented a brand new one (toybox) from scratch: It's a bit fiddly, yes. A new operation kills several birds with one stone. It could look like: int mntns_change_root(int dfd, const char *path, int flags); return -EPERM if chrooted. Really? Returns -EINVAL if path (relative to dfd) isn't a mountmount. Requiring that chroot() only be called on mountpoints would break existing semantics, which gets us back to new systemcall instead of changing behavior of existing one. If I recall, the first line of pushback against merging the openvz code as is was buckets of new syscalls. Pushback against adding a new system call is understandable. Why can't we fix chroot() now that we have the tools to do so? Otherwise it disconnects path from the existing hierarchy, attaches a permanently-empty read-only rootfs under it, makes it the root of the mntns, and does the root refs fixup. The old hierarchy gets thrown out. We have a chroot() syscall. We don't use it for containers because it doesn't do what we want. Does it currently do what _anybody_ wants? Systemd could use this, too. While that's a strong argument against it, I'm willing to overlook it. Rob -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
Quoting Andy Lutomirski (l...@amacapital.net): On Wed, Oct 8, 2014 at 2:36 PM, Rob Landley r...@landley.net wrote: On 10/08/14 14:31, Andy Lutomirski wrote: On Wed, Oct 8, 2014 at 12:23 PM, Eric W. Biederman ebied...@xmission.com wrote: Andy Lutomirski l...@amacapital.net writes: Maybe we want to say that rootfs should not be used if we are going to create containers... Today it is an assumption of the vfs that rootfs is mounted. With rootfs mounted and pivot_root at the base of the mount stack you can make as minimal of a set of mounts as the vfs allows. Removing rootfs from the vfs requires an audit of everything that manipulates mounts. It is not remotely a local excercise. Would it be a less invasive audit to allow different mount namespaces to have different rootfses? I.E. The same way different namespaces have different init tasks? The abstraction containers has implemented here should be logically consistent. Could we have an extra rootfs-like fs that is always completely empty, doesn't allow any writes, and can sit at the bottom of container namespace hierarchies? If so, and if we add a new syscall that's like pivot_root (or unshare) but prunes the hierarchy, then we could switch to that rootfs then. Or equally have something that guarantees that rootfs is empty and read-only at the time the normal root filesystem is mounted. That is certainly a much more localized change if we want to go there. I am half tempted to suggest that mount --move /some/path / be updated to make the old / just go away (perhaps to be replaced with a read-only empty rootfs). That gets us into figuring out if we break userspace which is a big challenge. Hence my argument for a new syscall or entirely new operation. I'm still waiting for somebody to explain to my why chroot() shouldn't be changed to do this instead of adding a new syscall. (At least when mount namespace support is enabled.) Because chroot has no effect on the namespace at all. If you fork and the child chroots, the parent isn't chrooted. And, more importantly for my example, is a process has it's cwd as /foo, and then it forks and the child chroots, then parent's .. isn't changed as a result of the chroot. mount(2) and friends are way too multiplexed right now. I just found yet another security bug due to the insanely complicated semantics of the vfs syscalls. (Yes, a different one from the one yesterday.) As the guy who rewrote busybox mount 3 times, and who just implemented a brand new one (toybox) from scratch: It's a bit fiddly, yes. A new operation kills several birds with one stone. It could look like: int mntns_change_root(int dfd, const char *path, int flags); return -EPERM if chrooted. Really? Now that CVE-2014-7970 is public: what the heck is pivot_root supposed to do if the caller is chrooted? The current behavior is obviously incorrect (it leaks memory), but it's not entirely clear to me what should happen. I think it should either be disallowed or should have well-defined semantics. For simplicity, if a new syscall for this is added, then I think that the caller-is-chrooted case should be disallowed. If someone needs it and can articulate what the semantics should be, then I have no problem with allowing it going forward. It's not that I'd have a need for that, but rather if for some reason I started out chrooted due to some bogus initramfs, I'd prefer to not have to feel like a criminial and escape the chroot first. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
On Wed, Oct 8, 2014 at 4:38 PM, Serge Hallyn serge.hal...@ubuntu.com wrote: Quoting Andy Lutomirski (l...@amacapital.net): On Wed, Oct 8, 2014 at 2:36 PM, Rob Landley r...@landley.net wrote: On 10/08/14 14:31, Andy Lutomirski wrote: On Wed, Oct 8, 2014 at 12:23 PM, Eric W. Biederman ebied...@xmission.com wrote: Andy Lutomirski l...@amacapital.net writes: Maybe we want to say that rootfs should not be used if we are going to create containers... Today it is an assumption of the vfs that rootfs is mounted. With rootfs mounted and pivot_root at the base of the mount stack you can make as minimal of a set of mounts as the vfs allows. Removing rootfs from the vfs requires an audit of everything that manipulates mounts. It is not remotely a local excercise. Would it be a less invasive audit to allow different mount namespaces to have different rootfses? I.E. The same way different namespaces have different init tasks? The abstraction containers has implemented here should be logically consistent. Could we have an extra rootfs-like fs that is always completely empty, doesn't allow any writes, and can sit at the bottom of container namespace hierarchies? If so, and if we add a new syscall that's like pivot_root (or unshare) but prunes the hierarchy, then we could switch to that rootfs then. Or equally have something that guarantees that rootfs is empty and read-only at the time the normal root filesystem is mounted. That is certainly a much more localized change if we want to go there. I am half tempted to suggest that mount --move /some/path / be updated to make the old / just go away (perhaps to be replaced with a read-only empty rootfs). That gets us into figuring out if we break userspace which is a big challenge. Hence my argument for a new syscall or entirely new operation. I'm still waiting for somebody to explain to my why chroot() shouldn't be changed to do this instead of adding a new syscall. (At least when mount namespace support is enabled.) Because chroot has no effect on the namespace at all. If you fork and the child chroots, the parent isn't chrooted. And, more importantly for my example, is a process has it's cwd as /foo, and then it forks and the child chroots, then parent's .. isn't changed as a result of the chroot. mount(2) and friends are way too multiplexed right now. I just found yet another security bug due to the insanely complicated semantics of the vfs syscalls. (Yes, a different one from the one yesterday.) As the guy who rewrote busybox mount 3 times, and who just implemented a brand new one (toybox) from scratch: It's a bit fiddly, yes. A new operation kills several birds with one stone. It could look like: int mntns_change_root(int dfd, const char *path, int flags); return -EPERM if chrooted. Really? Now that CVE-2014-7970 is public: what the heck is pivot_root supposed to do if the caller is chrooted? The current behavior is obviously incorrect (it leaks memory), but it's not entirely clear to me what should happen. I think it should either be disallowed or should have well-defined semantics. For simplicity, if a new syscall for this is added, then I think that the caller-is-chrooted case should be disallowed. If someone needs it and can articulate what the semantics should be, then I have no problem with allowing it going forward. It's not that I'd have a need for that, but rather if for some reason I started out chrooted due to some bogus initramfs, I'd prefer to not have to feel like a criminial and escape the chroot first. You already can't create a userns if you're chrooted (even if you have global privilege). --Andy -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
On Tue, Oct 7, 2014 at 5:20 PM, Eric W. Biederman wrote: > Andy Lutomirski writes: > >> On Tue, Oct 7, 2014 at 4:42 PM, Eric W. Biederman >> wrote: >>> Andy Lutomirski writes: >>> On Tue, Oct 7, 2014 at 3:42 PM, Eric W. Biederman wrote: > Andy Lutomirski writes: > >> On Tue, Oct 7, 2014 at 2:42 PM, Eric W. Biederman >> wrote: >>> >>> I am squinting and looking this way and that but while I can imagine >>> someone more clever than I can think up some unique property of rootfs >>> that makes it a little more exploitable than just mounting a ramfs, >>> but since you have to be root to exploit those properties I think the >>> game is pretty much lost. >> >> Yes. rootfs might not be empty, it might have totally insane >> permissions, and it's globally shared, which makes it into a wonderful >> channel to pass things around that shouldn't be passed around. > > But if only root with proc mounted can reach it... I don't know. It doesn't have to be global root. It could be userns root. > There might be a case for setting MNT_LOCKED when we overmount "/" > as root but I don't yet see it. > >> Can non-root do this? You'd need to be in a userns with a "/" that >> isn't MNT_LOCKED. Can this happen on any normal setup? >> >> FWIW, I think we should unconditionally MNT_LOCKED the root on userns >> unshare, even if it's the only mount. > > To the best of my knowledge MNT_LOCKED is set uncondintially on userns > unshare. Only if list_empty(>mnt_expire), whatever that means, I think. >>> >>> An autofs or nfs automounted mount. Can those ever become root? >> >> Dunno. >> >> I thought that this code was what set MNT_LOCKED for things that >> should be locked: > > It does. > >> /* Don't allow unprivileged users to reveal what is under a mount */ >> if ((flag & CL_UNPRIVILEGED) && list_empty(>mnt_expire)) >> mnt->mnt.mnt_flags |= MNT_LOCKED; >> >> Now I'm confused. Shouldn't that be checking for submounts? Is that >> what it's doing? > > As it copies each mount (mostly submounts) it sets MNT_LOCKED. Oh. They're *all* MNT_LOCKED. Duh. > >> Anyway, I think that this should unconditionally set MNT_LOCKED on the >> thing that mounted on rootfs. > > As I read the code mnt_set_expiry is only for nfs, cifs, and afs > submounts (and perhaps proc should use them). So as they are generated > mnt_expiry should never start on the root of filesystem of the mount tree. > > Furthermore mnt_expiry is cleared when a mount is moved, and when > it is bind mounted, or propagated. > > pivot_root does look as though it may be missing the proper mnt_expiry > handling list_del_init(>mnt_expire), but pivot_root does have > proper MNT_LOCKED handling. pivot_root is quite broken, as noted in my other email. It's just not broken like this, I think. :) > > Looking that test was slightly off and it should be: > if ((flag & CL_UNPRIVILEGED) && > (!(flag & CL_EXPIRE) || list_empty(>mnt_expire)) > mnt->mnt.mnt_flags |= MNT_LOCKED; > > So on that note we might clear CL_EXPIRE when CL_UNPRIVILEGED is set > in copy_tree but I don't see that as being really needed. > > Eric -- Andy Lutomirski AMA Capital Management, LLC -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
Andy Lutomirski writes: > On Tue, Oct 7, 2014 at 4:42 PM, Eric W. Biederman > wrote: >> Andy Lutomirski writes: >> >>> On Tue, Oct 7, 2014 at 3:42 PM, Eric W. Biederman >>> wrote: Andy Lutomirski writes: > On Tue, Oct 7, 2014 at 2:42 PM, Eric W. Biederman > wrote: >> >> I am squinting and looking this way and that but while I can imagine >> someone more clever than I can think up some unique property of rootfs >> that makes it a little more exploitable than just mounting a ramfs, >> but since you have to be root to exploit those properties I think the >> game is pretty much lost. > > Yes. rootfs might not be empty, it might have totally insane > permissions, and it's globally shared, which makes it into a wonderful > channel to pass things around that shouldn't be passed around. But if only root with proc mounted can reach it... I don't know. >>> >>> It doesn't have to be global root. It could be userns root. >>> There might be a case for setting MNT_LOCKED when we overmount "/" as root but I don't yet see it. > Can non-root do this? You'd need to be in a userns with a "/" that > isn't MNT_LOCKED. Can this happen on any normal setup? > > FWIW, I think we should unconditionally MNT_LOCKED the root on userns > unshare, even if it's the only mount. To the best of my knowledge MNT_LOCKED is set uncondintially on userns unshare. >>> >>> Only if list_empty(>mnt_expire), whatever that means, I think. >> >> An autofs or nfs automounted mount. Can those ever become root? > > Dunno. > > I thought that this code was what set MNT_LOCKED for things that > should be locked: It does. > /* Don't allow unprivileged users to reveal what is under a mount */ > if ((flag & CL_UNPRIVILEGED) && list_empty(>mnt_expire)) > mnt->mnt.mnt_flags |= MNT_LOCKED; > > Now I'm confused. Shouldn't that be checking for submounts? Is that > what it's doing? As it copies each mount (mostly submounts) it sets MNT_LOCKED. > Anyway, I think that this should unconditionally set MNT_LOCKED on the > thing that mounted on rootfs. As I read the code mnt_set_expiry is only for nfs, cifs, and afs submounts (and perhaps proc should use them). So as they are generated mnt_expiry should never start on the root of filesystem of the mount tree. Furthermore mnt_expiry is cleared when a mount is moved, and when it is bind mounted, or propagated. pivot_root does look as though it may be missing the proper mnt_expiry handling list_del_init(>mnt_expire), but pivot_root does have proper MNT_LOCKED handling. Looking that test was slightly off and it should be: if ((flag & CL_UNPRIVILEGED) && (!(flag & CL_EXPIRE) || list_empty(>mnt_expire)) mnt->mnt.mnt_flags |= MNT_LOCKED; So on that note we might clear CL_EXPIRE when CL_UNPRIVILEGED is set in copy_tree but I don't see that as being really needed. Eric -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
On Tue, Oct 7, 2014 at 4:42 PM, Eric W. Biederman wrote: > Andy Lutomirski writes: > >> On Tue, Oct 7, 2014 at 3:42 PM, Eric W. Biederman >> wrote: >>> Andy Lutomirski writes: >>> On Tue, Oct 7, 2014 at 2:42 PM, Eric W. Biederman wrote: > > I am squinting and looking this way and that but while I can imagine > someone more clever than I can think up some unique property of rootfs > that makes it a little more exploitable than just mounting a ramfs, > but since you have to be root to exploit those properties I think the > game is pretty much lost. Yes. rootfs might not be empty, it might have totally insane permissions, and it's globally shared, which makes it into a wonderful channel to pass things around that shouldn't be passed around. >>> >>> But if only root with proc mounted can reach it... I don't know. >> >> It doesn't have to be global root. It could be userns root. >> >>> There might be a case for setting MNT_LOCKED when we overmount "/" >>> as root but I don't yet see it. >>> Can non-root do this? You'd need to be in a userns with a "/" that isn't MNT_LOCKED. Can this happen on any normal setup? FWIW, I think we should unconditionally MNT_LOCKED the root on userns unshare, even if it's the only mount. >>> >>> To the best of my knowledge MNT_LOCKED is set uncondintially on userns >>> unshare. >> >> Only if list_empty(>mnt_expire), whatever that means, I think. > > An autofs or nfs automounted mount. Can those ever become root? Dunno. I thought that this code was what set MNT_LOCKED for things that should be locked: /* Don't allow unprivileged users to reveal what is under a mount */ if ((flag & CL_UNPRIVILEGED) && list_empty(>mnt_expire)) mnt->mnt.mnt_flags |= MNT_LOCKED; Now I'm confused. Shouldn't that be checking for submounts? Is that what it's doing? Anyway, I think that this should unconditionally set MNT_LOCKED on the thing that mounted on rootfs. --Andy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
Andy Lutomirski writes: > On Tue, Oct 7, 2014 at 3:42 PM, Eric W. Biederman > wrote: >> Andy Lutomirski writes: >> >>> On Tue, Oct 7, 2014 at 2:42 PM, Eric W. Biederman >>> wrote: I am squinting and looking this way and that but while I can imagine someone more clever than I can think up some unique property of rootfs that makes it a little more exploitable than just mounting a ramfs, but since you have to be root to exploit those properties I think the game is pretty much lost. >>> >>> Yes. rootfs might not be empty, it might have totally insane >>> permissions, and it's globally shared, which makes it into a wonderful >>> channel to pass things around that shouldn't be passed around. >> >> But if only root with proc mounted can reach it... I don't know. > > It doesn't have to be global root. It could be userns root. > >> There might be a case for setting MNT_LOCKED when we overmount "/" >> as root but I don't yet see it. >> >>> Can non-root do this? You'd need to be in a userns with a "/" that >>> isn't MNT_LOCKED. Can this happen on any normal setup? >>> >>> FWIW, I think we should unconditionally MNT_LOCKED the root on userns >>> unshare, even if it's the only mount. >> >> To the best of my knowledge MNT_LOCKED is set uncondintially on userns >> unshare. > > Only if list_empty(>mnt_expire), whatever that means, I think. An autofs or nfs automounted mount. Can those ever become root? Eric -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
On Tue, Oct 7, 2014 at 3:42 PM, Eric W. Biederman wrote: > Andy Lutomirski writes: > >> On Tue, Oct 7, 2014 at 2:42 PM, Eric W. Biederman >> wrote: >>> >>> I am squinting and looking this way and that but while I can imagine >>> someone more clever than I can think up some unique property of rootfs >>> that makes it a little more exploitable than just mounting a ramfs, >>> but since you have to be root to exploit those properties I think the >>> game is pretty much lost. >> >> Yes. rootfs might not be empty, it might have totally insane >> permissions, and it's globally shared, which makes it into a wonderful >> channel to pass things around that shouldn't be passed around. > > But if only root with proc mounted can reach it... I don't know. It doesn't have to be global root. It could be userns root. > There might be a case for setting MNT_LOCKED when we overmount "/" > as root but I don't yet see it. > >> Can non-root do this? You'd need to be in a userns with a "/" that >> isn't MNT_LOCKED. Can this happen on any normal setup? >> >> FWIW, I think we should unconditionally MNT_LOCKED the root on userns >> unshare, even if it's the only mount. > > To the best of my knowledge MNT_LOCKED is set uncondintially on userns > unshare. Only if list_empty(>mnt_expire), whatever that means, I think. --Andy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
Andy Lutomirski writes: > On Tue, Oct 7, 2014 at 2:42 PM, Eric W. Biederman > wrote: >> >> I am squinting and looking this way and that but while I can imagine >> someone more clever than I can think up some unique property of rootfs >> that makes it a little more exploitable than just mounting a ramfs, >> but since you have to be root to exploit those properties I think the >> game is pretty much lost. > > Yes. rootfs might not be empty, it might have totally insane > permissions, and it's globally shared, which makes it into a wonderful > channel to pass things around that shouldn't be passed around. But if only root with proc mounted can reach it... I don't know. There might be a case for setting MNT_LOCKED when we overmount "/" as root but I don't yet see it. > Can non-root do this? You'd need to be in a userns with a "/" that > isn't MNT_LOCKED. Can this happen on any normal setup? > > FWIW, I think we should unconditionally MNT_LOCKED the root on userns > unshare, even if it's the only mount. To the best of my knowledge MNT_LOCKED is set uncondintially on userns unshare. Eric -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
On Tue, Oct 7, 2014 at 2:42 PM, Eric W. Biederman wrote: > > I am squinting and looking this way and that but while I can imagine > someone more clever than I can think up some unique property of rootfs > that makes it a little more exploitable than just mounting a ramfs, > but since you have to be root to exploit those properties I think the > game is pretty much lost. Yes. rootfs might not be empty, it might have totally insane permissions, and it's globally shared, which makes it into a wonderful channel to pass things around that shouldn't be passed around. Can non-root do this? You'd need to be in a userns with a "/" that isn't MNT_LOCKED. Can this happen on any normal setup? FWIW, I think we should unconditionally MNT_LOCKED the root on userns unshare, even if it's the only mount. > >>> >> So it is only root (and not root in a container) who can get to the >>> >> exposed rootfs. >>> >> >>> >> I have a vague memory someone actually had a real use in miminal systems >>> >> for being able to get back to the rootfs and being able to use rootfs as >>> >> the rootfs. There was even a patch at that time that Andrew Morton was >>> >> carrying for a time to allow unmounting root and get at rootfs, and to >>> >> prevent the oops on rootfs unmount in some way. >>> >> >>> >> So not only do I not think it is a bug to get back too rootfs, I think >>> >> it is a feature that some people have expressed at least half-way sane >>> >> uses for. >>> > >>> > They can still do that if they want, using chroot :) >>> >>> It would take fchdir or fchroot and a directory file descriptor open on >>> rootfs. Frequently there is no appropriate directory file descriptor. >> >> ? you can always escape if you're simply chrooted. waterbuffalo :) > > filesystem type rootfs. > > Eric > > -- > To unsubscribe from this list: send the line "unsubscribe linux-api" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Andy Lutomirski AMA Capital Management, LLC -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
On Tue, Oct 7, 2014 at 2:50 PM, Eric W. Biederman wrote: > Andy Lutomirski writes: > >> On Tue, Oct 7, 2014 at 2:26 PM, Eric W. Biederman >> wrote: >>> Andy Lutomirski writes: >>> Why should MNT_LOCKED on submounts be enforced? Is it because, if you retain a reference to the detached tree, then you can see under the submounts? >>> >>> Yes. MNT_DETACH is a recursive operation that detaches all of the mount >>> and all of it's submounts. Which means you can see under the submounts >>> if you have a reference to a detached mount. >>> If so, let's fix *that*. Because otherwise the whole model of pivot_root + detach will break. >>> >>> I am not certain what you are referring to. pivot_root doesn't >>> manipulate the mount tree so you can see under anything. >>> >>> What I believe is the appropriate fix is to fail umount2(...,MNT_DETACH) >>> if there are any referenced mount points being detached that have a >>> locked submount. >> >> Most of the container-using things do, roughly: >> >> Unshare userns and mountns >> Mount some new stuff >> pivot_root to the new stuff >> MNT_DETACH the old. >> >> That last step will almost always fail if you make this change. > > I don't think so. > > I expect I could add full busy detection of normal umounts and those > applications would not fail. > > What I am proposing is a more targeted version of busy detection that > looks at each mount in the set that detach will unmount. For each mount > if it is busy with non-submount references and it has at least one > locked submount fail the detach with -EBUSY. > > Do you really think we have userspace references to the one or more of the > mounts under old? > I suspect that we have a userspace reference to old itself. --Andy > Eric -- Andy Lutomirski AMA Capital Management, LLC -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
Andy Lutomirski writes: > On Tue, Oct 7, 2014 at 2:26 PM, Eric W. Biederman > wrote: >> Andy Lutomirski writes: >> >>> Why should MNT_LOCKED on submounts be enforced? >>> >>> Is it because, if you retain a reference to the detached tree, then >>> you can see under the submounts? >> >> Yes. MNT_DETACH is a recursive operation that detaches all of the mount >> and all of it's submounts. Which means you can see under the submounts >> if you have a reference to a detached mount. >> >>> If so, let's fix *that*. Because >>> otherwise the whole model of pivot_root + detach will break. >> >> I am not certain what you are referring to. pivot_root doesn't >> manipulate the mount tree so you can see under anything. >> >> What I believe is the appropriate fix is to fail umount2(...,MNT_DETACH) >> if there are any referenced mount points being detached that have a >> locked submount. > > Most of the container-using things do, roughly: > > Unshare userns and mountns > Mount some new stuff > pivot_root to the new stuff > MNT_DETACH the old. > > That last step will almost always fail if you make this change. I don't think so. I expect I could add full busy detection of normal umounts and those applications would not fail. What I am proposing is a more targeted version of busy detection that looks at each mount in the set that detach will unmount. For each mount if it is busy with non-submount references and it has at least one locked submount fail the detach with -EBUSY. Do you really think we have userspace references to the one or more of the mounts under old? Eric -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
Serge Hallyn writes: > Quoting Eric W. Biederman (ebied...@xmission.com): >> What I meant is that it isn't about containers. It is about something >> root can do. So this is not a "container" problem. > > Oh, ok. > > Sorry, I'm getting the two thread confused anyway. I'm going to bow out > here until I can pay proper attention. I think there is half a point here that may be legit, when you are using mount namespaces to jail applications, there may be a problem with umounting / and making it to the underlying rootfs filesystem. I am squinting and looking this way and that but while I can imagine someone more clever than I can think up some unique property of rootfs that makes it a little more exploitable than just mounting a ramfs, but since you have to be root to exploit those properties I think the game is pretty much lost. >> >> So it is only root (and not root in a container) who can get to the >> >> exposed rootfs. >> >> >> >> I have a vague memory someone actually had a real use in miminal systems >> >> for being able to get back to the rootfs and being able to use rootfs as >> >> the rootfs. There was even a patch at that time that Andrew Morton was >> >> carrying for a time to allow unmounting root and get at rootfs, and to >> >> prevent the oops on rootfs unmount in some way. >> >> >> >> So not only do I not think it is a bug to get back too rootfs, I think >> >> it is a feature that some people have expressed at least half-way sane >> >> uses for. >> > >> > They can still do that if they want, using chroot :) >> >> It would take fchdir or fchroot and a directory file descriptor open on >> rootfs. Frequently there is no appropriate directory file descriptor. > > ? you can always escape if you're simply chrooted. waterbuffalo :) filesystem type rootfs. Eric -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
On Tue, Oct 7, 2014 at 2:26 PM, Eric W. Biederman wrote: > Andy Lutomirski writes: > >> Why should MNT_LOCKED on submounts be enforced? >> >> Is it because, if you retain a reference to the detached tree, then >> you can see under the submounts? > > Yes. MNT_DETACH is a recursive operation that detaches all of the mount > and all of it's submounts. Which means you can see under the submounts > if you have a reference to a detached mount. > >> If so, let's fix *that*. Because >> otherwise the whole model of pivot_root + detach will break. > > I am not certain what you are referring to. pivot_root doesn't > manipulate the mount tree so you can see under anything. > > What I believe is the appropriate fix is to fail umount2(...,MNT_DETACH) > if there are any referenced mount points being detached that have a > locked submount. Most of the container-using things do, roughly: Unshare userns and mountns Mount some new stuff pivot_root to the new stuff MNT_DETACH the old. That last step will almost always fail if you make this change. --Andy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
Quoting Andy Lutomirski (l...@amacapital.net): > On Tue, Oct 7, 2014 at 1:30 PM, Eric W. Biederman > wrote: > > Al Viro writes: > > > > 2> On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote: > >>> On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote: > >>> > Another problem is that rootfs can't be hidden from a container, because > >>> > rootfs can't be moved or umounted. > >>> > >>> ... which is a bug in mntns_install(), AFAICS. > >> > >> Ability to get to exposed rootfs, that is. > > > > The container side of this argument is pretty bogus. It only applies > > if user namespaces are not used for the container. > > > > So it is only root (and not root in a container) who can get to the > > exposed rootfs. > > > > I have a vague memory someone actually had a real use in miminal systems > > for being able to get back to the rootfs and being able to use rootfs as > > the rootfs. There was even a patch at that time that Andrew Morton was > > carrying for a time to allow unmounting root and get at rootfs, and to > > prevent the oops on rootfs unmount in some way. > > > > So not only do I not think it is a bug to get back too rootfs, I think > > it is a feature that some people have expressed at least half-way sane > > uses for. > > > >>> > Here is an example how to get access to rootfs: > >>> > fd = open("/proc/self/ns/mnt", O_RDONLY) > >>> > umount2("/", MNT_DETACH); > >>> > setns(fd, CLONE_NEWNS) > >>> > > >>> > rootfs may contain data, which should not be avaliable in CT-s. > >>> > >>> Indeed. > >> > >> ... and it looks like the above is what your mangled reproducer in previous > >> patch had been made of - > >> fd = open("/proc/self/ns/mnt", O_RDONLY) > >> umount2("/", MNT_DETACH); > >> setns(fd, CLONE_NEWNS) > >> umount2("/", MNT_DETACH); > >> > >> IMO what it shows is setns() bug. This "switch root/cwd, no matter what" > >> is wrong. > > > > IMO the bug is allowing us to unmount things that should never be unmounted. > > > > In a mount namespace created with just user namespace permissions we > > can't get at rootfs because MNT_LOCKED is set on the root directory > > and thus it can not be mounted. > > > > Further if anyone has permission to call chroot and chdir on any mount > > in a mount namespace (that isn't currently covered) they can get at all > > of them that are not currently covered. A mount namespace where no one > > can get at any uncovered filesystem seems to be the definition of > > useless and ridiculous. > > > > > > Now there is a bug in that MNT_DETACH today does not currently enforce > > MNT_LOCKED on submounts of the mount point that is detached. I am > > currently looking at how to construct the appropriate permission check > > to prevent that. Unfortunately I can not disallow MNT_DETACH with > > submounts all together as that breaks too many legitimate uses. > > Why should MNT_LOCKED on submounts be enforced? > > Is it because, if you retain a reference to the detached tree, then > you can see under the submounts? If so, let's fix *that*. Because > otherwise the whole model of pivot_root + detach will break. > > Also, damn it, we need change_the_ns_root instead of pivot_root. I > doubt that any container programs actually want to keep the old root > attached after pivot_root. Right I think that'll fix the problem we were having, and I think Andrey said the same thing in another list a few days ago. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
Quoting Eric W. Biederman (ebied...@xmission.com): > Serge Hallyn writes: > > > Quoting Eric W. Biederman (ebied...@xmission.com): > >> Al Viro writes: > >> > >> 2> On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote: > >> >> On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote: > >> >> > Another problem is that rootfs can't be hidden from a container, > >> >> > because > >> >> > rootfs can't be moved or umounted. > >> >> > >> >> ... which is a bug in mntns_install(), AFAICS. > >> > > >> > Ability to get to exposed rootfs, that is. > >> > >> The container side of this argument is pretty bogus. It only applies > >> if user namespaces are not used for the container. > > > > User namespaces are still far too restricted for many container use > > cases. We can't say "we have user namespaces so now privileged > > containers can be ignored". Yes you never should have handed the > > keys to a privileged container to an untrusted person, but we do > > still try to protect the host from accidental damage due to a > > privileged container. > > What I meant is that it isn't about containers. It is about something > root can do. So this is not a "container" problem. Oh, ok. Sorry, I'm getting the two thread confused anyway. I'm going to bow out here until I can pay proper attention. > >> So it is only root (and not root in a container) who can get to the > >> exposed rootfs. > >> > >> I have a vague memory someone actually had a real use in miminal systems > >> for being able to get back to the rootfs and being able to use rootfs as > >> the rootfs. There was even a patch at that time that Andrew Morton was > >> carrying for a time to allow unmounting root and get at rootfs, and to > >> prevent the oops on rootfs unmount in some way. > >> > >> So not only do I not think it is a bug to get back too rootfs, I think > >> it is a feature that some people have expressed at least half-way sane > >> uses for. > > > > They can still do that if they want, using chroot :) > > It would take fchdir or fchroot and a directory file descriptor open on > rootfs. Frequently there is no appropriate directory file descriptor. ? you can always escape if you're simply chrooted. waterbuffalo :) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
Andy Lutomirski writes: > Why should MNT_LOCKED on submounts be enforced? > > Is it because, if you retain a reference to the detached tree, then > you can see under the submounts? Yes. MNT_DETACH is a recursive operation that detaches all of the mount and all of it's submounts. Which means you can see under the submounts if you have a reference to a detached mount. > If so, let's fix *that*. Because > otherwise the whole model of pivot_root + detach will break. I am not certain what you are referring to. pivot_root doesn't manipulate the mount tree so you can see under anything. What I believe is the appropriate fix is to fail umount2(...,MNT_DETACH) if there are any referenced mount points being detached that have a locked submount. > Also, damn it, we need change_the_ns_root instead of pivot_root. I > doubt that any container programs actually want to keep the old root > attached after pivot_root. Shrug. Except for chroot_fs_refs() pivot_root is a cheap. I'm not particularly in favor of merging pivot_root and umount2. The number of weird cases in the current api are high. A merged piece of code would just make them higher. I am hoping that one more round of bug fixing will at least get the bugs for having unprivileged mounts fixed in the current API. Eric -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
On Tue, Oct 7, 2014 at 1:30 PM, Eric W. Biederman wrote: > Al Viro writes: > > 2> On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote: >>> On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote: >>> > Another problem is that rootfs can't be hidden from a container, because >>> > rootfs can't be moved or umounted. >>> >>> ... which is a bug in mntns_install(), AFAICS. >> >> Ability to get to exposed rootfs, that is. > > The container side of this argument is pretty bogus. It only applies > if user namespaces are not used for the container. > > So it is only root (and not root in a container) who can get to the > exposed rootfs. > > I have a vague memory someone actually had a real use in miminal systems > for being able to get back to the rootfs and being able to use rootfs as > the rootfs. There was even a patch at that time that Andrew Morton was > carrying for a time to allow unmounting root and get at rootfs, and to > prevent the oops on rootfs unmount in some way. > > So not only do I not think it is a bug to get back too rootfs, I think > it is a feature that some people have expressed at least half-way sane > uses for. > >>> > Here is an example how to get access to rootfs: >>> > fd = open("/proc/self/ns/mnt", O_RDONLY) >>> > umount2("/", MNT_DETACH); >>> > setns(fd, CLONE_NEWNS) >>> > >>> > rootfs may contain data, which should not be avaliable in CT-s. >>> >>> Indeed. >> >> ... and it looks like the above is what your mangled reproducer in previous >> patch had been made of - >> fd = open("/proc/self/ns/mnt", O_RDONLY) >> umount2("/", MNT_DETACH); >> setns(fd, CLONE_NEWNS) >> umount2("/", MNT_DETACH); >> >> IMO what it shows is setns() bug. This "switch root/cwd, no matter what" >> is wrong. > > IMO the bug is allowing us to unmount things that should never be unmounted. > > In a mount namespace created with just user namespace permissions we > can't get at rootfs because MNT_LOCKED is set on the root directory > and thus it can not be mounted. > > Further if anyone has permission to call chroot and chdir on any mount > in a mount namespace (that isn't currently covered) they can get at all > of them that are not currently covered. A mount namespace where no one > can get at any uncovered filesystem seems to be the definition of > useless and ridiculous. > > > Now there is a bug in that MNT_DETACH today does not currently enforce > MNT_LOCKED on submounts of the mount point that is detached. I am > currently looking at how to construct the appropriate permission check > to prevent that. Unfortunately I can not disallow MNT_DETACH with > submounts all together as that breaks too many legitimate uses. Why should MNT_LOCKED on submounts be enforced? Is it because, if you retain a reference to the detached tree, then you can see under the submounts? If so, let's fix *that*. Because otherwise the whole model of pivot_root + detach will break. Also, damn it, we need change_the_ns_root instead of pivot_root. I doubt that any container programs actually want to keep the old root attached after pivot_root. --Andy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
Serge Hallyn writes: > Quoting Eric W. Biederman (ebied...@xmission.com): >> Al Viro writes: >> >> 2> On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote: >> >> On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote: >> >> > Another problem is that rootfs can't be hidden from a container, because >> >> > rootfs can't be moved or umounted. >> >> >> >> ... which is a bug in mntns_install(), AFAICS. >> > >> > Ability to get to exposed rootfs, that is. >> >> The container side of this argument is pretty bogus. It only applies >> if user namespaces are not used for the container. > > User namespaces are still far too restricted for many container use > cases. We can't say "we have user namespaces so now privileged > containers can be ignored". Yes you never should have handed the > keys to a privileged container to an untrusted person, but we do > still try to protect the host from accidental damage due to a > privileged container. What I meant is that it isn't about containers. It is about something root can do. So this is not a "container" problem. >> So it is only root (and not root in a container) who can get to the >> exposed rootfs. >> >> I have a vague memory someone actually had a real use in miminal systems >> for being able to get back to the rootfs and being able to use rootfs as >> the rootfs. There was even a patch at that time that Andrew Morton was >> carrying for a time to allow unmounting root and get at rootfs, and to >> prevent the oops on rootfs unmount in some way. >> >> So not only do I not think it is a bug to get back too rootfs, I think >> it is a feature that some people have expressed at least half-way sane >> uses for. > > They can still do that if they want, using chroot :) It would take fchdir or fchroot and a directory file descriptor open on rootfs. Frequently there is no appropriate directory file descriptor. Eric -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
Quoting Eric W. Biederman (ebied...@xmission.com): > Al Viro writes: > > 2> On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote: > >> On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote: > >> > Another problem is that rootfs can't be hidden from a container, because > >> > rootfs can't be moved or umounted. > >> > >> ... which is a bug in mntns_install(), AFAICS. > > > > Ability to get to exposed rootfs, that is. > > The container side of this argument is pretty bogus. It only applies > if user namespaces are not used for the container. User namespaces are still far too restricted for many container use cases. We can't say "we have user namespaces so now privileged containers can be ignored". Yes you never should have handed the keys to a privileged container to an untrusted person, but we do still try to protect the host from accidental damage due to a privileged container. > So it is only root (and not root in a container) who can get to the > exposed rootfs. > > I have a vague memory someone actually had a real use in miminal systems > for being able to get back to the rootfs and being able to use rootfs as > the rootfs. There was even a patch at that time that Andrew Morton was > carrying for a time to allow unmounting root and get at rootfs, and to > prevent the oops on rootfs unmount in some way. > > So not only do I not think it is a bug to get back too rootfs, I think > it is a feature that some people have expressed at least half-way sane > uses for. They can still do that if they want, using chroot :) > >> > Here is an example how to get access to rootfs: > >> > fd = open("/proc/self/ns/mnt", O_RDONLY) > >> > umount2("/", MNT_DETACH); > >> > setns(fd, CLONE_NEWNS) > >> > > >> > rootfs may contain data, which should not be avaliable in CT-s. > >> > >> Indeed. > > > > ... and it looks like the above is what your mangled reproducer in previous > > patch had been made of - > > fd = open("/proc/self/ns/mnt", O_RDONLY) > > umount2("/", MNT_DETACH); > > setns(fd, CLONE_NEWNS) > > umount2("/", MNT_DETACH); > > > > IMO what it shows is setns() bug. This "switch root/cwd, no matter what" > > is wrong. > > IMO the bug is allowing us to unmount things that should never be unmounted. > > In a mount namespace created with just user namespace permissions we > can't get at rootfs because MNT_LOCKED is set on the root directory > and thus it can not be mounted. > > Further if anyone has permission to call chroot and chdir on any mount > in a mount namespace (that isn't currently covered) they can get at all > of them that are not currently covered. A mount namespace where no one > can get at any uncovered filesystem seems to be the definition of > useless and ridiculous. > > > Now there is a bug in that MNT_DETACH today does not currently enforce > MNT_LOCKED on submounts of the mount point that is detached. I am > currently looking at how to construct the appropriate permission check > to prevent that. Unfortunately I can not disallow MNT_DETACH with > submounts all together as that breaks too many legitimate uses. > > That failure to enforce MNT_LOCKED is my mistake. I had a naive notion > that submounts would remain mounted after a mount detach and I misread > the code when I did the original work. My mistake. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
Andrey Vagin writes: > From: Andrey Vagin > > Currently when we create a new container with a separate root, > we need to clone the current mount namespace with all mounts and then > clean up it by using pivot_root(). A big part of mountpoints are cloned > only to be umounted. Is the motivation performance? Because if that is the motivation we need numbers. > Another problem is that rootfs can't be hidden from a container, because > rootfs can't be moved or umounted. > > Here is an example how to get access to rootfs: > fd = open("/proc/self/ns/mnt", O_RDONLY) > umount2("/", MNT_DETACH); > setns(fd, CLONE_NEWNS) > > rootfs may contain data, which should not be avaliable in CT-s. Well don't give those containers CAP_SYS_ADMIN. If you aren't using user namespaces there is no expectation of safety from those kinds of problems. Getting at rootfs is perfectly valid for root. > I suggest to add ability to create a mount namespace with specified > mount points. A current task root can be used as a root for the new > mount namespace. I really don't think you are going to like the result because you will loose access to /proc and /sys. > With this patch you can call chroot(ct->rootfs) and > unshare(UNSHARE_NEWNS2) to get a clean mount namespace. That is a little bit of an ugly way to smuggle a parameter into the creation of a mount namespace. Further I am pretty certain this patch totally breaks the setting of new_fs->root and new_fs->pwd. In net my opinion is that the code doesn't work and does not provide sufficient justification for a new system call. > UNSHARE_NEWNS2 can be used only with the unshare() syscall. The clone() > syscall doesn't have unused flags. > > Here is an example how it looks like: > $ cat ../../unshare.c > > int main(int argc, char **argv) > { /* You left out * mount --bind /some/root/path /some/root/path * chroot /some/root/path */ > if (unshare(UNSHARE_NEWNS2)) > return 1; > > execl("/bin/bash", "/bin/bash", NULL); > return 1; > } > $ mount --bind test/ubuntu/ test/ubuntu/ > $ cd test/ubuntu/ > $ chroot . > $ ./unshare2 > $ mount -t proc proc proc > $ cat /proc/self/mountinfo > 55 55 252:1 /home/avagin/test/ubuntu / rw,relatime - ext4 > /dev/disk/by-uuid/d672b85f-533c-4868-9609-ca80be52d3c6 > rw,errors=remount-ro,data=ordered > 56 55 0:3 / /proc rw,relatime - proc proc rw > > Cc: Alexander Viro > Cc: Andrew Morton > Cc: "Eric W. Biederman" > Cc: Cyrill Gorcunov > Cc: Pavel Emelyanov > Cc: Serge Hallyn > Cc: Rob Landley > Signed-off-by: Andrey Vagin > --- > fs/namespace.c | 16 ++-- > include/uapi/linux/sched.h | 8 > kernel/fork.c | 11 --- > kernel/nsproxy.c | 2 +- > 4 files changed, 31 insertions(+), 6 deletions(-) > > diff --git a/fs/namespace.c b/fs/namespace.c > index 730c50e..f50a848 100644 > --- a/fs/namespace.c > +++ b/fs/namespace.c > @@ -2569,12 +2569,24 @@ struct mnt_namespace *copy_mnt_ns(unsigned long > flags, struct mnt_namespace *ns, > > BUG_ON(!ns); > > - if (likely(!(flags & CLONE_NEWNS))) { > + if (likely(!(flags & (CLONE_NEWNS | UNSHARE_NEWNS2 { > get_mnt_ns(ns); > return ns; > } > > - old = ns->root; > + if (flags & CLONE_NEWNS) > + old = ns->root; > + else { /* UNSHARE_NEWNS2 */ > + struct path root; > + > + get_fs_root(current->fs, ); > + if (root.mnt->mnt_root != root.dentry) { > + path_put(); > + return ERR_PTR(-EINVAL); /* not a mountpoint */ > + } > + old = real_mount(root.mnt); > + path_put(); > + } > > new_ns = alloc_mnt_ns(user_ns); > if (IS_ERR(new_ns)) > diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h > index 34f9d73..8092e50 100644 > --- a/include/uapi/linux/sched.h > +++ b/include/uapi/linux/sched.h > @@ -31,6 +31,14 @@ > #define CLONE_IO 0x8000 /* Clone io context */ > > /* > + * Following flags can be used only with unshare(), because > + * they are intersected with CSIGNAL > + */ > +#define UNSHARE_NEWNS2 0x0001 /* Clone mnt namespace > starting with the current task root. */ > + > +#define UNSHARE_FLAGS(UNSHARE_NEWNS2) > + > +/* > * Scheduling policies > */ > #define SCHED_NORMAL 0 > diff --git a/kernel/fork.c b/kernel/fork.c > index 0cf9cdb..52f1fc0 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -1381,7 +1381,12 @@ static struct task_struct *copy_process(unsigned long > clone_flags, > retval = copy_mm(clone_flags, p); > if (retval) > goto bad_fork_cleanup_signal; > - retval = copy_namespaces(clone_flags, p); > + > + /* > + * CSIGNAL and UNSHARE_FLAGS are intersected, but > + * UNSHARE_FLAGS can't be used with clone(). > + */ >
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
Al Viro writes: 2> On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote: >> On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote: >> > Another problem is that rootfs can't be hidden from a container, because >> > rootfs can't be moved or umounted. >> >> ... which is a bug in mntns_install(), AFAICS. > > Ability to get to exposed rootfs, that is. The container side of this argument is pretty bogus. It only applies if user namespaces are not used for the container. So it is only root (and not root in a container) who can get to the exposed rootfs. I have a vague memory someone actually had a real use in miminal systems for being able to get back to the rootfs and being able to use rootfs as the rootfs. There was even a patch at that time that Andrew Morton was carrying for a time to allow unmounting root and get at rootfs, and to prevent the oops on rootfs unmount in some way. So not only do I not think it is a bug to get back too rootfs, I think it is a feature that some people have expressed at least half-way sane uses for. >> > Here is an example how to get access to rootfs: >> > fd = open("/proc/self/ns/mnt", O_RDONLY) >> > umount2("/", MNT_DETACH); >> > setns(fd, CLONE_NEWNS) >> > >> > rootfs may contain data, which should not be avaliable in CT-s. >> >> Indeed. > > ... and it looks like the above is what your mangled reproducer in previous > patch had been made of - > fd = open("/proc/self/ns/mnt", O_RDONLY) > umount2("/", MNT_DETACH); > setns(fd, CLONE_NEWNS) > umount2("/", MNT_DETACH); > > IMO what it shows is setns() bug. This "switch root/cwd, no matter what" > is wrong. IMO the bug is allowing us to unmount things that should never be unmounted. In a mount namespace created with just user namespace permissions we can't get at rootfs because MNT_LOCKED is set on the root directory and thus it can not be mounted. Further if anyone has permission to call chroot and chdir on any mount in a mount namespace (that isn't currently covered) they can get at all of them that are not currently covered. A mount namespace where no one can get at any uncovered filesystem seems to be the definition of useless and ridiculous. Now there is a bug in that MNT_DETACH today does not currently enforce MNT_LOCKED on submounts of the mount point that is detached. I am currently looking at how to construct the appropriate permission check to prevent that. Unfortunately I can not disallow MNT_DETACH with submounts all together as that breaks too many legitimate uses. That failure to enforce MNT_LOCKED is my mistake. I had a naive notion that submounts would remain mounted after a mount detach and I misread the code when I did the original work. My mistake. Eric -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
On Tue, Oct 07, 2014 at 02:33:39PM +0100, Al Viro wrote: > On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote: > > On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote: > > > Another problem is that rootfs can't be hidden from a container, because > > > rootfs can't be moved or umounted. > > > > ... which is a bug in mntns_install(), AFAICS. > > Ability to get to exposed rootfs, that is. > > > > Here is an example how to get access to rootfs: > > > fd = open("/proc/self/ns/mnt", O_RDONLY) > > > umount2("/", MNT_DETACH); > > > setns(fd, CLONE_NEWNS) > > > > > > rootfs may contain data, which should not be avaliable in CT-s. > > > > Indeed. > > ... and it looks like the above is what your mangled reproducer in previous > patch had been made of - > fd = open("/proc/self/ns/mnt", O_RDONLY) > umount2("/", MNT_DETACH); > setns(fd, CLONE_NEWNS) > umount2("/", MNT_DETACH); > > IMO what it shows is setns() bug. This "switch root/cwd, no matter what" > is wrong. How can we move in another namespace without this "switch"? If the task root and cwd isn't switched, they points to the old tree. In this case how can we open something from the new tree? Do you mean that root should not be changed if it points on a detached mount or on a mount from another mount namespace? This looks reasonable. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote: > On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote: > > Another problem is that rootfs can't be hidden from a container, because > > rootfs can't be moved or umounted. > > ... which is a bug in mntns_install(), AFAICS. Ability to get to exposed rootfs, that is. > > Here is an example how to get access to rootfs: > > fd = open("/proc/self/ns/mnt", O_RDONLY) > > umount2("/", MNT_DETACH); > > setns(fd, CLONE_NEWNS) > > > > rootfs may contain data, which should not be avaliable in CT-s. > > Indeed. ... and it looks like the above is what your mangled reproducer in previous patch had been made of - fd = open("/proc/self/ns/mnt", O_RDONLY) umount2("/", MNT_DETACH); setns(fd, CLONE_NEWNS) umount2("/", MNT_DETACH); IMO what it shows is setns() bug. This "switch root/cwd, no matter what" is wrong. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote: > Another problem is that rootfs can't be hidden from a container, because > rootfs can't be moved or umounted. ... which is a bug in mntns_install(), AFAICS. > Here is an example how to get access to rootfs: > fd = open("/proc/self/ns/mnt", O_RDONLY) > umount2("/", MNT_DETACH); > setns(fd, CLONE_NEWNS) > > rootfs may contain data, which should not be avaliable in CT-s. Indeed. > I suggest to add ability to create a mount namespace with specified > mount points. A current task root can be used as a root for the new > mount namespace. Yecchh... Frankly, you are opening a big can of worms - having rootfs as absolute root of all namespaces simplifies a lot of things in there. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
From: Andrey Vagin Currently when we create a new container with a separate root, we need to clone the current mount namespace with all mounts and then clean up it by using pivot_root(). A big part of mountpoints are cloned only to be umounted. Another problem is that rootfs can't be hidden from a container, because rootfs can't be moved or umounted. Here is an example how to get access to rootfs: fd = open("/proc/self/ns/mnt", O_RDONLY) umount2("/", MNT_DETACH); setns(fd, CLONE_NEWNS) rootfs may contain data, which should not be avaliable in CT-s. I suggest to add ability to create a mount namespace with specified mount points. A current task root can be used as a root for the new mount namespace. With this patch you can call chroot(ct->rootfs) and unshare(UNSHARE_NEWNS2) to get a clean mount namespace. UNSHARE_NEWNS2 can be used only with the unshare() syscall. The clone() syscall doesn't have unused flags. Here is an example how it looks like: $ cat ../../unshare.c int main(int argc, char **argv) { if (unshare(UNSHARE_NEWNS2)) return 1; execl("/bin/bash", "/bin/bash", NULL); return 1; } $ mount --bind test/ubuntu/ test/ubuntu/ $ cd test/ubuntu/ $ chroot . $ ./unshare2 $ mount -t proc proc proc $ cat /proc/self/mountinfo 55 55 252:1 /home/avagin/test/ubuntu / rw,relatime - ext4 /dev/disk/by-uuid/d672b85f-533c-4868-9609-ca80be52d3c6 rw,errors=remount-ro,data=ordered 56 55 0:3 / /proc rw,relatime - proc proc rw Cc: Alexander Viro Cc: Andrew Morton Cc: "Eric W. Biederman" Cc: Cyrill Gorcunov Cc: Pavel Emelyanov Cc: Serge Hallyn Cc: Rob Landley Signed-off-by: Andrey Vagin --- fs/namespace.c | 16 ++-- include/uapi/linux/sched.h | 8 kernel/fork.c | 11 --- kernel/nsproxy.c | 2 +- 4 files changed, 31 insertions(+), 6 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 730c50e..f50a848 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2569,12 +2569,24 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns, BUG_ON(!ns); - if (likely(!(flags & CLONE_NEWNS))) { + if (likely(!(flags & (CLONE_NEWNS | UNSHARE_NEWNS2 { get_mnt_ns(ns); return ns; } - old = ns->root; + if (flags & CLONE_NEWNS) + old = ns->root; + else { /* UNSHARE_NEWNS2 */ + struct path root; + + get_fs_root(current->fs, ); + if (root.mnt->mnt_root != root.dentry) { + path_put(); + return ERR_PTR(-EINVAL); /* not a mountpoint */ + } + old = real_mount(root.mnt); + path_put(); + } new_ns = alloc_mnt_ns(user_ns); if (IS_ERR(new_ns)) diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h index 34f9d73..8092e50 100644 --- a/include/uapi/linux/sched.h +++ b/include/uapi/linux/sched.h @@ -31,6 +31,14 @@ #define CLONE_IO 0x8000 /* Clone io context */ /* + * Following flags can be used only with unshare(), because + * they are intersected with CSIGNAL + */ +#define UNSHARE_NEWNS2 0x0001 /* Clone mnt namespace starting with the current task root. */ + +#define UNSHARE_FLAGS (UNSHARE_NEWNS2) + +/* * Scheduling policies */ #define SCHED_NORMAL 0 diff --git a/kernel/fork.c b/kernel/fork.c index 0cf9cdb..52f1fc0 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1381,7 +1381,12 @@ static struct task_struct *copy_process(unsigned long clone_flags, retval = copy_mm(clone_flags, p); if (retval) goto bad_fork_cleanup_signal; - retval = copy_namespaces(clone_flags, p); + + /* +* CSIGNAL and UNSHARE_FLAGS are intersected, but +* UNSHARE_FLAGS can't be used with clone(). +*/ + retval = copy_namespaces(clone_flags & ~UNSHARE_FLAGS, p); if (retval) goto bad_fork_cleanup_mm; retval = copy_io(clone_flags, p); @@ -1790,7 +1795,7 @@ static int check_unshare_flags(unsigned long unshare_flags) if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND| CLONE_VM|CLONE_FILES|CLONE_SYSVSEM| CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET| - CLONE_NEWUSER|CLONE_NEWPID)) + CLONE_NEWUSER|CLONE_NEWPID|UNSHARE_FLAGS)) return -EINVAL; /* * Not implemented, but pretend it works if there is nothing to @@ -1880,7 +1885,7 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags) /* * If unsharing namespace, must also unshare filesystem information. */ - if (unshare_flags & CLONE_NEWNS) + if (unshare_flags & (CLONE_NEWNS | UNSHARE_NEWNS2)) unshare_flags |= CLONE_FS;
[PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
From: Andrey Vagin ava...@gmail.com Currently when we create a new container with a separate root, we need to clone the current mount namespace with all mounts and then clean up it by using pivot_root(). A big part of mountpoints are cloned only to be umounted. Another problem is that rootfs can't be hidden from a container, because rootfs can't be moved or umounted. Here is an example how to get access to rootfs: fd = open(/proc/self/ns/mnt, O_RDONLY) umount2(/, MNT_DETACH); setns(fd, CLONE_NEWNS) rootfs may contain data, which should not be avaliable in CT-s. I suggest to add ability to create a mount namespace with specified mount points. A current task root can be used as a root for the new mount namespace. With this patch you can call chroot(ct-rootfs) and unshare(UNSHARE_NEWNS2) to get a clean mount namespace. UNSHARE_NEWNS2 can be used only with the unshare() syscall. The clone() syscall doesn't have unused flags. Here is an example how it looks like: $ cat ../../unshare.c int main(int argc, char **argv) { if (unshare(UNSHARE_NEWNS2)) return 1; execl(/bin/bash, /bin/bash, NULL); return 1; } $ mount --bind test/ubuntu/ test/ubuntu/ $ cd test/ubuntu/ $ chroot . $ ./unshare2 $ mount -t proc proc proc $ cat /proc/self/mountinfo 55 55 252:1 /home/avagin/test/ubuntu / rw,relatime - ext4 /dev/disk/by-uuid/d672b85f-533c-4868-9609-ca80be52d3c6 rw,errors=remount-ro,data=ordered 56 55 0:3 / /proc rw,relatime - proc proc rw Cc: Alexander Viro v...@zeniv.linux.org.uk Cc: Andrew Morton a...@linux-foundation.org Cc: Eric W. Biederman ebied...@xmission.com Cc: Cyrill Gorcunov gorcu...@openvz.org Cc: Pavel Emelyanov xe...@parallels.com Cc: Serge Hallyn serge.hal...@canonical.com Cc: Rob Landley r...@landley.net Signed-off-by: Andrey Vagin ava...@openvz.org --- fs/namespace.c | 16 ++-- include/uapi/linux/sched.h | 8 kernel/fork.c | 11 --- kernel/nsproxy.c | 2 +- 4 files changed, 31 insertions(+), 6 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 730c50e..f50a848 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2569,12 +2569,24 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns, BUG_ON(!ns); - if (likely(!(flags CLONE_NEWNS))) { + if (likely(!(flags (CLONE_NEWNS | UNSHARE_NEWNS2 { get_mnt_ns(ns); return ns; } - old = ns-root; + if (flags CLONE_NEWNS) + old = ns-root; + else { /* UNSHARE_NEWNS2 */ + struct path root; + + get_fs_root(current-fs, root); + if (root.mnt-mnt_root != root.dentry) { + path_put(root); + return ERR_PTR(-EINVAL); /* not a mountpoint */ + } + old = real_mount(root.mnt); + path_put(root); + } new_ns = alloc_mnt_ns(user_ns); if (IS_ERR(new_ns)) diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h index 34f9d73..8092e50 100644 --- a/include/uapi/linux/sched.h +++ b/include/uapi/linux/sched.h @@ -31,6 +31,14 @@ #define CLONE_IO 0x8000 /* Clone io context */ /* + * Following flags can be used only with unshare(), because + * they are intersected with CSIGNAL + */ +#define UNSHARE_NEWNS2 0x0001 /* Clone mnt namespace starting with the current task root. */ + +#define UNSHARE_FLAGS (UNSHARE_NEWNS2) + +/* * Scheduling policies */ #define SCHED_NORMAL 0 diff --git a/kernel/fork.c b/kernel/fork.c index 0cf9cdb..52f1fc0 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1381,7 +1381,12 @@ static struct task_struct *copy_process(unsigned long clone_flags, retval = copy_mm(clone_flags, p); if (retval) goto bad_fork_cleanup_signal; - retval = copy_namespaces(clone_flags, p); + + /* +* CSIGNAL and UNSHARE_FLAGS are intersected, but +* UNSHARE_FLAGS can't be used with clone(). +*/ + retval = copy_namespaces(clone_flags ~UNSHARE_FLAGS, p); if (retval) goto bad_fork_cleanup_mm; retval = copy_io(clone_flags, p); @@ -1790,7 +1795,7 @@ static int check_unshare_flags(unsigned long unshare_flags) if (unshare_flags ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND| CLONE_VM|CLONE_FILES|CLONE_SYSVSEM| CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET| - CLONE_NEWUSER|CLONE_NEWPID)) + CLONE_NEWUSER|CLONE_NEWPID|UNSHARE_FLAGS)) return -EINVAL; /* * Not implemented, but pretend it works if there is nothing to @@ -1880,7 +1885,7 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags) /* * If unsharing namespace, must also unshare
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote: Another problem is that rootfs can't be hidden from a container, because rootfs can't be moved or umounted. ... which is a bug in mntns_install(), AFAICS. Here is an example how to get access to rootfs: fd = open(/proc/self/ns/mnt, O_RDONLY) umount2(/, MNT_DETACH); setns(fd, CLONE_NEWNS) rootfs may contain data, which should not be avaliable in CT-s. Indeed. I suggest to add ability to create a mount namespace with specified mount points. A current task root can be used as a root for the new mount namespace. Yecchh... Frankly, you are opening a big can of worms - having rootfs as absolute root of all namespaces simplifies a lot of things in there. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote: On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote: Another problem is that rootfs can't be hidden from a container, because rootfs can't be moved or umounted. ... which is a bug in mntns_install(), AFAICS. Ability to get to exposed rootfs, that is. Here is an example how to get access to rootfs: fd = open(/proc/self/ns/mnt, O_RDONLY) umount2(/, MNT_DETACH); setns(fd, CLONE_NEWNS) rootfs may contain data, which should not be avaliable in CT-s. Indeed. ... and it looks like the above is what your mangled reproducer in previous patch had been made of - fd = open(/proc/self/ns/mnt, O_RDONLY) umount2(/, MNT_DETACH); setns(fd, CLONE_NEWNS) umount2(/, MNT_DETACH); IMO what it shows is setns() bug. This switch root/cwd, no matter what is wrong. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
On Tue, Oct 07, 2014 at 02:33:39PM +0100, Al Viro wrote: On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote: On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote: Another problem is that rootfs can't be hidden from a container, because rootfs can't be moved or umounted. ... which is a bug in mntns_install(), AFAICS. Ability to get to exposed rootfs, that is. Here is an example how to get access to rootfs: fd = open(/proc/self/ns/mnt, O_RDONLY) umount2(/, MNT_DETACH); setns(fd, CLONE_NEWNS) rootfs may contain data, which should not be avaliable in CT-s. Indeed. ... and it looks like the above is what your mangled reproducer in previous patch had been made of - fd = open(/proc/self/ns/mnt, O_RDONLY) umount2(/, MNT_DETACH); setns(fd, CLONE_NEWNS) umount2(/, MNT_DETACH); IMO what it shows is setns() bug. This switch root/cwd, no matter what is wrong. How can we move in another namespace without this switch? If the task root and cwd isn't switched, they points to the old tree. In this case how can we open something from the new tree? Do you mean that root should not be changed if it points on a detached mount or on a mount from another mount namespace? This looks reasonable. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
Al Viro v...@zeniv.linux.org.uk writes: 2 On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote: On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote: Another problem is that rootfs can't be hidden from a container, because rootfs can't be moved or umounted. ... which is a bug in mntns_install(), AFAICS. Ability to get to exposed rootfs, that is. The container side of this argument is pretty bogus. It only applies if user namespaces are not used for the container. So it is only root (and not root in a container) who can get to the exposed rootfs. I have a vague memory someone actually had a real use in miminal systems for being able to get back to the rootfs and being able to use rootfs as the rootfs. There was even a patch at that time that Andrew Morton was carrying for a time to allow unmounting root and get at rootfs, and to prevent the oops on rootfs unmount in some way. So not only do I not think it is a bug to get back too rootfs, I think it is a feature that some people have expressed at least half-way sane uses for. Here is an example how to get access to rootfs: fd = open(/proc/self/ns/mnt, O_RDONLY) umount2(/, MNT_DETACH); setns(fd, CLONE_NEWNS) rootfs may contain data, which should not be avaliable in CT-s. Indeed. ... and it looks like the above is what your mangled reproducer in previous patch had been made of - fd = open(/proc/self/ns/mnt, O_RDONLY) umount2(/, MNT_DETACH); setns(fd, CLONE_NEWNS) umount2(/, MNT_DETACH); IMO what it shows is setns() bug. This switch root/cwd, no matter what is wrong. IMO the bug is allowing us to unmount things that should never be unmounted. In a mount namespace created with just user namespace permissions we can't get at rootfs because MNT_LOCKED is set on the root directory and thus it can not be mounted. Further if anyone has permission to call chroot and chdir on any mount in a mount namespace (that isn't currently covered) they can get at all of them that are not currently covered. A mount namespace where no one can get at any uncovered filesystem seems to be the definition of useless and ridiculous. Now there is a bug in that MNT_DETACH today does not currently enforce MNT_LOCKED on submounts of the mount point that is detached. I am currently looking at how to construct the appropriate permission check to prevent that. Unfortunately I can not disallow MNT_DETACH with submounts all together as that breaks too many legitimate uses. That failure to enforce MNT_LOCKED is my mistake. I had a naive notion that submounts would remain mounted after a mount detach and I misread the code when I did the original work. My mistake. Eric -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
Quoting Eric W. Biederman (ebied...@xmission.com): Al Viro v...@zeniv.linux.org.uk writes: 2 On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote: On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote: Another problem is that rootfs can't be hidden from a container, because rootfs can't be moved or umounted. ... which is a bug in mntns_install(), AFAICS. Ability to get to exposed rootfs, that is. The container side of this argument is pretty bogus. It only applies if user namespaces are not used for the container. User namespaces are still far too restricted for many container use cases. We can't say we have user namespaces so now privileged containers can be ignored. Yes you never should have handed the keys to a privileged container to an untrusted person, but we do still try to protect the host from accidental damage due to a privileged container. So it is only root (and not root in a container) who can get to the exposed rootfs. I have a vague memory someone actually had a real use in miminal systems for being able to get back to the rootfs and being able to use rootfs as the rootfs. There was even a patch at that time that Andrew Morton was carrying for a time to allow unmounting root and get at rootfs, and to prevent the oops on rootfs unmount in some way. So not only do I not think it is a bug to get back too rootfs, I think it is a feature that some people have expressed at least half-way sane uses for. They can still do that if they want, using chroot :) Here is an example how to get access to rootfs: fd = open(/proc/self/ns/mnt, O_RDONLY) umount2(/, MNT_DETACH); setns(fd, CLONE_NEWNS) rootfs may contain data, which should not be avaliable in CT-s. Indeed. ... and it looks like the above is what your mangled reproducer in previous patch had been made of - fd = open(/proc/self/ns/mnt, O_RDONLY) umount2(/, MNT_DETACH); setns(fd, CLONE_NEWNS) umount2(/, MNT_DETACH); IMO what it shows is setns() bug. This switch root/cwd, no matter what is wrong. IMO the bug is allowing us to unmount things that should never be unmounted. In a mount namespace created with just user namespace permissions we can't get at rootfs because MNT_LOCKED is set on the root directory and thus it can not be mounted. Further if anyone has permission to call chroot and chdir on any mount in a mount namespace (that isn't currently covered) they can get at all of them that are not currently covered. A mount namespace where no one can get at any uncovered filesystem seems to be the definition of useless and ridiculous. Now there is a bug in that MNT_DETACH today does not currently enforce MNT_LOCKED on submounts of the mount point that is detached. I am currently looking at how to construct the appropriate permission check to prevent that. Unfortunately I can not disallow MNT_DETACH with submounts all together as that breaks too many legitimate uses. That failure to enforce MNT_LOCKED is my mistake. I had a naive notion that submounts would remain mounted after a mount detach and I misread the code when I did the original work. My mistake. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
Andrey Vagin ava...@openvz.org writes: From: Andrey Vagin ava...@gmail.com Currently when we create a new container with a separate root, we need to clone the current mount namespace with all mounts and then clean up it by using pivot_root(). A big part of mountpoints are cloned only to be umounted. Is the motivation performance? Because if that is the motivation we need numbers. Another problem is that rootfs can't be hidden from a container, because rootfs can't be moved or umounted. Here is an example how to get access to rootfs: fd = open(/proc/self/ns/mnt, O_RDONLY) umount2(/, MNT_DETACH); setns(fd, CLONE_NEWNS) rootfs may contain data, which should not be avaliable in CT-s. Well don't give those containers CAP_SYS_ADMIN. If you aren't using user namespaces there is no expectation of safety from those kinds of problems. Getting at rootfs is perfectly valid for root. I suggest to add ability to create a mount namespace with specified mount points. A current task root can be used as a root for the new mount namespace. I really don't think you are going to like the result because you will loose access to /proc and /sys. With this patch you can call chroot(ct-rootfs) and unshare(UNSHARE_NEWNS2) to get a clean mount namespace. That is a little bit of an ugly way to smuggle a parameter into the creation of a mount namespace. Further I am pretty certain this patch totally breaks the setting of new_fs-root and new_fs-pwd. In net my opinion is that the code doesn't work and does not provide sufficient justification for a new system call. UNSHARE_NEWNS2 can be used only with the unshare() syscall. The clone() syscall doesn't have unused flags. Here is an example how it looks like: $ cat ../../unshare.c int main(int argc, char **argv) { /* You left out * mount --bind /some/root/path /some/root/path * chroot /some/root/path */ if (unshare(UNSHARE_NEWNS2)) return 1; execl(/bin/bash, /bin/bash, NULL); return 1; } $ mount --bind test/ubuntu/ test/ubuntu/ $ cd test/ubuntu/ $ chroot . $ ./unshare2 $ mount -t proc proc proc $ cat /proc/self/mountinfo 55 55 252:1 /home/avagin/test/ubuntu / rw,relatime - ext4 /dev/disk/by-uuid/d672b85f-533c-4868-9609-ca80be52d3c6 rw,errors=remount-ro,data=ordered 56 55 0:3 / /proc rw,relatime - proc proc rw Cc: Alexander Viro v...@zeniv.linux.org.uk Cc: Andrew Morton a...@linux-foundation.org Cc: Eric W. Biederman ebied...@xmission.com Cc: Cyrill Gorcunov gorcu...@openvz.org Cc: Pavel Emelyanov xe...@parallels.com Cc: Serge Hallyn serge.hal...@canonical.com Cc: Rob Landley r...@landley.net Signed-off-by: Andrey Vagin ava...@openvz.org --- fs/namespace.c | 16 ++-- include/uapi/linux/sched.h | 8 kernel/fork.c | 11 --- kernel/nsproxy.c | 2 +- 4 files changed, 31 insertions(+), 6 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 730c50e..f50a848 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2569,12 +2569,24 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns, BUG_ON(!ns); - if (likely(!(flags CLONE_NEWNS))) { + if (likely(!(flags (CLONE_NEWNS | UNSHARE_NEWNS2 { get_mnt_ns(ns); return ns; } - old = ns-root; + if (flags CLONE_NEWNS) + old = ns-root; + else { /* UNSHARE_NEWNS2 */ + struct path root; + + get_fs_root(current-fs, root); + if (root.mnt-mnt_root != root.dentry) { + path_put(root); + return ERR_PTR(-EINVAL); /* not a mountpoint */ + } + old = real_mount(root.mnt); + path_put(root); + } new_ns = alloc_mnt_ns(user_ns); if (IS_ERR(new_ns)) diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h index 34f9d73..8092e50 100644 --- a/include/uapi/linux/sched.h +++ b/include/uapi/linux/sched.h @@ -31,6 +31,14 @@ #define CLONE_IO 0x8000 /* Clone io context */ /* + * Following flags can be used only with unshare(), because + * they are intersected with CSIGNAL + */ +#define UNSHARE_NEWNS2 0x0001 /* Clone mnt namespace starting with the current task root. */ + +#define UNSHARE_FLAGS(UNSHARE_NEWNS2) + +/* * Scheduling policies */ #define SCHED_NORMAL 0 diff --git a/kernel/fork.c b/kernel/fork.c index 0cf9cdb..52f1fc0 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1381,7 +1381,12 @@ static struct task_struct *copy_process(unsigned long clone_flags, retval = copy_mm(clone_flags, p); if (retval) goto bad_fork_cleanup_signal; - retval = copy_namespaces(clone_flags, p); + + /* + * CSIGNAL and UNSHARE_FLAGS are intersected, but
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
Serge Hallyn serge.hal...@ubuntu.com writes: Quoting Eric W. Biederman (ebied...@xmission.com): Al Viro v...@zeniv.linux.org.uk writes: 2 On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote: On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote: Another problem is that rootfs can't be hidden from a container, because rootfs can't be moved or umounted. ... which is a bug in mntns_install(), AFAICS. Ability to get to exposed rootfs, that is. The container side of this argument is pretty bogus. It only applies if user namespaces are not used for the container. User namespaces are still far too restricted for many container use cases. We can't say we have user namespaces so now privileged containers can be ignored. Yes you never should have handed the keys to a privileged container to an untrusted person, but we do still try to protect the host from accidental damage due to a privileged container. What I meant is that it isn't about containers. It is about something root can do. So this is not a container problem. So it is only root (and not root in a container) who can get to the exposed rootfs. I have a vague memory someone actually had a real use in miminal systems for being able to get back to the rootfs and being able to use rootfs as the rootfs. There was even a patch at that time that Andrew Morton was carrying for a time to allow unmounting root and get at rootfs, and to prevent the oops on rootfs unmount in some way. So not only do I not think it is a bug to get back too rootfs, I think it is a feature that some people have expressed at least half-way sane uses for. They can still do that if they want, using chroot :) It would take fchdir or fchroot and a directory file descriptor open on rootfs. Frequently there is no appropriate directory file descriptor. Eric -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
On Tue, Oct 7, 2014 at 1:30 PM, Eric W. Biederman ebied...@xmission.com wrote: Al Viro v...@zeniv.linux.org.uk writes: 2 On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote: On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote: Another problem is that rootfs can't be hidden from a container, because rootfs can't be moved or umounted. ... which is a bug in mntns_install(), AFAICS. Ability to get to exposed rootfs, that is. The container side of this argument is pretty bogus. It only applies if user namespaces are not used for the container. So it is only root (and not root in a container) who can get to the exposed rootfs. I have a vague memory someone actually had a real use in miminal systems for being able to get back to the rootfs and being able to use rootfs as the rootfs. There was even a patch at that time that Andrew Morton was carrying for a time to allow unmounting root and get at rootfs, and to prevent the oops on rootfs unmount in some way. So not only do I not think it is a bug to get back too rootfs, I think it is a feature that some people have expressed at least half-way sane uses for. Here is an example how to get access to rootfs: fd = open(/proc/self/ns/mnt, O_RDONLY) umount2(/, MNT_DETACH); setns(fd, CLONE_NEWNS) rootfs may contain data, which should not be avaliable in CT-s. Indeed. ... and it looks like the above is what your mangled reproducer in previous patch had been made of - fd = open(/proc/self/ns/mnt, O_RDONLY) umount2(/, MNT_DETACH); setns(fd, CLONE_NEWNS) umount2(/, MNT_DETACH); IMO what it shows is setns() bug. This switch root/cwd, no matter what is wrong. IMO the bug is allowing us to unmount things that should never be unmounted. In a mount namespace created with just user namespace permissions we can't get at rootfs because MNT_LOCKED is set on the root directory and thus it can not be mounted. Further if anyone has permission to call chroot and chdir on any mount in a mount namespace (that isn't currently covered) they can get at all of them that are not currently covered. A mount namespace where no one can get at any uncovered filesystem seems to be the definition of useless and ridiculous. Now there is a bug in that MNT_DETACH today does not currently enforce MNT_LOCKED on submounts of the mount point that is detached. I am currently looking at how to construct the appropriate permission check to prevent that. Unfortunately I can not disallow MNT_DETACH with submounts all together as that breaks too many legitimate uses. Why should MNT_LOCKED on submounts be enforced? Is it because, if you retain a reference to the detached tree, then you can see under the submounts? If so, let's fix *that*. Because otherwise the whole model of pivot_root + detach will break. Also, damn it, we need change_the_ns_root instead of pivot_root. I doubt that any container programs actually want to keep the old root attached after pivot_root. --Andy -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
Andy Lutomirski l...@amacapital.net writes: Why should MNT_LOCKED on submounts be enforced? Is it because, if you retain a reference to the detached tree, then you can see under the submounts? Yes. MNT_DETACH is a recursive operation that detaches all of the mount and all of it's submounts. Which means you can see under the submounts if you have a reference to a detached mount. If so, let's fix *that*. Because otherwise the whole model of pivot_root + detach will break. I am not certain what you are referring to. pivot_root doesn't manipulate the mount tree so you can see under anything. What I believe is the appropriate fix is to fail umount2(...,MNT_DETACH) if there are any referenced mount points being detached that have a locked submount. Also, damn it, we need change_the_ns_root instead of pivot_root. I doubt that any container programs actually want to keep the old root attached after pivot_root. Shrug. Except for chroot_fs_refs() pivot_root is a cheap. I'm not particularly in favor of merging pivot_root and umount2. The number of weird cases in the current api are high. A merged piece of code would just make them higher. I am hoping that one more round of bug fixing will at least get the bugs for having unprivileged mounts fixed in the current API. Eric -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
Quoting Eric W. Biederman (ebied...@xmission.com): Serge Hallyn serge.hal...@ubuntu.com writes: Quoting Eric W. Biederman (ebied...@xmission.com): Al Viro v...@zeniv.linux.org.uk writes: 2 On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote: On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote: Another problem is that rootfs can't be hidden from a container, because rootfs can't be moved or umounted. ... which is a bug in mntns_install(), AFAICS. Ability to get to exposed rootfs, that is. The container side of this argument is pretty bogus. It only applies if user namespaces are not used for the container. User namespaces are still far too restricted for many container use cases. We can't say we have user namespaces so now privileged containers can be ignored. Yes you never should have handed the keys to a privileged container to an untrusted person, but we do still try to protect the host from accidental damage due to a privileged container. What I meant is that it isn't about containers. It is about something root can do. So this is not a container problem. Oh, ok. Sorry, I'm getting the two thread confused anyway. I'm going to bow out here until I can pay proper attention. So it is only root (and not root in a container) who can get to the exposed rootfs. I have a vague memory someone actually had a real use in miminal systems for being able to get back to the rootfs and being able to use rootfs as the rootfs. There was even a patch at that time that Andrew Morton was carrying for a time to allow unmounting root and get at rootfs, and to prevent the oops on rootfs unmount in some way. So not only do I not think it is a bug to get back too rootfs, I think it is a feature that some people have expressed at least half-way sane uses for. They can still do that if they want, using chroot :) It would take fchdir or fchroot and a directory file descriptor open on rootfs. Frequently there is no appropriate directory file descriptor. ? you can always escape if you're simply chrooted. waterbuffalo :) -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
Quoting Andy Lutomirski (l...@amacapital.net): On Tue, Oct 7, 2014 at 1:30 PM, Eric W. Biederman ebied...@xmission.com wrote: Al Viro v...@zeniv.linux.org.uk writes: 2 On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote: On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote: Another problem is that rootfs can't be hidden from a container, because rootfs can't be moved or umounted. ... which is a bug in mntns_install(), AFAICS. Ability to get to exposed rootfs, that is. The container side of this argument is pretty bogus. It only applies if user namespaces are not used for the container. So it is only root (and not root in a container) who can get to the exposed rootfs. I have a vague memory someone actually had a real use in miminal systems for being able to get back to the rootfs and being able to use rootfs as the rootfs. There was even a patch at that time that Andrew Morton was carrying for a time to allow unmounting root and get at rootfs, and to prevent the oops on rootfs unmount in some way. So not only do I not think it is a bug to get back too rootfs, I think it is a feature that some people have expressed at least half-way sane uses for. Here is an example how to get access to rootfs: fd = open(/proc/self/ns/mnt, O_RDONLY) umount2(/, MNT_DETACH); setns(fd, CLONE_NEWNS) rootfs may contain data, which should not be avaliable in CT-s. Indeed. ... and it looks like the above is what your mangled reproducer in previous patch had been made of - fd = open(/proc/self/ns/mnt, O_RDONLY) umount2(/, MNT_DETACH); setns(fd, CLONE_NEWNS) umount2(/, MNT_DETACH); IMO what it shows is setns() bug. This switch root/cwd, no matter what is wrong. IMO the bug is allowing us to unmount things that should never be unmounted. In a mount namespace created with just user namespace permissions we can't get at rootfs because MNT_LOCKED is set on the root directory and thus it can not be mounted. Further if anyone has permission to call chroot and chdir on any mount in a mount namespace (that isn't currently covered) they can get at all of them that are not currently covered. A mount namespace where no one can get at any uncovered filesystem seems to be the definition of useless and ridiculous. Now there is a bug in that MNT_DETACH today does not currently enforce MNT_LOCKED on submounts of the mount point that is detached. I am currently looking at how to construct the appropriate permission check to prevent that. Unfortunately I can not disallow MNT_DETACH with submounts all together as that breaks too many legitimate uses. Why should MNT_LOCKED on submounts be enforced? Is it because, if you retain a reference to the detached tree, then you can see under the submounts? If so, let's fix *that*. Because otherwise the whole model of pivot_root + detach will break. Also, damn it, we need change_the_ns_root instead of pivot_root. I doubt that any container programs actually want to keep the old root attached after pivot_root. Right I think that'll fix the problem we were having, and I think Andrey said the same thing in another list a few days ago. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
On Tue, Oct 7, 2014 at 2:26 PM, Eric W. Biederman ebied...@xmission.com wrote: Andy Lutomirski l...@amacapital.net writes: Why should MNT_LOCKED on submounts be enforced? Is it because, if you retain a reference to the detached tree, then you can see under the submounts? Yes. MNT_DETACH is a recursive operation that detaches all of the mount and all of it's submounts. Which means you can see under the submounts if you have a reference to a detached mount. If so, let's fix *that*. Because otherwise the whole model of pivot_root + detach will break. I am not certain what you are referring to. pivot_root doesn't manipulate the mount tree so you can see under anything. What I believe is the appropriate fix is to fail umount2(...,MNT_DETACH) if there are any referenced mount points being detached that have a locked submount. Most of the container-using things do, roughly: Unshare userns and mountns Mount some new stuff pivot_root to the new stuff MNT_DETACH the old. That last step will almost always fail if you make this change. --Andy -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
Serge Hallyn serge.hal...@ubuntu.com writes: Quoting Eric W. Biederman (ebied...@xmission.com): What I meant is that it isn't about containers. It is about something root can do. So this is not a container problem. Oh, ok. Sorry, I'm getting the two thread confused anyway. I'm going to bow out here until I can pay proper attention. I think there is half a point here that may be legit, when you are using mount namespaces to jail applications, there may be a problem with umounting / and making it to the underlying rootfs filesystem. I am squinting and looking this way and that but while I can imagine someone more clever than I can think up some unique property of rootfs that makes it a little more exploitable than just mounting a ramfs, but since you have to be root to exploit those properties I think the game is pretty much lost. So it is only root (and not root in a container) who can get to the exposed rootfs. I have a vague memory someone actually had a real use in miminal systems for being able to get back to the rootfs and being able to use rootfs as the rootfs. There was even a patch at that time that Andrew Morton was carrying for a time to allow unmounting root and get at rootfs, and to prevent the oops on rootfs unmount in some way. So not only do I not think it is a bug to get back too rootfs, I think it is a feature that some people have expressed at least half-way sane uses for. They can still do that if they want, using chroot :) It would take fchdir or fchroot and a directory file descriptor open on rootfs. Frequently there is no appropriate directory file descriptor. ? you can always escape if you're simply chrooted. waterbuffalo :) filesystem type rootfs. Eric -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
Andy Lutomirski l...@amacapital.net writes: On Tue, Oct 7, 2014 at 2:26 PM, Eric W. Biederman ebied...@xmission.com wrote: Andy Lutomirski l...@amacapital.net writes: Why should MNT_LOCKED on submounts be enforced? Is it because, if you retain a reference to the detached tree, then you can see under the submounts? Yes. MNT_DETACH is a recursive operation that detaches all of the mount and all of it's submounts. Which means you can see under the submounts if you have a reference to a detached mount. If so, let's fix *that*. Because otherwise the whole model of pivot_root + detach will break. I am not certain what you are referring to. pivot_root doesn't manipulate the mount tree so you can see under anything. What I believe is the appropriate fix is to fail umount2(...,MNT_DETACH) if there are any referenced mount points being detached that have a locked submount. Most of the container-using things do, roughly: Unshare userns and mountns Mount some new stuff pivot_root to the new stuff MNT_DETACH the old. That last step will almost always fail if you make this change. I don't think so. I expect I could add full busy detection of normal umounts and those applications would not fail. What I am proposing is a more targeted version of busy detection that looks at each mount in the set that detach will unmount. For each mount if it is busy with non-submount references and it has at least one locked submount fail the detach with -EBUSY. Do you really think we have userspace references to the one or more of the mounts under old? Eric -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
On Tue, Oct 7, 2014 at 2:50 PM, Eric W. Biederman ebied...@xmission.com wrote: Andy Lutomirski l...@amacapital.net writes: On Tue, Oct 7, 2014 at 2:26 PM, Eric W. Biederman ebied...@xmission.com wrote: Andy Lutomirski l...@amacapital.net writes: Why should MNT_LOCKED on submounts be enforced? Is it because, if you retain a reference to the detached tree, then you can see under the submounts? Yes. MNT_DETACH is a recursive operation that detaches all of the mount and all of it's submounts. Which means you can see under the submounts if you have a reference to a detached mount. If so, let's fix *that*. Because otherwise the whole model of pivot_root + detach will break. I am not certain what you are referring to. pivot_root doesn't manipulate the mount tree so you can see under anything. What I believe is the appropriate fix is to fail umount2(...,MNT_DETACH) if there are any referenced mount points being detached that have a locked submount. Most of the container-using things do, roughly: Unshare userns and mountns Mount some new stuff pivot_root to the new stuff MNT_DETACH the old. That last step will almost always fail if you make this change. I don't think so. I expect I could add full busy detection of normal umounts and those applications would not fail. What I am proposing is a more targeted version of busy detection that looks at each mount in the set that detach will unmount. For each mount if it is busy with non-submount references and it has at least one locked submount fail the detach with -EBUSY. Do you really think we have userspace references to the one or more of the mounts under old? I suspect that we have a userspace reference to old itself. --Andy Eric -- Andy Lutomirski AMA Capital Management, LLC -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
On Tue, Oct 7, 2014 at 2:42 PM, Eric W. Biederman ebied...@xmission.com wrote: I am squinting and looking this way and that but while I can imagine someone more clever than I can think up some unique property of rootfs that makes it a little more exploitable than just mounting a ramfs, but since you have to be root to exploit those properties I think the game is pretty much lost. Yes. rootfs might not be empty, it might have totally insane permissions, and it's globally shared, which makes it into a wonderful channel to pass things around that shouldn't be passed around. Can non-root do this? You'd need to be in a userns with a / that isn't MNT_LOCKED. Can this happen on any normal setup? FWIW, I think we should unconditionally MNT_LOCKED the root on userns unshare, even if it's the only mount. So it is only root (and not root in a container) who can get to the exposed rootfs. I have a vague memory someone actually had a real use in miminal systems for being able to get back to the rootfs and being able to use rootfs as the rootfs. There was even a patch at that time that Andrew Morton was carrying for a time to allow unmounting root and get at rootfs, and to prevent the oops on rootfs unmount in some way. So not only do I not think it is a bug to get back too rootfs, I think it is a feature that some people have expressed at least half-way sane uses for. They can still do that if they want, using chroot :) It would take fchdir or fchroot and a directory file descriptor open on rootfs. Frequently there is no appropriate directory file descriptor. ? you can always escape if you're simply chrooted. waterbuffalo :) filesystem type rootfs. Eric -- To unsubscribe from this list: send the line unsubscribe linux-api in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Andy Lutomirski AMA Capital Management, LLC -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
Andy Lutomirski l...@amacapital.net writes: On Tue, Oct 7, 2014 at 2:42 PM, Eric W. Biederman ebied...@xmission.com wrote: I am squinting and looking this way and that but while I can imagine someone more clever than I can think up some unique property of rootfs that makes it a little more exploitable than just mounting a ramfs, but since you have to be root to exploit those properties I think the game is pretty much lost. Yes. rootfs might not be empty, it might have totally insane permissions, and it's globally shared, which makes it into a wonderful channel to pass things around that shouldn't be passed around. But if only root with proc mounted can reach it... I don't know. There might be a case for setting MNT_LOCKED when we overmount / as root but I don't yet see it. Can non-root do this? You'd need to be in a userns with a / that isn't MNT_LOCKED. Can this happen on any normal setup? FWIW, I think we should unconditionally MNT_LOCKED the root on userns unshare, even if it's the only mount. To the best of my knowledge MNT_LOCKED is set uncondintially on userns unshare. Eric -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
On Tue, Oct 7, 2014 at 3:42 PM, Eric W. Biederman ebied...@xmission.com wrote: Andy Lutomirski l...@amacapital.net writes: On Tue, Oct 7, 2014 at 2:42 PM, Eric W. Biederman ebied...@xmission.com wrote: I am squinting and looking this way and that but while I can imagine someone more clever than I can think up some unique property of rootfs that makes it a little more exploitable than just mounting a ramfs, but since you have to be root to exploit those properties I think the game is pretty much lost. Yes. rootfs might not be empty, it might have totally insane permissions, and it's globally shared, which makes it into a wonderful channel to pass things around that shouldn't be passed around. But if only root with proc mounted can reach it... I don't know. It doesn't have to be global root. It could be userns root. There might be a case for setting MNT_LOCKED when we overmount / as root but I don't yet see it. Can non-root do this? You'd need to be in a userns with a / that isn't MNT_LOCKED. Can this happen on any normal setup? FWIW, I think we should unconditionally MNT_LOCKED the root on userns unshare, even if it's the only mount. To the best of my knowledge MNT_LOCKED is set uncondintially on userns unshare. Only if list_empty(old-mnt_expire), whatever that means, I think. --Andy -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
Andy Lutomirski l...@amacapital.net writes: On Tue, Oct 7, 2014 at 3:42 PM, Eric W. Biederman ebied...@xmission.com wrote: Andy Lutomirski l...@amacapital.net writes: On Tue, Oct 7, 2014 at 2:42 PM, Eric W. Biederman ebied...@xmission.com wrote: I am squinting and looking this way and that but while I can imagine someone more clever than I can think up some unique property of rootfs that makes it a little more exploitable than just mounting a ramfs, but since you have to be root to exploit those properties I think the game is pretty much lost. Yes. rootfs might not be empty, it might have totally insane permissions, and it's globally shared, which makes it into a wonderful channel to pass things around that shouldn't be passed around. But if only root with proc mounted can reach it... I don't know. It doesn't have to be global root. It could be userns root. There might be a case for setting MNT_LOCKED when we overmount / as root but I don't yet see it. Can non-root do this? You'd need to be in a userns with a / that isn't MNT_LOCKED. Can this happen on any normal setup? FWIW, I think we should unconditionally MNT_LOCKED the root on userns unshare, even if it's the only mount. To the best of my knowledge MNT_LOCKED is set uncondintially on userns unshare. Only if list_empty(old-mnt_expire), whatever that means, I think. An autofs or nfs automounted mount. Can those ever become root? Eric -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
On Tue, Oct 7, 2014 at 4:42 PM, Eric W. Biederman ebied...@xmission.com wrote: Andy Lutomirski l...@amacapital.net writes: On Tue, Oct 7, 2014 at 3:42 PM, Eric W. Biederman ebied...@xmission.com wrote: Andy Lutomirski l...@amacapital.net writes: On Tue, Oct 7, 2014 at 2:42 PM, Eric W. Biederman ebied...@xmission.com wrote: I am squinting and looking this way and that but while I can imagine someone more clever than I can think up some unique property of rootfs that makes it a little more exploitable than just mounting a ramfs, but since you have to be root to exploit those properties I think the game is pretty much lost. Yes. rootfs might not be empty, it might have totally insane permissions, and it's globally shared, which makes it into a wonderful channel to pass things around that shouldn't be passed around. But if only root with proc mounted can reach it... I don't know. It doesn't have to be global root. It could be userns root. There might be a case for setting MNT_LOCKED when we overmount / as root but I don't yet see it. Can non-root do this? You'd need to be in a userns with a / that isn't MNT_LOCKED. Can this happen on any normal setup? FWIW, I think we should unconditionally MNT_LOCKED the root on userns unshare, even if it's the only mount. To the best of my knowledge MNT_LOCKED is set uncondintially on userns unshare. Only if list_empty(old-mnt_expire), whatever that means, I think. An autofs or nfs automounted mount. Can those ever become root? Dunno. I thought that this code was what set MNT_LOCKED for things that should be locked: /* Don't allow unprivileged users to reveal what is under a mount */ if ((flag CL_UNPRIVILEGED) list_empty(old-mnt_expire)) mnt-mnt.mnt_flags |= MNT_LOCKED; Now I'm confused. Shouldn't that be checking for submounts? Is that what it's doing? Anyway, I think that this should unconditionally set MNT_LOCKED on the thing that mounted on rootfs. --Andy -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
Andy Lutomirski l...@amacapital.net writes: On Tue, Oct 7, 2014 at 4:42 PM, Eric W. Biederman ebied...@xmission.com wrote: Andy Lutomirski l...@amacapital.net writes: On Tue, Oct 7, 2014 at 3:42 PM, Eric W. Biederman ebied...@xmission.com wrote: Andy Lutomirski l...@amacapital.net writes: On Tue, Oct 7, 2014 at 2:42 PM, Eric W. Biederman ebied...@xmission.com wrote: I am squinting and looking this way and that but while I can imagine someone more clever than I can think up some unique property of rootfs that makes it a little more exploitable than just mounting a ramfs, but since you have to be root to exploit those properties I think the game is pretty much lost. Yes. rootfs might not be empty, it might have totally insane permissions, and it's globally shared, which makes it into a wonderful channel to pass things around that shouldn't be passed around. But if only root with proc mounted can reach it... I don't know. It doesn't have to be global root. It could be userns root. There might be a case for setting MNT_LOCKED when we overmount / as root but I don't yet see it. Can non-root do this? You'd need to be in a userns with a / that isn't MNT_LOCKED. Can this happen on any normal setup? FWIW, I think we should unconditionally MNT_LOCKED the root on userns unshare, even if it's the only mount. To the best of my knowledge MNT_LOCKED is set uncondintially on userns unshare. Only if list_empty(old-mnt_expire), whatever that means, I think. An autofs or nfs automounted mount. Can those ever become root? Dunno. I thought that this code was what set MNT_LOCKED for things that should be locked: It does. /* Don't allow unprivileged users to reveal what is under a mount */ if ((flag CL_UNPRIVILEGED) list_empty(old-mnt_expire)) mnt-mnt.mnt_flags |= MNT_LOCKED; Now I'm confused. Shouldn't that be checking for submounts? Is that what it's doing? As it copies each mount (mostly submounts) it sets MNT_LOCKED. Anyway, I think that this should unconditionally set MNT_LOCKED on the thing that mounted on rootfs. As I read the code mnt_set_expiry is only for nfs, cifs, and afs submounts (and perhaps proc should use them). So as they are generated mnt_expiry should never start on the root of filesystem of the mount tree. Furthermore mnt_expiry is cleared when a mount is moved, and when it is bind mounted, or propagated. pivot_root does look as though it may be missing the proper mnt_expiry handling list_del_init(old-mnt_expire), but pivot_root does have proper MNT_LOCKED handling. Looking that test was slightly off and it should be: if ((flag CL_UNPRIVILEGED) (!(flag CL_EXPIRE) || list_empty(old-mnt_expire)) mnt-mnt.mnt_flags |= MNT_LOCKED; So on that note we might clear CL_EXPIRE when CL_UNPRIVILEGED is set in copy_tree but I don't see that as being really needed. Eric -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
On Tue, Oct 7, 2014 at 5:20 PM, Eric W. Biederman ebied...@xmission.com wrote: Andy Lutomirski l...@amacapital.net writes: On Tue, Oct 7, 2014 at 4:42 PM, Eric W. Biederman ebied...@xmission.com wrote: Andy Lutomirski l...@amacapital.net writes: On Tue, Oct 7, 2014 at 3:42 PM, Eric W. Biederman ebied...@xmission.com wrote: Andy Lutomirski l...@amacapital.net writes: On Tue, Oct 7, 2014 at 2:42 PM, Eric W. Biederman ebied...@xmission.com wrote: I am squinting and looking this way and that but while I can imagine someone more clever than I can think up some unique property of rootfs that makes it a little more exploitable than just mounting a ramfs, but since you have to be root to exploit those properties I think the game is pretty much lost. Yes. rootfs might not be empty, it might have totally insane permissions, and it's globally shared, which makes it into a wonderful channel to pass things around that shouldn't be passed around. But if only root with proc mounted can reach it... I don't know. It doesn't have to be global root. It could be userns root. There might be a case for setting MNT_LOCKED when we overmount / as root but I don't yet see it. Can non-root do this? You'd need to be in a userns with a / that isn't MNT_LOCKED. Can this happen on any normal setup? FWIW, I think we should unconditionally MNT_LOCKED the root on userns unshare, even if it's the only mount. To the best of my knowledge MNT_LOCKED is set uncondintially on userns unshare. Only if list_empty(old-mnt_expire), whatever that means, I think. An autofs or nfs automounted mount. Can those ever become root? Dunno. I thought that this code was what set MNT_LOCKED for things that should be locked: It does. /* Don't allow unprivileged users to reveal what is under a mount */ if ((flag CL_UNPRIVILEGED) list_empty(old-mnt_expire)) mnt-mnt.mnt_flags |= MNT_LOCKED; Now I'm confused. Shouldn't that be checking for submounts? Is that what it's doing? As it copies each mount (mostly submounts) it sets MNT_LOCKED. Oh. They're *all* MNT_LOCKED. Duh. Anyway, I think that this should unconditionally set MNT_LOCKED on the thing that mounted on rootfs. As I read the code mnt_set_expiry is only for nfs, cifs, and afs submounts (and perhaps proc should use them). So as they are generated mnt_expiry should never start on the root of filesystem of the mount tree. Furthermore mnt_expiry is cleared when a mount is moved, and when it is bind mounted, or propagated. pivot_root does look as though it may be missing the proper mnt_expiry handling list_del_init(old-mnt_expire), but pivot_root does have proper MNT_LOCKED handling. pivot_root is quite broken, as noted in my other email. It's just not broken like this, I think. :) Looking that test was slightly off and it should be: if ((flag CL_UNPRIVILEGED) (!(flag CL_EXPIRE) || list_empty(old-mnt_expire)) mnt-mnt.mnt_flags |= MNT_LOCKED; So on that note we might clear CL_EXPIRE when CL_UNPRIVILEGED is set in copy_tree but I don't see that as being really needed. Eric -- Andy Lutomirski AMA Capital Management, LLC -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/