Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-09 Thread Andrew Vagin
On Wed, Oct 08, 2014 at 12:23:52PM -0700, Eric W. Biederman wrote:
> Andy Lutomirski  writes:
> 
> > On Wed, Oct 8, 2014 at 4:08 AM, Andrew Vagin  wrote:
> >> On Tue, Oct 07, 2014 at 01:45:22PM -0700, Eric W. Biederman wrote:
> >>> Andrey Vagin  writes:
> >>>
> >>> > From: Andrey Vagin 
> >>> >
> >>> > Currently when we create a new container with a separate root,
> >>> > we need to clone the current mount namespace with all mounts and then
> >>> > clean up it by using pivot_root(). A big part of mountpoints are cloned
> >>> > only to be umounted.
> >>>
> >>> Is the motivation performance?  Because if that is the motivation we
> >>> need numbers.
> >>
> >> The major motivation to create a clean mount namespace which contains
> >> only required mounts.
> >>
> >> Now you want to convince us that there is nothing wrong if we use
> >> userns, because all inherited mounts are locked. My point is that all
> >> useless mounts should be umounted.  If the current root isn't on rootfs,
> >> pivot_root() allows us to umount all useless points. But pivot_root()
> >> doesn't work, if the current root is on rootfs. How can we umount
> >> useless points in this case?
> 
> One of your justifications for a new system call was so you could do
> less.  Doing less to get to where you want to go is only justified when
> your doing less to get better performance.
> 
> >> Maybe we want to say that rootfs should not be used if we are going to
> >> create containers...
> 
> Today it is an assumption of the vfs that rootfs is mounted.  With
> rootfs mounted and pivot_root at the base of the mount stack you can
> make as minimal of a set of mounts as the vfs allows.

You have misunderstood me.
For most system /proc/self/mountinfo looks like this:
[root@dhcp-10-30-23-214 ~]# cat /proc/self/mountinfo 
17 22 0:3 / /proc rw,relatime - proc proc rw
18 22 0:0 / /sys rw,relatime - sysfs sysfs rw
19 22 0:5 / /dev rw,relatime - devtmpfs devtmpfs 
rw,size=502324k,nr_inodes=125581,mode=755
20 19 0:11 / /dev/pts rw,relatime - devpts devpts rw,gid=5,mode=620,ptmxmode=000
21 19 0:17 / /dev/shm rw,nosuid,nodev,noexec,relatime - tmpfs tmpfs rw
22 1 253:2 / / rw,relatime - ext4 /dev/vda2 rw,barrier=1,data=ordered
24 22 253:1 / /boot rw,relatime - ext3 /dev/vda1 
rw,errors=continue,user_xattr,acl,barrier=1,data=ordered

/ isn't a rootfs mount here and pivot_root() works fine in this case. Here is
no problem for such system.

Now look at the second case:
hell@android:/ $ cat /proc/self/mountinfo
1 1 0:1 / / ro,relatime - rootfs rootfs ro
11 1 0:11 / /dev rw,nosuid,relatime - tmpfs tmpfs rw,mode=755
12 11 0:9 / /dev/pts rw,relatime - devpts devpts rw,mode=600
13 1 0:3 / /proc rw,relatime - proc proc rw
14 1 0:12 / /sys rw,relatime - sysfs sysfs rw

Now / is an rootfs mount. pivot_root() doesn't work in this case and we
need to do some tricks to get a minimal set of mounts.

Thanks,
Andrew

> 
> Removing rootfs from the vfs requires an audit of everything that
> manipulates mounts.  It is not remotely a local excercise.
> 
> One of the things that needs to be considered is that if you really want
> to audit mounts is the code that needs manipulates them needs to be
> audited every bit as much as the mounts themselves.
> 
> > Could we have an extra rootfs-like fs that is always completely empty,
> > doesn't allow any writes, and can sit at the bottom of container
> > namespace hierarchies?  If so, and if we add a new syscall that's like
> > pivot_root (or unshare) but prunes the hierarchy, then we could switch
> > to that rootfs then.
> 
> Or equally have something that guarantees that rootfs is empty and
> read-only at the time the normal root filesystem is mounted.  That is
> certainly a much more localized change if we want to go there.
> 
> I am half tempted to suggest that mount --move /some/path / be updated
> to make the old / just go away (perhaps to be replaced with a read-only
> empty rootfs).  That gets us into figuring out if we break userspace
> which is a big challenge.
> 
> Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-09 Thread Andrew Vagin
On Wed, Oct 08, 2014 at 12:23:52PM -0700, Eric W. Biederman wrote:
 Andy Lutomirski l...@amacapital.net writes:
 
  On Wed, Oct 8, 2014 at 4:08 AM, Andrew Vagin ava...@parallels.com wrote:
  On Tue, Oct 07, 2014 at 01:45:22PM -0700, Eric W. Biederman wrote:
  Andrey Vagin ava...@openvz.org writes:
 
   From: Andrey Vagin ava...@gmail.com
  
   Currently when we create a new container with a separate root,
   we need to clone the current mount namespace with all mounts and then
   clean up it by using pivot_root(). A big part of mountpoints are cloned
   only to be umounted.
 
  Is the motivation performance?  Because if that is the motivation we
  need numbers.
 
  The major motivation to create a clean mount namespace which contains
  only required mounts.
 
  Now you want to convince us that there is nothing wrong if we use
  userns, because all inherited mounts are locked. My point is that all
  useless mounts should be umounted.  If the current root isn't on rootfs,
  pivot_root() allows us to umount all useless points. But pivot_root()
  doesn't work, if the current root is on rootfs. How can we umount
  useless points in this case?
 
 One of your justifications for a new system call was so you could do
 less.  Doing less to get to where you want to go is only justified when
 your doing less to get better performance.
 
  Maybe we want to say that rootfs should not be used if we are going to
  create containers...
 
 Today it is an assumption of the vfs that rootfs is mounted.  With
 rootfs mounted and pivot_root at the base of the mount stack you can
 make as minimal of a set of mounts as the vfs allows.

You have misunderstood me.
For most system /proc/self/mountinfo looks like this:
[root@dhcp-10-30-23-214 ~]# cat /proc/self/mountinfo 
17 22 0:3 / /proc rw,relatime - proc proc rw
18 22 0:0 / /sys rw,relatime - sysfs sysfs rw
19 22 0:5 / /dev rw,relatime - devtmpfs devtmpfs 
rw,size=502324k,nr_inodes=125581,mode=755
20 19 0:11 / /dev/pts rw,relatime - devpts devpts rw,gid=5,mode=620,ptmxmode=000
21 19 0:17 / /dev/shm rw,nosuid,nodev,noexec,relatime - tmpfs tmpfs rw
22 1 253:2 / / rw,relatime - ext4 /dev/vda2 rw,barrier=1,data=ordered
24 22 253:1 / /boot rw,relatime - ext3 /dev/vda1 
rw,errors=continue,user_xattr,acl,barrier=1,data=ordered

/ isn't a rootfs mount here and pivot_root() works fine in this case. Here is
no problem for such system.

Now look at the second case:
hell@android:/ $ cat /proc/self/mountinfo
1 1 0:1 / / ro,relatime - rootfs rootfs ro
11 1 0:11 / /dev rw,nosuid,relatime - tmpfs tmpfs rw,mode=755
12 11 0:9 / /dev/pts rw,relatime - devpts devpts rw,mode=600
13 1 0:3 / /proc rw,relatime - proc proc rw
14 1 0:12 / /sys rw,relatime - sysfs sysfs rw

Now / is an rootfs mount. pivot_root() doesn't work in this case and we
need to do some tricks to get a minimal set of mounts.

Thanks,
Andrew

 
 Removing rootfs from the vfs requires an audit of everything that
 manipulates mounts.  It is not remotely a local excercise.
 
 One of the things that needs to be considered is that if you really want
 to audit mounts is the code that needs manipulates them needs to be
 audited every bit as much as the mounts themselves.
 
  Could we have an extra rootfs-like fs that is always completely empty,
  doesn't allow any writes, and can sit at the bottom of container
  namespace hierarchies?  If so, and if we add a new syscall that's like
  pivot_root (or unshare) but prunes the hierarchy, then we could switch
  to that rootfs then.
 
 Or equally have something that guarantees that rootfs is empty and
 read-only at the time the normal root filesystem is mounted.  That is
 certainly a much more localized change if we want to go there.
 
 I am half tempted to suggest that mount --move /some/path / be updated
 to make the old / just go away (perhaps to be replaced with a read-only
 empty rootfs).  That gets us into figuring out if we break userspace
 which is a big challenge.
 
 Eric
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-08 Thread Andy Lutomirski
On Wed, Oct 8, 2014 at 4:38 PM, Serge Hallyn  wrote:
> Quoting Andy Lutomirski (l...@amacapital.net):
>> On Wed, Oct 8, 2014 at 2:36 PM, Rob Landley  wrote:
>> > On 10/08/14 14:31, Andy Lutomirski wrote:
>> >> On Wed, Oct 8, 2014 at 12:23 PM, Eric W. Biederman
>> >>  wrote:
>> >>> Andy Lutomirski  writes:
>> > Maybe we want to say that rootfs should not be used if we are going to
>> > create containers...
>> >>>
>> >>> Today it is an assumption of the vfs that rootfs is mounted.  With
>> >>> rootfs mounted and pivot_root at the base of the mount stack you can
>> >>> make as minimal of a set of mounts as the vfs allows.
>> >>>
>> >>> Removing rootfs from the vfs requires an audit of everything that
>> >>> manipulates mounts.  It is not remotely a local excercise.
>> >>
>> >> Would it be a less invasive audit to allow different mount namespaces
>> >> to have different rootfses?
>> >
>> > I.E. The same way different namespaces have different init tasks?
>> >
>> > The abstraction containers has implemented here should be logically
>> > consistent.
>> >
>>  Could we have an extra rootfs-like fs that is always completely empty,
>>  doesn't allow any writes, and can sit at the bottom of container
>>  namespace hierarchies?  If so, and if we add a new syscall that's like
>>  pivot_root (or unshare) but prunes the hierarchy, then we could switch
>>  to that rootfs then.
>> >>>
>> >>> Or equally have something that guarantees that rootfs is empty and
>> >>> read-only at the time the normal root filesystem is mounted.  That is
>> >>> certainly a much more localized change if we want to go there.
>> >>>
>> >>> I am half tempted to suggest that mount --move /some/path / be updated
>> >>> to make the old / just go away (perhaps to be replaced with a read-only
>> >>> empty rootfs).  That gets us into figuring out if we break userspace
>> >>> which is a big challenge.
>> >>
>> >> Hence my argument for a new syscall or entirely new operation.
>> >
>> > I'm still waiting for somebody to explain to my why chroot() shouldn't
>> > be changed to do this instead of adding a new syscall. (At least when
>> > mount namespace support is enabled.)
>>
>> Because chroot has no effect on the namespace at all.  If you fork and
>> the child chroots, the parent isn't chrooted.  And, more importantly
>> for my example, is a process has it's cwd as /foo, and then it forks
>> and the child chroots, then parent's ".." isn't changed as a result of
>> the chroot.
>>
>> >
>> >> mount(2) and friends are way too multiplexed right now.  I just found
>> >> yet another security bug due to the insanely complicated semantics of
>> >> the vfs syscalls.  (Yes, a different one from the one yesterday.)
>> >
>> > As the guy who rewrote busybox mount 3 times, and who just implemented a
>> > brand new one (toybox) from scratch:
>> >
>> > It's a bit fiddly, yes.
>> >
>> >> A new operation kills several birds with one stone.  It could look like:
>> >>
>> >> int mntns_change_root(int dfd, const char *path, int flags);
>> >>
>> >> return -EPERM if chrooted.
>> >
>> > Really?
>>
>> Now that CVE-2014-7970 is public: what the heck is pivot_root supposed
>> to do if the caller is chrooted?  The current behavior is obviously
>> incorrect (it leaks memory), but it's not entirely clear to me what
>> should happen.  I think it should either be disallowed or should have
>> well-defined semantics.
>>
>> For simplicity, if a new syscall for this is added, then I think that
>> the caller-is-chrooted case should be disallowed.  If someone needs it
>> and can articulate what the semantics should be, then I have no
>> problem with allowing it going forward.
>
> It's not that I'd have a need for that, but rather if for some
> reason I started out chrooted due to some bogus initramfs, I'd
> prefer to not have to feel like a criminial and escape the chroot
> first.

You already can't create a userns if you're chrooted (even if you have
global privilege).

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-08 Thread Serge Hallyn
Quoting Andy Lutomirski (l...@amacapital.net):
> On Wed, Oct 8, 2014 at 2:36 PM, Rob Landley  wrote:
> > On 10/08/14 14:31, Andy Lutomirski wrote:
> >> On Wed, Oct 8, 2014 at 12:23 PM, Eric W. Biederman
> >>  wrote:
> >>> Andy Lutomirski  writes:
> > Maybe we want to say that rootfs should not be used if we are going to
> > create containers...
> >>>
> >>> Today it is an assumption of the vfs that rootfs is mounted.  With
> >>> rootfs mounted and pivot_root at the base of the mount stack you can
> >>> make as minimal of a set of mounts as the vfs allows.
> >>>
> >>> Removing rootfs from the vfs requires an audit of everything that
> >>> manipulates mounts.  It is not remotely a local excercise.
> >>
> >> Would it be a less invasive audit to allow different mount namespaces
> >> to have different rootfses?
> >
> > I.E. The same way different namespaces have different init tasks?
> >
> > The abstraction containers has implemented here should be logically
> > consistent.
> >
>  Could we have an extra rootfs-like fs that is always completely empty,
>  doesn't allow any writes, and can sit at the bottom of container
>  namespace hierarchies?  If so, and if we add a new syscall that's like
>  pivot_root (or unshare) but prunes the hierarchy, then we could switch
>  to that rootfs then.
> >>>
> >>> Or equally have something that guarantees that rootfs is empty and
> >>> read-only at the time the normal root filesystem is mounted.  That is
> >>> certainly a much more localized change if we want to go there.
> >>>
> >>> I am half tempted to suggest that mount --move /some/path / be updated
> >>> to make the old / just go away (perhaps to be replaced with a read-only
> >>> empty rootfs).  That gets us into figuring out if we break userspace
> >>> which is a big challenge.
> >>
> >> Hence my argument for a new syscall or entirely new operation.
> >
> > I'm still waiting for somebody to explain to my why chroot() shouldn't
> > be changed to do this instead of adding a new syscall. (At least when
> > mount namespace support is enabled.)
> 
> Because chroot has no effect on the namespace at all.  If you fork and
> the child chroots, the parent isn't chrooted.  And, more importantly
> for my example, is a process has it's cwd as /foo, and then it forks
> and the child chroots, then parent's ".." isn't changed as a result of
> the chroot.
> 
> >
> >> mount(2) and friends are way too multiplexed right now.  I just found
> >> yet another security bug due to the insanely complicated semantics of
> >> the vfs syscalls.  (Yes, a different one from the one yesterday.)
> >
> > As the guy who rewrote busybox mount 3 times, and who just implemented a
> > brand new one (toybox) from scratch:
> >
> > It's a bit fiddly, yes.
> >
> >> A new operation kills several birds with one stone.  It could look like:
> >>
> >> int mntns_change_root(int dfd, const char *path, int flags);
> >>
> >> return -EPERM if chrooted.
> >
> > Really?
> 
> Now that CVE-2014-7970 is public: what the heck is pivot_root supposed
> to do if the caller is chrooted?  The current behavior is obviously
> incorrect (it leaks memory), but it's not entirely clear to me what
> should happen.  I think it should either be disallowed or should have
> well-defined semantics.
> 
> For simplicity, if a new syscall for this is added, then I think that
> the caller-is-chrooted case should be disallowed.  If someone needs it
> and can articulate what the semantics should be, then I have no
> problem with allowing it going forward.

It's not that I'd have a need for that, but rather if for some
reason I started out chrooted due to some bogus initramfs, I'd
prefer to not have to feel like a criminial and escape the chroot
first.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-08 Thread Rob Landley
On 10/08/14 14:31, Andy Lutomirski wrote:
> On Wed, Oct 8, 2014 at 12:23 PM, Eric W. Biederman
>  wrote:
>> Andy Lutomirski  writes:
 Maybe we want to say that rootfs should not be used if we are going to
 create containers...
>>
>> Today it is an assumption of the vfs that rootfs is mounted.  With
>> rootfs mounted and pivot_root at the base of the mount stack you can
>> make as minimal of a set of mounts as the vfs allows.
>>
>> Removing rootfs from the vfs requires an audit of everything that
>> manipulates mounts.  It is not remotely a local excercise.
> 
> Would it be a less invasive audit to allow different mount namespaces
> to have different rootfses?

I.E. The same way different namespaces have different init tasks?

The abstraction containers has implemented here should be logically
consistent.

>>> Could we have an extra rootfs-like fs that is always completely empty,
>>> doesn't allow any writes, and can sit at the bottom of container
>>> namespace hierarchies?  If so, and if we add a new syscall that's like
>>> pivot_root (or unshare) but prunes the hierarchy, then we could switch
>>> to that rootfs then.
>>
>> Or equally have something that guarantees that rootfs is empty and
>> read-only at the time the normal root filesystem is mounted.  That is
>> certainly a much more localized change if we want to go there.
>>
>> I am half tempted to suggest that mount --move /some/path / be updated
>> to make the old / just go away (perhaps to be replaced with a read-only
>> empty rootfs).  That gets us into figuring out if we break userspace
>> which is a big challenge.
> 
> Hence my argument for a new syscall or entirely new operation.

I'm still waiting for somebody to explain to my why chroot() shouldn't
be changed to do this instead of adding a new syscall. (At least when
mount namespace support is enabled.)

> mount(2) and friends are way too multiplexed right now.  I just found
> yet another security bug due to the insanely complicated semantics of
> the vfs syscalls.  (Yes, a different one from the one yesterday.)

As the guy who rewrote busybox mount 3 times, and who just implemented a
brand new one (toybox) from scratch:

It's a bit fiddly, yes.

> A new operation kills several birds with one stone.  It could look like:
> 
> int mntns_change_root(int dfd, const char *path, int flags);
> 
> return -EPERM if chrooted.

Really?

>  Returns -EINVAL if path (relative to dfd) isn't a mountmount.

Requiring that chroot() only be called on mountpoints would break
existing semantics, which gets us back to new systemcall instead of
changing behavior of existing one.

If I recall, the first line of pushback against merging the openvz code
as is was "buckets of new syscalls". Pushback against adding a new
system call is understandable. Why can't we fix chroot() now that we
have the tools to do so?

>  Otherwise it disconnects path from the existing
> hierarchy, attaches a permanently-empty read-only rootfs under it,
> makes it the root of the mntns, and does the root refs fixup.  The old
> hierarchy gets thrown out.

We have a chroot() syscall. We don't use it for containers because it
doesn't do what we want. Does it currently do what _anybody_ wants?

> Systemd could use this, too.

While that's a strong argument against it, I'm willing to overlook it.

Rob
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-08 Thread Andy Lutomirski
On Wed, Oct 8, 2014 at 2:36 PM, Rob Landley  wrote:
> On 10/08/14 14:31, Andy Lutomirski wrote:
>> On Wed, Oct 8, 2014 at 12:23 PM, Eric W. Biederman
>>  wrote:
>>> Andy Lutomirski  writes:
> Maybe we want to say that rootfs should not be used if we are going to
> create containers...
>>>
>>> Today it is an assumption of the vfs that rootfs is mounted.  With
>>> rootfs mounted and pivot_root at the base of the mount stack you can
>>> make as minimal of a set of mounts as the vfs allows.
>>>
>>> Removing rootfs from the vfs requires an audit of everything that
>>> manipulates mounts.  It is not remotely a local excercise.
>>
>> Would it be a less invasive audit to allow different mount namespaces
>> to have different rootfses?
>
> I.E. The same way different namespaces have different init tasks?
>
> The abstraction containers has implemented here should be logically
> consistent.
>
 Could we have an extra rootfs-like fs that is always completely empty,
 doesn't allow any writes, and can sit at the bottom of container
 namespace hierarchies?  If so, and if we add a new syscall that's like
 pivot_root (or unshare) but prunes the hierarchy, then we could switch
 to that rootfs then.
>>>
>>> Or equally have something that guarantees that rootfs is empty and
>>> read-only at the time the normal root filesystem is mounted.  That is
>>> certainly a much more localized change if we want to go there.
>>>
>>> I am half tempted to suggest that mount --move /some/path / be updated
>>> to make the old / just go away (perhaps to be replaced with a read-only
>>> empty rootfs).  That gets us into figuring out if we break userspace
>>> which is a big challenge.
>>
>> Hence my argument for a new syscall or entirely new operation.
>
> I'm still waiting for somebody to explain to my why chroot() shouldn't
> be changed to do this instead of adding a new syscall. (At least when
> mount namespace support is enabled.)

Because chroot has no effect on the namespace at all.  If you fork and
the child chroots, the parent isn't chrooted.  And, more importantly
for my example, is a process has it's cwd as /foo, and then it forks
and the child chroots, then parent's ".." isn't changed as a result of
the chroot.

>
>> mount(2) and friends are way too multiplexed right now.  I just found
>> yet another security bug due to the insanely complicated semantics of
>> the vfs syscalls.  (Yes, a different one from the one yesterday.)
>
> As the guy who rewrote busybox mount 3 times, and who just implemented a
> brand new one (toybox) from scratch:
>
> It's a bit fiddly, yes.
>
>> A new operation kills several birds with one stone.  It could look like:
>>
>> int mntns_change_root(int dfd, const char *path, int flags);
>>
>> return -EPERM if chrooted.
>
> Really?

Now that CVE-2014-7970 is public: what the heck is pivot_root supposed
to do if the caller is chrooted?  The current behavior is obviously
incorrect (it leaks memory), but it's not entirely clear to me what
should happen.  I think it should either be disallowed or should have
well-defined semantics.

For simplicity, if a new syscall for this is added, then I think that
the caller-is-chrooted case should be disallowed.  If someone needs it
and can articulate what the semantics should be, then I have no
problem with allowing it going forward.

>
>>  Returns -EINVAL if path (relative to dfd) isn't a mountmount.
>
> Requiring that chroot() only be called on mountpoints would break
> existing semantics, which gets us back to new systemcall instead of
> changing behavior of existing one.

But we're talking about a pivot_root replacement.  You can always
create a bindmount.  Alternatively, the syscall could create a fresh
bind-mount and reattach all children to it if needed.

>
> If I recall, the first line of pushback against merging the openvz code
> as is was "buckets of new syscalls". Pushback against adding a new
> system call is understandable. Why can't we fix chroot() now that we
> have the tools to do so?

Because chroot already does something else.  In particularly, the new
"root"'s parents are *still there*, which is why it's so easy to
escape from a chroot.  Sensible container systems (and initramfs
code!) wants to clean up all the mounts that should be unreachable.

>
>>  Otherwise it disconnects path from the existing
>> hierarchy, attaches a permanently-empty read-only rootfs under it,
>> makes it the root of the mntns, and does the root refs fixup.  The old
>> hierarchy gets thrown out.
>
> We have a chroot() syscall. We don't use it for containers because it
> doesn't do what we want. Does it currently do what _anybody_ wants?

Irrelevant.  It's POSIX and we'll break all kinds of things if we change it.

>
>> Systemd could use this, too.
>
> While that's a strong argument against it, I'm willing to overlook it.

:)

Feel free to read that as "an initramfs /init that prepares to hand
off to /sbin/init could use this."  busybox's switch_root could do

Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-08 Thread Rob Landley
On 10/08/14 14:23, Eric W. Biederman wrote:
>> Could we have an extra rootfs-like fs that is always completely empty,
>> doesn't allow any writes, and can sit at the bottom of container
>> namespace hierarchies?  If so, and if we add a new syscall that's like
>> pivot_root (or unshare) but prunes the hierarchy, then we could switch
>> to that rootfs then.
> 
> Or equally have something that guarantees that rootfs is empty and
> read-only at the time the normal root filesystem is mounted.  That is
> certainly a much more localized change if we want to go there.

What do you mean "normal" root filesystem? It is entirely possible (and
in fact common in the embedded world) to run from rootfs. I pushed my
old inittmpfs patches (at the request of cray) last year because being
able to take down the system with "cat /dev/zero > /blah" (as rootfs
allows and tmpfs doesn't) is a bad thing.

Rootfs is about as special as PID 1 is. We don't filter out PID 1 from
"ps" to avoid confusing people, but for some reason whoever did
/proc/$PID/mountinfo decided that rootfs shouldn't show up because magic
magic specialness.

We show /run, which is a tmpfs instance. If I mount two different
filesystems on top of each other on /mnt, it shows both. (Overmounts
were not invented by rootfs.) But no, mountinfo filters out rootfs
because magic magic specialness.

It makes me sad that this kind of special-case thinking is allowed in
the kernel.

> I am half tempted to suggest that mount --move /some/path / be updated
> to make the old / just go away (perhaps to be replaced with a read-only
> empty rootfs).  That gets us into figuring out if we break userspace
> which is a big challenge.

My concern was that chroot() moving a magic "/" pointer that you can
trivially escape from with x=open("."); chroot("sub"); fdchdir(".");
chdir("../../../../../../../../.."); is having extra code in the kernel
to do it _wrong_.

We have per-process namespaces now. We can actually adjust the mount
tree (inserting a new bind mount if the directory we're changing to is
not already a mount point). If a per-process namespace needs to be
anchored by a tmpfs, fine. But requiring that to be teh SAME instance
globally for the entire system is not what containers is _about_. It's
not true for PID 1 and it shouldn't be true for rootfs.

By all means, if a filesystem is no longer accessable in a namespace,
decrement its reference count. (Keeping in mind that a bind mount should
count as a reference, and rootfs should always have a nonzero reference
count.) But "/" is not special in this regard. If you want to make all
overmounts vanish (which seems like a really bad idea and breaks 40
years of unix semantics), argue for that. Please stop treating rootfs
like it isn't potentialy usable as a full-fledged filesystem.

(Pet peeve of mine.)

> Eric

Rob
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-08 Thread Andy Lutomirski
On Wed, Oct 8, 2014 at 12:23 PM, Eric W. Biederman
 wrote:
> Andy Lutomirski  writes:
>
>> On Wed, Oct 8, 2014 at 4:08 AM, Andrew Vagin  wrote:
>>> On Tue, Oct 07, 2014 at 01:45:22PM -0700, Eric W. Biederman wrote:
 Andrey Vagin  writes:

 > From: Andrey Vagin 
 >
 > Currently when we create a new container with a separate root,
 > we need to clone the current mount namespace with all mounts and then
 > clean up it by using pivot_root(). A big part of mountpoints are cloned
 > only to be umounted.

 Is the motivation performance?  Because if that is the motivation we
 need numbers.
>>>
>>> The major motivation to create a clean mount namespace which contains
>>> only required mounts.
>>>
>>> Now you want to convince us that there is nothing wrong if we use
>>> userns, because all inherited mounts are locked. My point is that all
>>> useless mounts should be umounted.  If the current root isn't on rootfs,
>>> pivot_root() allows us to umount all useless points. But pivot_root()
>>> doesn't work, if the current root is on rootfs. How can we umount
>>> useless points in this case?
>
> One of your justifications for a new system call was so you could do
> less.  Doing less to get to where you want to go is only justified when
> your doing less to get better performance.
>
> It sounds like your actual concern is about sandboxing and security
> audits.  That is a very legitimate concern.  That isn't however the core
> concern of containers, so it was not clear that is what you meant.
>
>>> Maybe we want to say that rootfs should not be used if we are going to
>>> create containers...
>
> Today it is an assumption of the vfs that rootfs is mounted.  With
> rootfs mounted and pivot_root at the base of the mount stack you can
> make as minimal of a set of mounts as the vfs allows.
>
> Removing rootfs from the vfs requires an audit of everything that
> manipulates mounts.  It is not remotely a local excercise.

Would it be a less invasive audit to allow different mount namespaces
to have different rootfses?

>
> One of the things that needs to be considered is that if you really want
> to audit mounts is the code that needs manipulates them needs to be
> audited every bit as much as the mounts themselves.
>
>> Could we have an extra rootfs-like fs that is always completely empty,
>> doesn't allow any writes, and can sit at the bottom of container
>> namespace hierarchies?  If so, and if we add a new syscall that's like
>> pivot_root (or unshare) but prunes the hierarchy, then we could switch
>> to that rootfs then.
>
> Or equally have something that guarantees that rootfs is empty and
> read-only at the time the normal root filesystem is mounted.  That is
> certainly a much more localized change if we want to go there.
>
> I am half tempted to suggest that mount --move /some/path / be updated
> to make the old / just go away (perhaps to be replaced with a read-only
> empty rootfs).  That gets us into figuring out if we break userspace
> which is a big challenge.

Hence my argument for a new syscall or entirely new operation.
mount(2) and friends are way too multiplexed right now.  I just found
yet another security bug due to the insanely complicated semantics of
the vfs syscalls.  (Yes, a different one from the one yesterday.)

A new operation kills several birds with one stone.  It could look like:

int mntns_change_root(int dfd, const char *path, int flags);

return -EPERM if chrooted.  Returns -EINVAL if path (relative to dfd)
isn't a mountmount.  Otherwise it disconnects path from the existing
hierarchy, attaches a permanently-empty read-only rootfs under it,
makes it the root of the mntns, and does the root refs fixup.  The old
hierarchy gets thrown out.

Benefits:
 - Plausibly slightly faster.  (Especially if we add the trivial
optimization that, if the caller holds the only reference to the
mntns, then changing root refs can avoid the big loop, but maybe
pivot_root could do that, too.)
 - Sidesteps the whole rootfs mess.
 - Much easier to use than pivot_root.
 - No userspace breakage.

Heck, it could be even simpler: it could just unconditionally skip the
fs struct walk.  Just require callers to make sure that everyone else
chroots into the new root.

Systemd could use this, too.  If it wants to keep a reference to the
initramfs, it can use a file descriptor.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-08 Thread Eric W. Biederman
Andy Lutomirski  writes:

> On Wed, Oct 8, 2014 at 4:08 AM, Andrew Vagin  wrote:
>> On Tue, Oct 07, 2014 at 01:45:22PM -0700, Eric W. Biederman wrote:
>>> Andrey Vagin  writes:
>>>
>>> > From: Andrey Vagin 
>>> >
>>> > Currently when we create a new container with a separate root,
>>> > we need to clone the current mount namespace with all mounts and then
>>> > clean up it by using pivot_root(). A big part of mountpoints are cloned
>>> > only to be umounted.
>>>
>>> Is the motivation performance?  Because if that is the motivation we
>>> need numbers.
>>
>> The major motivation to create a clean mount namespace which contains
>> only required mounts.
>>
>> Now you want to convince us that there is nothing wrong if we use
>> userns, because all inherited mounts are locked. My point is that all
>> useless mounts should be umounted.  If the current root isn't on rootfs,
>> pivot_root() allows us to umount all useless points. But pivot_root()
>> doesn't work, if the current root is on rootfs. How can we umount
>> useless points in this case?

One of your justifications for a new system call was so you could do
less.  Doing less to get to where you want to go is only justified when
your doing less to get better performance.

It sounds like your actual concern is about sandboxing and security
audits.  That is a very legitimate concern.  That isn't however the core
concern of containers, so it was not clear that is what you meant.

>> Maybe we want to say that rootfs should not be used if we are going to
>> create containers...

Today it is an assumption of the vfs that rootfs is mounted.  With
rootfs mounted and pivot_root at the base of the mount stack you can
make as minimal of a set of mounts as the vfs allows.

Removing rootfs from the vfs requires an audit of everything that
manipulates mounts.  It is not remotely a local excercise.

One of the things that needs to be considered is that if you really want
to audit mounts is the code that needs manipulates them needs to be
audited every bit as much as the mounts themselves.

> Could we have an extra rootfs-like fs that is always completely empty,
> doesn't allow any writes, and can sit at the bottom of container
> namespace hierarchies?  If so, and if we add a new syscall that's like
> pivot_root (or unshare) but prunes the hierarchy, then we could switch
> to that rootfs then.

Or equally have something that guarantees that rootfs is empty and
read-only at the time the normal root filesystem is mounted.  That is
certainly a much more localized change if we want to go there.

I am half tempted to suggest that mount --move /some/path / be updated
to make the old / just go away (perhaps to be replaced with a read-only
empty rootfs).  That gets us into figuring out if we break userspace
which is a big challenge.

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-08 Thread Andy Lutomirski
On Wed, Oct 8, 2014 at 4:08 AM, Andrew Vagin  wrote:
> On Tue, Oct 07, 2014 at 01:45:22PM -0700, Eric W. Biederman wrote:
>> Andrey Vagin  writes:
>>
>> > From: Andrey Vagin 
>> >
>> > Currently when we create a new container with a separate root,
>> > we need to clone the current mount namespace with all mounts and then
>> > clean up it by using pivot_root(). A big part of mountpoints are cloned
>> > only to be umounted.
>>
>> Is the motivation performance?  Because if that is the motivation we
>> need numbers.
>
> The major motivation to create a clean mount namespace which contains
> only required mounts.
>
> Now you want to convince us that there is nothing wrong if we use
> userns, because all inherited mounts are locked. My point is that all
> useless mounts should be umounted.  If the current root isn't on rootfs,
> pivot_root() allows us to umount all useless points. But pivot_root()
> doesn't work, if the current root is on rootfs. How can we umount
> useless points in this case?
>
> Maybe we want to say that rootfs should not be used if we are going to
> create containers...
>

Could we have an extra rootfs-like fs that is always completely empty,
doesn't allow any writes, and can sit at the bottom of container
namespace hierarchies?  If so, and if we add a new syscall that's like
pivot_root (or unshare) but prunes the hierarchy, then we could switch
to that rootfs then.

> Thanks,
> Andrew
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-08 Thread Andrew Vagin
On Tue, Oct 07, 2014 at 01:45:22PM -0700, Eric W. Biederman wrote:
> Andrey Vagin  writes:
> 
> > From: Andrey Vagin 
> >
> > Currently when we create a new container with a separate root,
> > we need to clone the current mount namespace with all mounts and then
> > clean up it by using pivot_root(). A big part of mountpoints are cloned
> > only to be umounted.
> 
> Is the motivation performance?  Because if that is the motivation we
> need numbers.

The major motivation to create a clean mount namespace which contains
only required mounts.

Now you want to convince us that there is nothing wrong if we use
userns, because all inherited mounts are locked. My point is that all
useless mounts should be umounted.  If the current root isn't on rootfs,
pivot_root() allows us to umount all useless points. But pivot_root()
doesn't work, if the current root is on rootfs. How can we umount
useless points in this case?

Maybe we want to say that rootfs should not be used if we are going to
create containers...

Thanks,
Andrew

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-08 Thread Andrew Vagin
On Tue, Oct 07, 2014 at 01:45:22PM -0700, Eric W. Biederman wrote:
 Andrey Vagin ava...@openvz.org writes:
 
  From: Andrey Vagin ava...@gmail.com
 
  Currently when we create a new container with a separate root,
  we need to clone the current mount namespace with all mounts and then
  clean up it by using pivot_root(). A big part of mountpoints are cloned
  only to be umounted.
 
 Is the motivation performance?  Because if that is the motivation we
 need numbers.

The major motivation to create a clean mount namespace which contains
only required mounts.

Now you want to convince us that there is nothing wrong if we use
userns, because all inherited mounts are locked. My point is that all
useless mounts should be umounted.  If the current root isn't on rootfs,
pivot_root() allows us to umount all useless points. But pivot_root()
doesn't work, if the current root is on rootfs. How can we umount
useless points in this case?

Maybe we want to say that rootfs should not be used if we are going to
create containers...

Thanks,
Andrew

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-08 Thread Andy Lutomirski
On Wed, Oct 8, 2014 at 4:08 AM, Andrew Vagin ava...@parallels.com wrote:
 On Tue, Oct 07, 2014 at 01:45:22PM -0700, Eric W. Biederman wrote:
 Andrey Vagin ava...@openvz.org writes:

  From: Andrey Vagin ava...@gmail.com
 
  Currently when we create a new container with a separate root,
  we need to clone the current mount namespace with all mounts and then
  clean up it by using pivot_root(). A big part of mountpoints are cloned
  only to be umounted.

 Is the motivation performance?  Because if that is the motivation we
 need numbers.

 The major motivation to create a clean mount namespace which contains
 only required mounts.

 Now you want to convince us that there is nothing wrong if we use
 userns, because all inherited mounts are locked. My point is that all
 useless mounts should be umounted.  If the current root isn't on rootfs,
 pivot_root() allows us to umount all useless points. But pivot_root()
 doesn't work, if the current root is on rootfs. How can we umount
 useless points in this case?

 Maybe we want to say that rootfs should not be used if we are going to
 create containers...


Could we have an extra rootfs-like fs that is always completely empty,
doesn't allow any writes, and can sit at the bottom of container
namespace hierarchies?  If so, and if we add a new syscall that's like
pivot_root (or unshare) but prunes the hierarchy, then we could switch
to that rootfs then.

 Thanks,
 Andrew

 --
 To unsubscribe from this list: send the line unsubscribe linux-api in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-08 Thread Eric W. Biederman
Andy Lutomirski l...@amacapital.net writes:

 On Wed, Oct 8, 2014 at 4:08 AM, Andrew Vagin ava...@parallels.com wrote:
 On Tue, Oct 07, 2014 at 01:45:22PM -0700, Eric W. Biederman wrote:
 Andrey Vagin ava...@openvz.org writes:

  From: Andrey Vagin ava...@gmail.com
 
  Currently when we create a new container with a separate root,
  we need to clone the current mount namespace with all mounts and then
  clean up it by using pivot_root(). A big part of mountpoints are cloned
  only to be umounted.

 Is the motivation performance?  Because if that is the motivation we
 need numbers.

 The major motivation to create a clean mount namespace which contains
 only required mounts.

 Now you want to convince us that there is nothing wrong if we use
 userns, because all inherited mounts are locked. My point is that all
 useless mounts should be umounted.  If the current root isn't on rootfs,
 pivot_root() allows us to umount all useless points. But pivot_root()
 doesn't work, if the current root is on rootfs. How can we umount
 useless points in this case?

One of your justifications for a new system call was so you could do
less.  Doing less to get to where you want to go is only justified when
your doing less to get better performance.

It sounds like your actual concern is about sandboxing and security
audits.  That is a very legitimate concern.  That isn't however the core
concern of containers, so it was not clear that is what you meant.

 Maybe we want to say that rootfs should not be used if we are going to
 create containers...

Today it is an assumption of the vfs that rootfs is mounted.  With
rootfs mounted and pivot_root at the base of the mount stack you can
make as minimal of a set of mounts as the vfs allows.

Removing rootfs from the vfs requires an audit of everything that
manipulates mounts.  It is not remotely a local excercise.

One of the things that needs to be considered is that if you really want
to audit mounts is the code that needs manipulates them needs to be
audited every bit as much as the mounts themselves.

 Could we have an extra rootfs-like fs that is always completely empty,
 doesn't allow any writes, and can sit at the bottom of container
 namespace hierarchies?  If so, and if we add a new syscall that's like
 pivot_root (or unshare) but prunes the hierarchy, then we could switch
 to that rootfs then.

Or equally have something that guarantees that rootfs is empty and
read-only at the time the normal root filesystem is mounted.  That is
certainly a much more localized change if we want to go there.

I am half tempted to suggest that mount --move /some/path / be updated
to make the old / just go away (perhaps to be replaced with a read-only
empty rootfs).  That gets us into figuring out if we break userspace
which is a big challenge.

Eric
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-08 Thread Andy Lutomirski
On Wed, Oct 8, 2014 at 12:23 PM, Eric W. Biederman
ebied...@xmission.com wrote:
 Andy Lutomirski l...@amacapital.net writes:

 On Wed, Oct 8, 2014 at 4:08 AM, Andrew Vagin ava...@parallels.com wrote:
 On Tue, Oct 07, 2014 at 01:45:22PM -0700, Eric W. Biederman wrote:
 Andrey Vagin ava...@openvz.org writes:

  From: Andrey Vagin ava...@gmail.com
 
  Currently when we create a new container with a separate root,
  we need to clone the current mount namespace with all mounts and then
  clean up it by using pivot_root(). A big part of mountpoints are cloned
  only to be umounted.

 Is the motivation performance?  Because if that is the motivation we
 need numbers.

 The major motivation to create a clean mount namespace which contains
 only required mounts.

 Now you want to convince us that there is nothing wrong if we use
 userns, because all inherited mounts are locked. My point is that all
 useless mounts should be umounted.  If the current root isn't on rootfs,
 pivot_root() allows us to umount all useless points. But pivot_root()
 doesn't work, if the current root is on rootfs. How can we umount
 useless points in this case?

 One of your justifications for a new system call was so you could do
 less.  Doing less to get to where you want to go is only justified when
 your doing less to get better performance.

 It sounds like your actual concern is about sandboxing and security
 audits.  That is a very legitimate concern.  That isn't however the core
 concern of containers, so it was not clear that is what you meant.

 Maybe we want to say that rootfs should not be used if we are going to
 create containers...

 Today it is an assumption of the vfs that rootfs is mounted.  With
 rootfs mounted and pivot_root at the base of the mount stack you can
 make as minimal of a set of mounts as the vfs allows.

 Removing rootfs from the vfs requires an audit of everything that
 manipulates mounts.  It is not remotely a local excercise.

Would it be a less invasive audit to allow different mount namespaces
to have different rootfses?


 One of the things that needs to be considered is that if you really want
 to audit mounts is the code that needs manipulates them needs to be
 audited every bit as much as the mounts themselves.

 Could we have an extra rootfs-like fs that is always completely empty,
 doesn't allow any writes, and can sit at the bottom of container
 namespace hierarchies?  If so, and if we add a new syscall that's like
 pivot_root (or unshare) but prunes the hierarchy, then we could switch
 to that rootfs then.

 Or equally have something that guarantees that rootfs is empty and
 read-only at the time the normal root filesystem is mounted.  That is
 certainly a much more localized change if we want to go there.

 I am half tempted to suggest that mount --move /some/path / be updated
 to make the old / just go away (perhaps to be replaced with a read-only
 empty rootfs).  That gets us into figuring out if we break userspace
 which is a big challenge.

Hence my argument for a new syscall or entirely new operation.
mount(2) and friends are way too multiplexed right now.  I just found
yet another security bug due to the insanely complicated semantics of
the vfs syscalls.  (Yes, a different one from the one yesterday.)

A new operation kills several birds with one stone.  It could look like:

int mntns_change_root(int dfd, const char *path, int flags);

return -EPERM if chrooted.  Returns -EINVAL if path (relative to dfd)
isn't a mountmount.  Otherwise it disconnects path from the existing
hierarchy, attaches a permanently-empty read-only rootfs under it,
makes it the root of the mntns, and does the root refs fixup.  The old
hierarchy gets thrown out.

Benefits:
 - Plausibly slightly faster.  (Especially if we add the trivial
optimization that, if the caller holds the only reference to the
mntns, then changing root refs can avoid the big loop, but maybe
pivot_root could do that, too.)
 - Sidesteps the whole rootfs mess.
 - Much easier to use than pivot_root.
 - No userspace breakage.

Heck, it could be even simpler: it could just unconditionally skip the
fs struct walk.  Just require callers to make sure that everyone else
chroots into the new root.

Systemd could use this, too.  If it wants to keep a reference to the
initramfs, it can use a file descriptor.

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-08 Thread Rob Landley
On 10/08/14 14:23, Eric W. Biederman wrote:
 Could we have an extra rootfs-like fs that is always completely empty,
 doesn't allow any writes, and can sit at the bottom of container
 namespace hierarchies?  If so, and if we add a new syscall that's like
 pivot_root (or unshare) but prunes the hierarchy, then we could switch
 to that rootfs then.
 
 Or equally have something that guarantees that rootfs is empty and
 read-only at the time the normal root filesystem is mounted.  That is
 certainly a much more localized change if we want to go there.

What do you mean normal root filesystem? It is entirely possible (and
in fact common in the embedded world) to run from rootfs. I pushed my
old inittmpfs patches (at the request of cray) last year because being
able to take down the system with cat /dev/zero  /blah (as rootfs
allows and tmpfs doesn't) is a bad thing.

Rootfs is about as special as PID 1 is. We don't filter out PID 1 from
ps to avoid confusing people, but for some reason whoever did
/proc/$PID/mountinfo decided that rootfs shouldn't show up because magic
magic specialness.

We show /run, which is a tmpfs instance. If I mount two different
filesystems on top of each other on /mnt, it shows both. (Overmounts
were not invented by rootfs.) But no, mountinfo filters out rootfs
because magic magic specialness.

It makes me sad that this kind of special-case thinking is allowed in
the kernel.

 I am half tempted to suggest that mount --move /some/path / be updated
 to make the old / just go away (perhaps to be replaced with a read-only
 empty rootfs).  That gets us into figuring out if we break userspace
 which is a big challenge.

My concern was that chroot() moving a magic / pointer that you can
trivially escape from with x=open(.); chroot(sub); fdchdir(.);
chdir(../../../../../../../../..); is having extra code in the kernel
to do it _wrong_.

We have per-process namespaces now. We can actually adjust the mount
tree (inserting a new bind mount if the directory we're changing to is
not already a mount point). If a per-process namespace needs to be
anchored by a tmpfs, fine. But requiring that to be teh SAME instance
globally for the entire system is not what containers is _about_. It's
not true for PID 1 and it shouldn't be true for rootfs.

By all means, if a filesystem is no longer accessable in a namespace,
decrement its reference count. (Keeping in mind that a bind mount should
count as a reference, and rootfs should always have a nonzero reference
count.) But / is not special in this regard. If you want to make all
overmounts vanish (which seems like a really bad idea and breaks 40
years of unix semantics), argue for that. Please stop treating rootfs
like it isn't potentialy usable as a full-fledged filesystem.

(Pet peeve of mine.)

 Eric

Rob
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-08 Thread Andy Lutomirski
On Wed, Oct 8, 2014 at 2:36 PM, Rob Landley r...@landley.net wrote:
 On 10/08/14 14:31, Andy Lutomirski wrote:
 On Wed, Oct 8, 2014 at 12:23 PM, Eric W. Biederman
 ebied...@xmission.com wrote:
 Andy Lutomirski l...@amacapital.net writes:
 Maybe we want to say that rootfs should not be used if we are going to
 create containers...

 Today it is an assumption of the vfs that rootfs is mounted.  With
 rootfs mounted and pivot_root at the base of the mount stack you can
 make as minimal of a set of mounts as the vfs allows.

 Removing rootfs from the vfs requires an audit of everything that
 manipulates mounts.  It is not remotely a local excercise.

 Would it be a less invasive audit to allow different mount namespaces
 to have different rootfses?

 I.E. The same way different namespaces have different init tasks?

 The abstraction containers has implemented here should be logically
 consistent.

 Could we have an extra rootfs-like fs that is always completely empty,
 doesn't allow any writes, and can sit at the bottom of container
 namespace hierarchies?  If so, and if we add a new syscall that's like
 pivot_root (or unshare) but prunes the hierarchy, then we could switch
 to that rootfs then.

 Or equally have something that guarantees that rootfs is empty and
 read-only at the time the normal root filesystem is mounted.  That is
 certainly a much more localized change if we want to go there.

 I am half tempted to suggest that mount --move /some/path / be updated
 to make the old / just go away (perhaps to be replaced with a read-only
 empty rootfs).  That gets us into figuring out if we break userspace
 which is a big challenge.

 Hence my argument for a new syscall or entirely new operation.

 I'm still waiting for somebody to explain to my why chroot() shouldn't
 be changed to do this instead of adding a new syscall. (At least when
 mount namespace support is enabled.)

Because chroot has no effect on the namespace at all.  If you fork and
the child chroots, the parent isn't chrooted.  And, more importantly
for my example, is a process has it's cwd as /foo, and then it forks
and the child chroots, then parent's .. isn't changed as a result of
the chroot.


 mount(2) and friends are way too multiplexed right now.  I just found
 yet another security bug due to the insanely complicated semantics of
 the vfs syscalls.  (Yes, a different one from the one yesterday.)

 As the guy who rewrote busybox mount 3 times, and who just implemented a
 brand new one (toybox) from scratch:

 It's a bit fiddly, yes.

 A new operation kills several birds with one stone.  It could look like:

 int mntns_change_root(int dfd, const char *path, int flags);

 return -EPERM if chrooted.

 Really?

Now that CVE-2014-7970 is public: what the heck is pivot_root supposed
to do if the caller is chrooted?  The current behavior is obviously
incorrect (it leaks memory), but it's not entirely clear to me what
should happen.  I think it should either be disallowed or should have
well-defined semantics.

For simplicity, if a new syscall for this is added, then I think that
the caller-is-chrooted case should be disallowed.  If someone needs it
and can articulate what the semantics should be, then I have no
problem with allowing it going forward.


  Returns -EINVAL if path (relative to dfd) isn't a mountmount.

 Requiring that chroot() only be called on mountpoints would break
 existing semantics, which gets us back to new systemcall instead of
 changing behavior of existing one.

But we're talking about a pivot_root replacement.  You can always
create a bindmount.  Alternatively, the syscall could create a fresh
bind-mount and reattach all children to it if needed.


 If I recall, the first line of pushback against merging the openvz code
 as is was buckets of new syscalls. Pushback against adding a new
 system call is understandable. Why can't we fix chroot() now that we
 have the tools to do so?

Because chroot already does something else.  In particularly, the new
root's parents are *still there*, which is why it's so easy to
escape from a chroot.  Sensible container systems (and initramfs
code!) wants to clean up all the mounts that should be unreachable.


  Otherwise it disconnects path from the existing
 hierarchy, attaches a permanently-empty read-only rootfs under it,
 makes it the root of the mntns, and does the root refs fixup.  The old
 hierarchy gets thrown out.

 We have a chroot() syscall. We don't use it for containers because it
 doesn't do what we want. Does it currently do what _anybody_ wants?

Irrelevant.  It's POSIX and we'll break all kinds of things if we change it.


 Systemd could use this, too.

 While that's a strong argument against it, I'm willing to overlook it.

:)

Feel free to read that as an initramfs /init that prepares to hand
off to /sbin/init could use this.  busybox's switch_root could do
this, too, and even my virtme tool would indirectly benefit slightly.
(virtme uses an init system that is a few 

Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-08 Thread Rob Landley
On 10/08/14 14:31, Andy Lutomirski wrote:
 On Wed, Oct 8, 2014 at 12:23 PM, Eric W. Biederman
 ebied...@xmission.com wrote:
 Andy Lutomirski l...@amacapital.net writes:
 Maybe we want to say that rootfs should not be used if we are going to
 create containers...

 Today it is an assumption of the vfs that rootfs is mounted.  With
 rootfs mounted and pivot_root at the base of the mount stack you can
 make as minimal of a set of mounts as the vfs allows.

 Removing rootfs from the vfs requires an audit of everything that
 manipulates mounts.  It is not remotely a local excercise.
 
 Would it be a less invasive audit to allow different mount namespaces
 to have different rootfses?

I.E. The same way different namespaces have different init tasks?

The abstraction containers has implemented here should be logically
consistent.

 Could we have an extra rootfs-like fs that is always completely empty,
 doesn't allow any writes, and can sit at the bottom of container
 namespace hierarchies?  If so, and if we add a new syscall that's like
 pivot_root (or unshare) but prunes the hierarchy, then we could switch
 to that rootfs then.

 Or equally have something that guarantees that rootfs is empty and
 read-only at the time the normal root filesystem is mounted.  That is
 certainly a much more localized change if we want to go there.

 I am half tempted to suggest that mount --move /some/path / be updated
 to make the old / just go away (perhaps to be replaced with a read-only
 empty rootfs).  That gets us into figuring out if we break userspace
 which is a big challenge.
 
 Hence my argument for a new syscall or entirely new operation.

I'm still waiting for somebody to explain to my why chroot() shouldn't
be changed to do this instead of adding a new syscall. (At least when
mount namespace support is enabled.)

 mount(2) and friends are way too multiplexed right now.  I just found
 yet another security bug due to the insanely complicated semantics of
 the vfs syscalls.  (Yes, a different one from the one yesterday.)

As the guy who rewrote busybox mount 3 times, and who just implemented a
brand new one (toybox) from scratch:

It's a bit fiddly, yes.

 A new operation kills several birds with one stone.  It could look like:
 
 int mntns_change_root(int dfd, const char *path, int flags);
 
 return -EPERM if chrooted.

Really?

  Returns -EINVAL if path (relative to dfd) isn't a mountmount.

Requiring that chroot() only be called on mountpoints would break
existing semantics, which gets us back to new systemcall instead of
changing behavior of existing one.

If I recall, the first line of pushback against merging the openvz code
as is was buckets of new syscalls. Pushback against adding a new
system call is understandable. Why can't we fix chroot() now that we
have the tools to do so?

  Otherwise it disconnects path from the existing
 hierarchy, attaches a permanently-empty read-only rootfs under it,
 makes it the root of the mntns, and does the root refs fixup.  The old
 hierarchy gets thrown out.

We have a chroot() syscall. We don't use it for containers because it
doesn't do what we want. Does it currently do what _anybody_ wants?

 Systemd could use this, too.

While that's a strong argument against it, I'm willing to overlook it.

Rob
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-08 Thread Serge Hallyn
Quoting Andy Lutomirski (l...@amacapital.net):
 On Wed, Oct 8, 2014 at 2:36 PM, Rob Landley r...@landley.net wrote:
  On 10/08/14 14:31, Andy Lutomirski wrote:
  On Wed, Oct 8, 2014 at 12:23 PM, Eric W. Biederman
  ebied...@xmission.com wrote:
  Andy Lutomirski l...@amacapital.net writes:
  Maybe we want to say that rootfs should not be used if we are going to
  create containers...
 
  Today it is an assumption of the vfs that rootfs is mounted.  With
  rootfs mounted and pivot_root at the base of the mount stack you can
  make as minimal of a set of mounts as the vfs allows.
 
  Removing rootfs from the vfs requires an audit of everything that
  manipulates mounts.  It is not remotely a local excercise.
 
  Would it be a less invasive audit to allow different mount namespaces
  to have different rootfses?
 
  I.E. The same way different namespaces have different init tasks?
 
  The abstraction containers has implemented here should be logically
  consistent.
 
  Could we have an extra rootfs-like fs that is always completely empty,
  doesn't allow any writes, and can sit at the bottom of container
  namespace hierarchies?  If so, and if we add a new syscall that's like
  pivot_root (or unshare) but prunes the hierarchy, then we could switch
  to that rootfs then.
 
  Or equally have something that guarantees that rootfs is empty and
  read-only at the time the normal root filesystem is mounted.  That is
  certainly a much more localized change if we want to go there.
 
  I am half tempted to suggest that mount --move /some/path / be updated
  to make the old / just go away (perhaps to be replaced with a read-only
  empty rootfs).  That gets us into figuring out if we break userspace
  which is a big challenge.
 
  Hence my argument for a new syscall or entirely new operation.
 
  I'm still waiting for somebody to explain to my why chroot() shouldn't
  be changed to do this instead of adding a new syscall. (At least when
  mount namespace support is enabled.)
 
 Because chroot has no effect on the namespace at all.  If you fork and
 the child chroots, the parent isn't chrooted.  And, more importantly
 for my example, is a process has it's cwd as /foo, and then it forks
 and the child chroots, then parent's .. isn't changed as a result of
 the chroot.
 
 
  mount(2) and friends are way too multiplexed right now.  I just found
  yet another security bug due to the insanely complicated semantics of
  the vfs syscalls.  (Yes, a different one from the one yesterday.)
 
  As the guy who rewrote busybox mount 3 times, and who just implemented a
  brand new one (toybox) from scratch:
 
  It's a bit fiddly, yes.
 
  A new operation kills several birds with one stone.  It could look like:
 
  int mntns_change_root(int dfd, const char *path, int flags);
 
  return -EPERM if chrooted.
 
  Really?
 
 Now that CVE-2014-7970 is public: what the heck is pivot_root supposed
 to do if the caller is chrooted?  The current behavior is obviously
 incorrect (it leaks memory), but it's not entirely clear to me what
 should happen.  I think it should either be disallowed or should have
 well-defined semantics.
 
 For simplicity, if a new syscall for this is added, then I think that
 the caller-is-chrooted case should be disallowed.  If someone needs it
 and can articulate what the semantics should be, then I have no
 problem with allowing it going forward.

It's not that I'd have a need for that, but rather if for some
reason I started out chrooted due to some bogus initramfs, I'd
prefer to not have to feel like a criminial and escape the chroot
first.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-08 Thread Andy Lutomirski
On Wed, Oct 8, 2014 at 4:38 PM, Serge Hallyn serge.hal...@ubuntu.com wrote:
 Quoting Andy Lutomirski (l...@amacapital.net):
 On Wed, Oct 8, 2014 at 2:36 PM, Rob Landley r...@landley.net wrote:
  On 10/08/14 14:31, Andy Lutomirski wrote:
  On Wed, Oct 8, 2014 at 12:23 PM, Eric W. Biederman
  ebied...@xmission.com wrote:
  Andy Lutomirski l...@amacapital.net writes:
  Maybe we want to say that rootfs should not be used if we are going to
  create containers...
 
  Today it is an assumption of the vfs that rootfs is mounted.  With
  rootfs mounted and pivot_root at the base of the mount stack you can
  make as minimal of a set of mounts as the vfs allows.
 
  Removing rootfs from the vfs requires an audit of everything that
  manipulates mounts.  It is not remotely a local excercise.
 
  Would it be a less invasive audit to allow different mount namespaces
  to have different rootfses?
 
  I.E. The same way different namespaces have different init tasks?
 
  The abstraction containers has implemented here should be logically
  consistent.
 
  Could we have an extra rootfs-like fs that is always completely empty,
  doesn't allow any writes, and can sit at the bottom of container
  namespace hierarchies?  If so, and if we add a new syscall that's like
  pivot_root (or unshare) but prunes the hierarchy, then we could switch
  to that rootfs then.
 
  Or equally have something that guarantees that rootfs is empty and
  read-only at the time the normal root filesystem is mounted.  That is
  certainly a much more localized change if we want to go there.
 
  I am half tempted to suggest that mount --move /some/path / be updated
  to make the old / just go away (perhaps to be replaced with a read-only
  empty rootfs).  That gets us into figuring out if we break userspace
  which is a big challenge.
 
  Hence my argument for a new syscall or entirely new operation.
 
  I'm still waiting for somebody to explain to my why chroot() shouldn't
  be changed to do this instead of adding a new syscall. (At least when
  mount namespace support is enabled.)

 Because chroot has no effect on the namespace at all.  If you fork and
 the child chroots, the parent isn't chrooted.  And, more importantly
 for my example, is a process has it's cwd as /foo, and then it forks
 and the child chroots, then parent's .. isn't changed as a result of
 the chroot.

 
  mount(2) and friends are way too multiplexed right now.  I just found
  yet another security bug due to the insanely complicated semantics of
  the vfs syscalls.  (Yes, a different one from the one yesterday.)
 
  As the guy who rewrote busybox mount 3 times, and who just implemented a
  brand new one (toybox) from scratch:
 
  It's a bit fiddly, yes.
 
  A new operation kills several birds with one stone.  It could look like:
 
  int mntns_change_root(int dfd, const char *path, int flags);
 
  return -EPERM if chrooted.
 
  Really?

 Now that CVE-2014-7970 is public: what the heck is pivot_root supposed
 to do if the caller is chrooted?  The current behavior is obviously
 incorrect (it leaks memory), but it's not entirely clear to me what
 should happen.  I think it should either be disallowed or should have
 well-defined semantics.

 For simplicity, if a new syscall for this is added, then I think that
 the caller-is-chrooted case should be disallowed.  If someone needs it
 and can articulate what the semantics should be, then I have no
 problem with allowing it going forward.

 It's not that I'd have a need for that, but rather if for some
 reason I started out chrooted due to some bogus initramfs, I'd
 prefer to not have to feel like a criminial and escape the chroot
 first.

You already can't create a userns if you're chrooted (even if you have
global privilege).

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Andy Lutomirski
On Tue, Oct 7, 2014 at 5:20 PM, Eric W. Biederman  wrote:
> Andy Lutomirski  writes:
>
>> On Tue, Oct 7, 2014 at 4:42 PM, Eric W. Biederman  
>> wrote:
>>> Andy Lutomirski  writes:
>>>
 On Tue, Oct 7, 2014 at 3:42 PM, Eric W. Biederman  
 wrote:
> Andy Lutomirski  writes:
>
>> On Tue, Oct 7, 2014 at 2:42 PM, Eric W. Biederman 
>>  wrote:
>>>
>>> I am squinting and looking this way and that but while I can imagine
>>> someone more clever than I can think up some unique property of rootfs
>>> that makes it a little more exploitable than just mounting a ramfs,
>>> but since you have to be root to exploit those properties I think the
>>> game is pretty much lost.
>>
>> Yes.  rootfs might not be empty, it might have totally insane
>> permissions, and it's globally shared, which makes it into a wonderful
>> channel to pass things around that shouldn't be passed around.
>
> But if only root with proc mounted can reach it...  I don't know.

 It doesn't have to be global root.  It could be userns root.

> There might be a case for setting MNT_LOCKED when we overmount "/"
> as root but I don't yet see it.
>
>> Can non-root do this?  You'd need to be in a userns with a "/" that
>> isn't MNT_LOCKED.  Can this happen on any normal setup?
>>
>> FWIW, I think we should unconditionally MNT_LOCKED the root on userns
>> unshare, even if it's the only mount.
>
> To the best of my knowledge MNT_LOCKED is set uncondintially on userns
> unshare.

 Only if list_empty(>mnt_expire), whatever that means, I think.
>>>
>>> An autofs or nfs automounted mount.  Can those ever become root?
>>
>> Dunno.
>>
>> I thought that this code was what set MNT_LOCKED for things that
>> should be locked:
>
> It does.
>
>> /* Don't allow unprivileged users to reveal what is under a mount */
>> if ((flag & CL_UNPRIVILEGED) && list_empty(>mnt_expire))
>> mnt->mnt.mnt_flags |= MNT_LOCKED;
>>
>> Now I'm confused.  Shouldn't that be checking for submounts?  Is that
>> what it's doing?
>
> As it copies each mount (mostly submounts) it sets MNT_LOCKED.

Oh.  They're *all* MNT_LOCKED.  Duh.

>
>> Anyway, I think that this should unconditionally set MNT_LOCKED on the
>> thing that mounted on rootfs.
>
> As I read the code mnt_set_expiry is only for nfs, cifs, and afs
> submounts (and perhaps proc should use them). So as they are generated
> mnt_expiry should never start on the root of filesystem of the mount tree.
>
> Furthermore mnt_expiry is cleared when a mount is moved, and when
> it is bind mounted, or propagated.
>
> pivot_root does look as though it may be missing the proper mnt_expiry
> handling list_del_init(>mnt_expire), but pivot_root does have
> proper MNT_LOCKED handling.

pivot_root is quite broken, as noted in my other email.  It's just not
broken like this, I think. :)

>
> Looking that test was slightly off and it should be:
> if ((flag & CL_UNPRIVILEGED) &&
> (!(flag & CL_EXPIRE) || list_empty(>mnt_expire))
> mnt->mnt.mnt_flags |= MNT_LOCKED;
>
> So on that note we might clear CL_EXPIRE when CL_UNPRIVILEGED is set
> in copy_tree but I don't see that as being really needed.
>
> Eric



-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Eric W. Biederman
Andy Lutomirski  writes:

> On Tue, Oct 7, 2014 at 4:42 PM, Eric W. Biederman  
> wrote:
>> Andy Lutomirski  writes:
>>
>>> On Tue, Oct 7, 2014 at 3:42 PM, Eric W. Biederman  
>>> wrote:
 Andy Lutomirski  writes:

> On Tue, Oct 7, 2014 at 2:42 PM, Eric W. Biederman  
> wrote:
>>
>> I am squinting and looking this way and that but while I can imagine
>> someone more clever than I can think up some unique property of rootfs
>> that makes it a little more exploitable than just mounting a ramfs,
>> but since you have to be root to exploit those properties I think the
>> game is pretty much lost.
>
> Yes.  rootfs might not be empty, it might have totally insane
> permissions, and it's globally shared, which makes it into a wonderful
> channel to pass things around that shouldn't be passed around.

 But if only root with proc mounted can reach it...  I don't know.
>>>
>>> It doesn't have to be global root.  It could be userns root.
>>>
 There might be a case for setting MNT_LOCKED when we overmount "/"
 as root but I don't yet see it.

> Can non-root do this?  You'd need to be in a userns with a "/" that
> isn't MNT_LOCKED.  Can this happen on any normal setup?
>
> FWIW, I think we should unconditionally MNT_LOCKED the root on userns
> unshare, even if it's the only mount.

 To the best of my knowledge MNT_LOCKED is set uncondintially on userns
 unshare.
>>>
>>> Only if list_empty(>mnt_expire), whatever that means, I think.
>>
>> An autofs or nfs automounted mount.  Can those ever become root?
>
> Dunno.
>
> I thought that this code was what set MNT_LOCKED for things that
> should be locked:

It does.

> /* Don't allow unprivileged users to reveal what is under a mount */
> if ((flag & CL_UNPRIVILEGED) && list_empty(>mnt_expire))
> mnt->mnt.mnt_flags |= MNT_LOCKED;
>
> Now I'm confused.  Shouldn't that be checking for submounts?  Is that
> what it's doing?

As it copies each mount (mostly submounts) it sets MNT_LOCKED.

> Anyway, I think that this should unconditionally set MNT_LOCKED on the
> thing that mounted on rootfs.

As I read the code mnt_set_expiry is only for nfs, cifs, and afs
submounts (and perhaps proc should use them). So as they are generated
mnt_expiry should never start on the root of filesystem of the mount tree.

Furthermore mnt_expiry is cleared when a mount is moved, and when
it is bind mounted, or propagated.

pivot_root does look as though it may be missing the proper mnt_expiry
handling list_del_init(>mnt_expire), but pivot_root does have
proper MNT_LOCKED handling.

Looking that test was slightly off and it should be:
if ((flag & CL_UNPRIVILEGED) &&
(!(flag & CL_EXPIRE) || list_empty(>mnt_expire))
mnt->mnt.mnt_flags |= MNT_LOCKED;

So on that note we might clear CL_EXPIRE when CL_UNPRIVILEGED is set
in copy_tree but I don't see that as being really needed.

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Andy Lutomirski
On Tue, Oct 7, 2014 at 4:42 PM, Eric W. Biederman  wrote:
> Andy Lutomirski  writes:
>
>> On Tue, Oct 7, 2014 at 3:42 PM, Eric W. Biederman  
>> wrote:
>>> Andy Lutomirski  writes:
>>>
 On Tue, Oct 7, 2014 at 2:42 PM, Eric W. Biederman  
 wrote:
>
> I am squinting and looking this way and that but while I can imagine
> someone more clever than I can think up some unique property of rootfs
> that makes it a little more exploitable than just mounting a ramfs,
> but since you have to be root to exploit those properties I think the
> game is pretty much lost.

 Yes.  rootfs might not be empty, it might have totally insane
 permissions, and it's globally shared, which makes it into a wonderful
 channel to pass things around that shouldn't be passed around.
>>>
>>> But if only root with proc mounted can reach it...  I don't know.
>>
>> It doesn't have to be global root.  It could be userns root.
>>
>>> There might be a case for setting MNT_LOCKED when we overmount "/"
>>> as root but I don't yet see it.
>>>
 Can non-root do this?  You'd need to be in a userns with a "/" that
 isn't MNT_LOCKED.  Can this happen on any normal setup?

 FWIW, I think we should unconditionally MNT_LOCKED the root on userns
 unshare, even if it's the only mount.
>>>
>>> To the best of my knowledge MNT_LOCKED is set uncondintially on userns
>>> unshare.
>>
>> Only if list_empty(>mnt_expire), whatever that means, I think.
>
> An autofs or nfs automounted mount.  Can those ever become root?

Dunno.

I thought that this code was what set MNT_LOCKED for things that
should be locked:

/* Don't allow unprivileged users to reveal what is under a mount */
if ((flag & CL_UNPRIVILEGED) && list_empty(>mnt_expire))
mnt->mnt.mnt_flags |= MNT_LOCKED;

Now I'm confused.  Shouldn't that be checking for submounts?  Is that
what it's doing?

Anyway, I think that this should unconditionally set MNT_LOCKED on the
thing that mounted on rootfs.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Eric W. Biederman
Andy Lutomirski  writes:

> On Tue, Oct 7, 2014 at 3:42 PM, Eric W. Biederman  
> wrote:
>> Andy Lutomirski  writes:
>>
>>> On Tue, Oct 7, 2014 at 2:42 PM, Eric W. Biederman  
>>> wrote:

 I am squinting and looking this way and that but while I can imagine
 someone more clever than I can think up some unique property of rootfs
 that makes it a little more exploitable than just mounting a ramfs,
 but since you have to be root to exploit those properties I think the
 game is pretty much lost.
>>>
>>> Yes.  rootfs might not be empty, it might have totally insane
>>> permissions, and it's globally shared, which makes it into a wonderful
>>> channel to pass things around that shouldn't be passed around.
>>
>> But if only root with proc mounted can reach it...  I don't know.
>
> It doesn't have to be global root.  It could be userns root.
>
>> There might be a case for setting MNT_LOCKED when we overmount "/"
>> as root but I don't yet see it.
>>
>>> Can non-root do this?  You'd need to be in a userns with a "/" that
>>> isn't MNT_LOCKED.  Can this happen on any normal setup?
>>>
>>> FWIW, I think we should unconditionally MNT_LOCKED the root on userns
>>> unshare, even if it's the only mount.
>>
>> To the best of my knowledge MNT_LOCKED is set uncondintially on userns
>> unshare.
>
> Only if list_empty(>mnt_expire), whatever that means, I think.

An autofs or nfs automounted mount.  Can those ever become root?

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Andy Lutomirski
On Tue, Oct 7, 2014 at 3:42 PM, Eric W. Biederman  wrote:
> Andy Lutomirski  writes:
>
>> On Tue, Oct 7, 2014 at 2:42 PM, Eric W. Biederman  
>> wrote:
>>>
>>> I am squinting and looking this way and that but while I can imagine
>>> someone more clever than I can think up some unique property of rootfs
>>> that makes it a little more exploitable than just mounting a ramfs,
>>> but since you have to be root to exploit those properties I think the
>>> game is pretty much lost.
>>
>> Yes.  rootfs might not be empty, it might have totally insane
>> permissions, and it's globally shared, which makes it into a wonderful
>> channel to pass things around that shouldn't be passed around.
>
> But if only root with proc mounted can reach it...  I don't know.

It doesn't have to be global root.  It could be userns root.

> There might be a case for setting MNT_LOCKED when we overmount "/"
> as root but I don't yet see it.
>
>> Can non-root do this?  You'd need to be in a userns with a "/" that
>> isn't MNT_LOCKED.  Can this happen on any normal setup?
>>
>> FWIW, I think we should unconditionally MNT_LOCKED the root on userns
>> unshare, even if it's the only mount.
>
> To the best of my knowledge MNT_LOCKED is set uncondintially on userns
> unshare.

Only if list_empty(>mnt_expire), whatever that means, I think.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Eric W. Biederman
Andy Lutomirski  writes:

> On Tue, Oct 7, 2014 at 2:42 PM, Eric W. Biederman  
> wrote:
>>
>> I am squinting and looking this way and that but while I can imagine
>> someone more clever than I can think up some unique property of rootfs
>> that makes it a little more exploitable than just mounting a ramfs,
>> but since you have to be root to exploit those properties I think the
>> game is pretty much lost.
>
> Yes.  rootfs might not be empty, it might have totally insane
> permissions, and it's globally shared, which makes it into a wonderful
> channel to pass things around that shouldn't be passed around.

But if only root with proc mounted can reach it...  I don't know.
There might be a case for setting MNT_LOCKED when we overmount "/"
as root but I don't yet see it.

> Can non-root do this?  You'd need to be in a userns with a "/" that
> isn't MNT_LOCKED.  Can this happen on any normal setup?
>
> FWIW, I think we should unconditionally MNT_LOCKED the root on userns
> unshare, even if it's the only mount.

To the best of my knowledge MNT_LOCKED is set uncondintially on userns
unshare.

Eric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Andy Lutomirski
On Tue, Oct 7, 2014 at 2:42 PM, Eric W. Biederman  wrote:
>
> I am squinting and looking this way and that but while I can imagine
> someone more clever than I can think up some unique property of rootfs
> that makes it a little more exploitable than just mounting a ramfs,
> but since you have to be root to exploit those properties I think the
> game is pretty much lost.

Yes.  rootfs might not be empty, it might have totally insane
permissions, and it's globally shared, which makes it into a wonderful
channel to pass things around that shouldn't be passed around.

Can non-root do this?  You'd need to be in a userns with a "/" that
isn't MNT_LOCKED.  Can this happen on any normal setup?

FWIW, I think we should unconditionally MNT_LOCKED the root on userns
unshare, even if it's the only mount.


>
>>> >> So it is only root (and not root in a container) who can get to the
>>> >> exposed rootfs.
>>> >>
>>> >> I have a vague memory someone actually had a real use in miminal systems
>>> >> for being able to get back to the rootfs and being able to use rootfs as
>>> >> the rootfs.  There was even a patch at that time that Andrew Morton was
>>> >> carrying for a time to allow unmounting root and get at rootfs, and to
>>> >> prevent the oops on rootfs unmount in some way.
>>> >>
>>> >> So not only do I not think it is a bug to get back too rootfs, I think
>>> >> it is a feature that some people have expressed at least half-way sane
>>> >> uses for.
>>> >
>>> > They can still do that if they want, using chroot :)
>>>
>>> It would take fchdir or fchroot and a directory file descriptor open on
>>> rootfs.  Frequently there is no appropriate directory file descriptor.
>>
>> ?  you can always escape if you're simply chrooted.  waterbuffalo :)
>
> filesystem type rootfs.
>
> Eric
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Andy Lutomirski
On Tue, Oct 7, 2014 at 2:50 PM, Eric W. Biederman  wrote:
> Andy Lutomirski  writes:
>
>> On Tue, Oct 7, 2014 at 2:26 PM, Eric W. Biederman  
>> wrote:
>>> Andy Lutomirski  writes:
>>>
 Why should MNT_LOCKED on submounts be enforced?

 Is it because, if you retain a reference to the detached tree, then
 you can see under the submounts?
>>>
>>> Yes. MNT_DETACH is a recursive operation that detaches all of the mount
>>> and all of it's submounts.  Which means you can see under the submounts
>>> if you have a reference to a detached mount.
>>>
 If so, let's fix *that*.  Because
 otherwise the whole model of pivot_root + detach will break.
>>>
>>> I am not certain what you are referring to.  pivot_root doesn't
>>> manipulate the mount tree so you can see under anything.
>>>
>>> What I believe is the appropriate fix is to fail umount2(...,MNT_DETACH)
>>> if there are any referenced mount points being detached that have a
>>> locked submount.
>>
>> Most of the container-using things do, roughly:
>>
>> Unshare userns and mountns
>> Mount some new stuff
>> pivot_root to the new stuff
>> MNT_DETACH the old.
>>
>> That last step will almost always fail if you make this change.
>
> I don't think so.
>
> I expect I could add full busy detection of normal umounts and those
> applications would not fail.
>
> What I am proposing is a more targeted version of busy detection that
> looks at each mount in the set that detach will unmount.  For each mount
> if it is busy with non-submount references and it has at least one
> locked submount fail the detach with -EBUSY.
>
> Do you really think we have userspace references to the one or more of the
> mounts under old?
>

I suspect that we have a userspace reference to old itself.

--Andy

> Eric



-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Eric W. Biederman
Andy Lutomirski  writes:

> On Tue, Oct 7, 2014 at 2:26 PM, Eric W. Biederman  
> wrote:
>> Andy Lutomirski  writes:
>>
>>> Why should MNT_LOCKED on submounts be enforced?
>>>
>>> Is it because, if you retain a reference to the detached tree, then
>>> you can see under the submounts?
>>
>> Yes. MNT_DETACH is a recursive operation that detaches all of the mount
>> and all of it's submounts.  Which means you can see under the submounts
>> if you have a reference to a detached mount.
>>
>>> If so, let's fix *that*.  Because
>>> otherwise the whole model of pivot_root + detach will break.
>>
>> I am not certain what you are referring to.  pivot_root doesn't
>> manipulate the mount tree so you can see under anything.
>>
>> What I believe is the appropriate fix is to fail umount2(...,MNT_DETACH)
>> if there are any referenced mount points being detached that have a
>> locked submount.
>
> Most of the container-using things do, roughly:
>
> Unshare userns and mountns
> Mount some new stuff
> pivot_root to the new stuff
> MNT_DETACH the old.
>
> That last step will almost always fail if you make this change.

I don't think so.

I expect I could add full busy detection of normal umounts and those
applications would not fail.

What I am proposing is a more targeted version of busy detection that
looks at each mount in the set that detach will unmount.  For each mount
if it is busy with non-submount references and it has at least one
locked submount fail the detach with -EBUSY.

Do you really think we have userspace references to the one or more of the
mounts under old?

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Eric W. Biederman
Serge Hallyn  writes:

> Quoting Eric W. Biederman (ebied...@xmission.com):

>> What I meant is that it isn't about containers.   It is about something
>> root can do.  So this is not a "container" problem.
>
> Oh, ok.  
>
> Sorry, I'm getting the two thread confused anyway.  I'm going to bow out
> here until I can pay proper attention.

I think there is half a point here that may be legit, when you are using
mount namespaces to jail applications, there may be a problem
with umounting / and making it to the underlying rootfs filesystem.

I am squinting and looking this way and that but while I can imagine
someone more clever than I can think up some unique property of rootfs
that makes it a little more exploitable than just mounting a ramfs,
but since you have to be root to exploit those properties I think the
game is pretty much lost.

>> >> So it is only root (and not root in a container) who can get to the
>> >> exposed rootfs.
>> >> 
>> >> I have a vague memory someone actually had a real use in miminal systems
>> >> for being able to get back to the rootfs and being able to use rootfs as
>> >> the rootfs.  There was even a patch at that time that Andrew Morton was
>> >> carrying for a time to allow unmounting root and get at rootfs, and to
>> >> prevent the oops on rootfs unmount in some way.
>> >> 
>> >> So not only do I not think it is a bug to get back too rootfs, I think
>> >> it is a feature that some people have expressed at least half-way sane
>> >> uses for.
>> >
>> > They can still do that if they want, using chroot :)
>> 
>> It would take fchdir or fchroot and a directory file descriptor open on
>> rootfs.  Frequently there is no appropriate directory file descriptor.
>
> ?  you can always escape if you're simply chrooted.  waterbuffalo :)

filesystem type rootfs.

Eric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Andy Lutomirski
On Tue, Oct 7, 2014 at 2:26 PM, Eric W. Biederman  wrote:
> Andy Lutomirski  writes:
>
>> Why should MNT_LOCKED on submounts be enforced?
>>
>> Is it because, if you retain a reference to the detached tree, then
>> you can see under the submounts?
>
> Yes. MNT_DETACH is a recursive operation that detaches all of the mount
> and all of it's submounts.  Which means you can see under the submounts
> if you have a reference to a detached mount.
>
>> If so, let's fix *that*.  Because
>> otherwise the whole model of pivot_root + detach will break.
>
> I am not certain what you are referring to.  pivot_root doesn't
> manipulate the mount tree so you can see under anything.
>
> What I believe is the appropriate fix is to fail umount2(...,MNT_DETACH)
> if there are any referenced mount points being detached that have a
> locked submount.

Most of the container-using things do, roughly:

Unshare userns and mountns
Mount some new stuff
pivot_root to the new stuff
MNT_DETACH the old.

That last step will almost always fail if you make this change.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Serge Hallyn
Quoting Andy Lutomirski (l...@amacapital.net):
> On Tue, Oct 7, 2014 at 1:30 PM, Eric W. Biederman  
> wrote:
> > Al Viro  writes:
> >
> > 2> On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote:
> >>> On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote:
> >>> > Another problem is that rootfs can't be hidden from a container, because
> >>> > rootfs can't be moved or umounted.
> >>>
> >>> ... which is a bug in mntns_install(), AFAICS.
> >>
> >> Ability to get to exposed rootfs, that is.
> >
> > The container side of this argument is pretty bogus.  It only applies
> > if user namespaces are not used for the container.
> >
> > So it is only root (and not root in a container) who can get to the
> > exposed rootfs.
> >
> > I have a vague memory someone actually had a real use in miminal systems
> > for being able to get back to the rootfs and being able to use rootfs as
> > the rootfs.  There was even a patch at that time that Andrew Morton was
> > carrying for a time to allow unmounting root and get at rootfs, and to
> > prevent the oops on rootfs unmount in some way.
> >
> > So not only do I not think it is a bug to get back too rootfs, I think
> > it is a feature that some people have expressed at least half-way sane
> > uses for.
> >
> >>> > Here is an example how to get access to rootfs:
> >>> > fd = open("/proc/self/ns/mnt", O_RDONLY)
> >>> > umount2("/", MNT_DETACH);
> >>> > setns(fd, CLONE_NEWNS)
> >>> >
> >>> > rootfs may contain data, which should not be avaliable in CT-s.
> >>>
> >>> Indeed.
> >>
> >> ... and it looks like the above is what your mangled reproducer in previous
> >> patch had been made of -
> >>   fd = open("/proc/self/ns/mnt", O_RDONLY)
> >>   umount2("/", MNT_DETACH);
> >>   setns(fd, CLONE_NEWNS)
> >>   umount2("/", MNT_DETACH);
> >>
> >> IMO what it shows is setns() bug.  This "switch root/cwd, no matter what"
> >> is wrong.
> >
> > IMO the bug is allowing us to unmount things that should never be unmounted.
> >
> > In a mount namespace created with just user namespace permissions we
> > can't get at rootfs because MNT_LOCKED is set on the root directory
> > and thus it can not be mounted.
> >
> > Further if anyone has permission to call chroot and chdir on any mount
> > in a mount namespace (that isn't currently covered) they can get at all
> > of them that are not currently covered.  A mount namespace where no one
> > can get at any uncovered filesystem seems to be the definition of
> > useless and ridiculous.
> >
> >
> > Now there is a bug in that MNT_DETACH today does not currently enforce
> > MNT_LOCKED on submounts of the mount point that is detached. I am
> > currently looking at how to construct the appropriate permission check
> > to prevent that.  Unfortunately I can not disallow MNT_DETACH with
> > submounts all together as that breaks too many legitimate uses.
> 
> Why should MNT_LOCKED on submounts be enforced?
> 
> Is it because, if you retain a reference to the detached tree, then
> you can see under the submounts?  If so, let's fix *that*.  Because
> otherwise the whole model of pivot_root + detach will break.
> 
> Also, damn it, we need change_the_ns_root instead of pivot_root.  I
> doubt that any container programs actually want to keep the old root
> attached after pivot_root.

Right I think that'll fix the problem we were having, and I think
Andrey said the same thing in another list a few days ago.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Serge Hallyn
Quoting Eric W. Biederman (ebied...@xmission.com):
> Serge Hallyn  writes:
> 
> > Quoting Eric W. Biederman (ebied...@xmission.com):
> >> Al Viro  writes:
> >> 
> >> 2> On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote:
> >> >> On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote:
> >> >> > Another problem is that rootfs can't be hidden from a container, 
> >> >> > because
> >> >> > rootfs can't be moved or umounted.
> >> >> 
> >> >> ... which is a bug in mntns_install(), AFAICS.
> >> >
> >> > Ability to get to exposed rootfs, that is.
> >> 
> >> The container side of this argument is pretty bogus.  It only applies
> >> if user namespaces are not used for the container.
> >
> > User namespaces are still far too restricted for many container use
> > cases.  We can't say "we have user namespaces so now privileged
> > containers can be ignored".  Yes you never should have handed the
> > keys to a privileged container to an untrusted person, but we do
> > still try to protect the host from accidental damage due to a
> > privileged container.
> 
> What I meant is that it isn't about containers.   It is about something
> root can do.  So this is not a "container" problem.

Oh, ok.  

Sorry, I'm getting the two thread confused anyway.  I'm going to bow out
here until I can pay proper attention.

> >> So it is only root (and not root in a container) who can get to the
> >> exposed rootfs.
> >> 
> >> I have a vague memory someone actually had a real use in miminal systems
> >> for being able to get back to the rootfs and being able to use rootfs as
> >> the rootfs.  There was even a patch at that time that Andrew Morton was
> >> carrying for a time to allow unmounting root and get at rootfs, and to
> >> prevent the oops on rootfs unmount in some way.
> >> 
> >> So not only do I not think it is a bug to get back too rootfs, I think
> >> it is a feature that some people have expressed at least half-way sane
> >> uses for.
> >
> > They can still do that if they want, using chroot :)
> 
> It would take fchdir or fchroot and a directory file descriptor open on
> rootfs.  Frequently there is no appropriate directory file descriptor.

?  you can always escape if you're simply chrooted.  waterbuffalo :)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Eric W. Biederman
Andy Lutomirski  writes:

> Why should MNT_LOCKED on submounts be enforced?
>
> Is it because, if you retain a reference to the detached tree, then
> you can see under the submounts?  

Yes. MNT_DETACH is a recursive operation that detaches all of the mount
and all of it's submounts.  Which means you can see under the submounts
if you have a reference to a detached mount.

> If so, let's fix *that*.  Because
> otherwise the whole model of pivot_root + detach will break.

I am not certain what you are referring to.  pivot_root doesn't
manipulate the mount tree so you can see under anything.

What I believe is the appropriate fix is to fail umount2(...,MNT_DETACH)
if there are any referenced mount points being detached that have a
locked submount.

> Also, damn it, we need change_the_ns_root instead of pivot_root.  I
> doubt that any container programs actually want to keep the old root
> attached after pivot_root.

Shrug.  Except for chroot_fs_refs() pivot_root is a cheap.

I'm not particularly in favor of merging pivot_root and umount2.  The
number of weird cases in the current api are high.  A merged piece of
code would just make them higher.

I am hoping that one more round of bug fixing will at least get the bugs
for having unprivileged mounts fixed in the current API.

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Andy Lutomirski
On Tue, Oct 7, 2014 at 1:30 PM, Eric W. Biederman  wrote:
> Al Viro  writes:
>
> 2> On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote:
>>> On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote:
>>> > Another problem is that rootfs can't be hidden from a container, because
>>> > rootfs can't be moved or umounted.
>>>
>>> ... which is a bug in mntns_install(), AFAICS.
>>
>> Ability to get to exposed rootfs, that is.
>
> The container side of this argument is pretty bogus.  It only applies
> if user namespaces are not used for the container.
>
> So it is only root (and not root in a container) who can get to the
> exposed rootfs.
>
> I have a vague memory someone actually had a real use in miminal systems
> for being able to get back to the rootfs and being able to use rootfs as
> the rootfs.  There was even a patch at that time that Andrew Morton was
> carrying for a time to allow unmounting root and get at rootfs, and to
> prevent the oops on rootfs unmount in some way.
>
> So not only do I not think it is a bug to get back too rootfs, I think
> it is a feature that some people have expressed at least half-way sane
> uses for.
>
>>> > Here is an example how to get access to rootfs:
>>> > fd = open("/proc/self/ns/mnt", O_RDONLY)
>>> > umount2("/", MNT_DETACH);
>>> > setns(fd, CLONE_NEWNS)
>>> >
>>> > rootfs may contain data, which should not be avaliable in CT-s.
>>>
>>> Indeed.
>>
>> ... and it looks like the above is what your mangled reproducer in previous
>> patch had been made of -
>>   fd = open("/proc/self/ns/mnt", O_RDONLY)
>>   umount2("/", MNT_DETACH);
>>   setns(fd, CLONE_NEWNS)
>>   umount2("/", MNT_DETACH);
>>
>> IMO what it shows is setns() bug.  This "switch root/cwd, no matter what"
>> is wrong.
>
> IMO the bug is allowing us to unmount things that should never be unmounted.
>
> In a mount namespace created with just user namespace permissions we
> can't get at rootfs because MNT_LOCKED is set on the root directory
> and thus it can not be mounted.
>
> Further if anyone has permission to call chroot and chdir on any mount
> in a mount namespace (that isn't currently covered) they can get at all
> of them that are not currently covered.  A mount namespace where no one
> can get at any uncovered filesystem seems to be the definition of
> useless and ridiculous.
>
>
> Now there is a bug in that MNT_DETACH today does not currently enforce
> MNT_LOCKED on submounts of the mount point that is detached. I am
> currently looking at how to construct the appropriate permission check
> to prevent that.  Unfortunately I can not disallow MNT_DETACH with
> submounts all together as that breaks too many legitimate uses.

Why should MNT_LOCKED on submounts be enforced?

Is it because, if you retain a reference to the detached tree, then
you can see under the submounts?  If so, let's fix *that*.  Because
otherwise the whole model of pivot_root + detach will break.

Also, damn it, we need change_the_ns_root instead of pivot_root.  I
doubt that any container programs actually want to keep the old root
attached after pivot_root.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Eric W. Biederman
Serge Hallyn  writes:

> Quoting Eric W. Biederman (ebied...@xmission.com):
>> Al Viro  writes:
>> 
>> 2> On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote:
>> >> On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote:
>> >> > Another problem is that rootfs can't be hidden from a container, because
>> >> > rootfs can't be moved or umounted.
>> >> 
>> >> ... which is a bug in mntns_install(), AFAICS.
>> >
>> > Ability to get to exposed rootfs, that is.
>> 
>> The container side of this argument is pretty bogus.  It only applies
>> if user namespaces are not used for the container.
>
> User namespaces are still far too restricted for many container use
> cases.  We can't say "we have user namespaces so now privileged
> containers can be ignored".  Yes you never should have handed the
> keys to a privileged container to an untrusted person, but we do
> still try to protect the host from accidental damage due to a
> privileged container.

What I meant is that it isn't about containers.   It is about something
root can do.  So this is not a "container" problem.

>> So it is only root (and not root in a container) who can get to the
>> exposed rootfs.
>> 
>> I have a vague memory someone actually had a real use in miminal systems
>> for being able to get back to the rootfs and being able to use rootfs as
>> the rootfs.  There was even a patch at that time that Andrew Morton was
>> carrying for a time to allow unmounting root and get at rootfs, and to
>> prevent the oops on rootfs unmount in some way.
>> 
>> So not only do I not think it is a bug to get back too rootfs, I think
>> it is a feature that some people have expressed at least half-way sane
>> uses for.
>
> They can still do that if they want, using chroot :)

It would take fchdir or fchroot and a directory file descriptor open on
rootfs.  Frequently there is no appropriate directory file descriptor.

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Serge Hallyn
Quoting Eric W. Biederman (ebied...@xmission.com):
> Al Viro  writes:
> 
> 2> On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote:
> >> On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote:
> >> > Another problem is that rootfs can't be hidden from a container, because
> >> > rootfs can't be moved or umounted.
> >> 
> >> ... which is a bug in mntns_install(), AFAICS.
> >
> > Ability to get to exposed rootfs, that is.
> 
> The container side of this argument is pretty bogus.  It only applies
> if user namespaces are not used for the container.

User namespaces are still far too restricted for many container use
cases.  We can't say "we have user namespaces so now privileged
containers can be ignored".  Yes you never should have handed the
keys to a privileged container to an untrusted person, but we do
still try to protect the host from accidental damage due to a
privileged container.

> So it is only root (and not root in a container) who can get to the
> exposed rootfs.
> 
> I have a vague memory someone actually had a real use in miminal systems
> for being able to get back to the rootfs and being able to use rootfs as
> the rootfs.  There was even a patch at that time that Andrew Morton was
> carrying for a time to allow unmounting root and get at rootfs, and to
> prevent the oops on rootfs unmount in some way.
> 
> So not only do I not think it is a bug to get back too rootfs, I think
> it is a feature that some people have expressed at least half-way sane
> uses for.

They can still do that if they want, using chroot :)

> >> > Here is an example how to get access to rootfs:
> >> > fd = open("/proc/self/ns/mnt", O_RDONLY)
> >> > umount2("/", MNT_DETACH);
> >> > setns(fd, CLONE_NEWNS)
> >> > 
> >> > rootfs may contain data, which should not be avaliable in CT-s.
> >> 
> >> Indeed.
> >
> > ... and it looks like the above is what your mangled reproducer in previous
> > patch had been made of -
> > fd = open("/proc/self/ns/mnt", O_RDONLY)
> > umount2("/", MNT_DETACH);
> > setns(fd, CLONE_NEWNS)
> > umount2("/", MNT_DETACH);
> >
> > IMO what it shows is setns() bug.  This "switch root/cwd, no matter what"
> > is wrong.
> 
> IMO the bug is allowing us to unmount things that should never be unmounted.
> 
> In a mount namespace created with just user namespace permissions we
> can't get at rootfs because MNT_LOCKED is set on the root directory
> and thus it can not be mounted.
> 
> Further if anyone has permission to call chroot and chdir on any mount
> in a mount namespace (that isn't currently covered) they can get at all
> of them that are not currently covered.  A mount namespace where no one
> can get at any uncovered filesystem seems to be the definition of
> useless and ridiculous.
> 
> 
> Now there is a bug in that MNT_DETACH today does not currently enforce
> MNT_LOCKED on submounts of the mount point that is detached. I am
> currently looking at how to construct the appropriate permission check
> to prevent that.  Unfortunately I can not disallow MNT_DETACH with
> submounts all together as that breaks too many legitimate uses.
> 
> That failure to enforce MNT_LOCKED is my mistake. I had a naive notion
> that submounts would remain mounted after a mount detach and I misread
> the code when I did the original work.  My mistake. 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Eric W. Biederman
Andrey Vagin  writes:

> From: Andrey Vagin 
>
> Currently when we create a new container with a separate root,
> we need to clone the current mount namespace with all mounts and then
> clean up it by using pivot_root(). A big part of mountpoints are cloned
> only to be umounted.

Is the motivation performance?  Because if that is the motivation we
need numbers.

> Another problem is that rootfs can't be hidden from a container, because
> rootfs can't be moved or umounted.
>
> Here is an example how to get access to rootfs:
> fd = open("/proc/self/ns/mnt", O_RDONLY)
> umount2("/", MNT_DETACH);
> setns(fd, CLONE_NEWNS)
>
> rootfs may contain data, which should not be avaliable in CT-s.

Well don't give those containers CAP_SYS_ADMIN.  If you aren't using
user namespaces there is no expectation of safety from those kinds of
problems.  Getting at rootfs is perfectly valid for root.

> I suggest to add ability to create a mount namespace with specified
> mount points. A current task root can be used as a root for the new
> mount namespace.

I really don't think you are going to like the result because you will
loose access to /proc and /sys.

> With this patch you can call chroot(ct->rootfs) and
> unshare(UNSHARE_NEWNS2) to get a clean mount namespace.

That is a little bit of an ugly way to smuggle a parameter into the
creation of a mount namespace.  Further I am pretty certain this patch
totally breaks the setting of new_fs->root and new_fs->pwd.


In net my opinion is that the code doesn't work and does not provide
sufficient justification for a new system call.

> UNSHARE_NEWNS2 can be used only with the unshare() syscall. The clone()
> syscall doesn't have unused flags.
>
> Here is an example how it looks like:
> $ cat ../../unshare.c
>
> int main(int argc, char **argv)
> {
/* You left out
 * mount --bind /some/root/path /some/root/path
 * chroot /some/root/path
 */

>   if (unshare(UNSHARE_NEWNS2))
>   return 1;
>
>   execl("/bin/bash", "/bin/bash", NULL);
>   return 1;
> }
> $ mount --bind test/ubuntu/ test/ubuntu/
> $ cd test/ubuntu/
> $ chroot .
> $ ./unshare2
> $ mount -t proc proc proc
> $ cat /proc/self/mountinfo
> 55 55 252:1 /home/avagin/test/ubuntu / rw,relatime - ext4 
> /dev/disk/by-uuid/d672b85f-533c-4868-9609-ca80be52d3c6 
> rw,errors=remount-ro,data=ordered
> 56 55 0:3 / /proc rw,relatime - proc proc rw
>
> Cc: Alexander Viro 
> Cc: Andrew Morton 
> Cc: "Eric W. Biederman" 
> Cc: Cyrill Gorcunov 
> Cc: Pavel Emelyanov 
> Cc: Serge Hallyn 
> Cc: Rob Landley 
> Signed-off-by: Andrey Vagin 
> ---
>  fs/namespace.c | 16 ++--
>  include/uapi/linux/sched.h |  8 
>  kernel/fork.c  | 11 ---
>  kernel/nsproxy.c   |  2 +-
>  4 files changed, 31 insertions(+), 6 deletions(-)
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 730c50e..f50a848 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2569,12 +2569,24 @@ struct mnt_namespace *copy_mnt_ns(unsigned long 
> flags, struct mnt_namespace *ns,
>  
>   BUG_ON(!ns);
>  
> - if (likely(!(flags & CLONE_NEWNS))) {
> + if (likely(!(flags & (CLONE_NEWNS | UNSHARE_NEWNS2 {
>   get_mnt_ns(ns);
>   return ns;
>   }
>  
> - old = ns->root;
> + if (flags & CLONE_NEWNS)
> + old = ns->root;
> + else { /* UNSHARE_NEWNS2 */
> + struct path root;
> +
> + get_fs_root(current->fs, );
> + if (root.mnt->mnt_root != root.dentry) {
> + path_put();
> + return ERR_PTR(-EINVAL); /* not a mountpoint */
> + }
> + old = real_mount(root.mnt);
> + path_put();
> + }
>  
>   new_ns = alloc_mnt_ns(user_ns);
>   if (IS_ERR(new_ns))
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index 34f9d73..8092e50 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -31,6 +31,14 @@
>  #define CLONE_IO 0x8000  /* Clone io context */
>  
>  /*
> + * Following flags can be used only with unshare(), because
> + * they are intersected with CSIGNAL
> + */
> +#define UNSHARE_NEWNS2   0x0001  /* Clone mnt namespace 
> starting with the current task root. */
> +
> +#define UNSHARE_FLAGS(UNSHARE_NEWNS2)
> +
> +/*
>   * Scheduling policies
>   */
>  #define SCHED_NORMAL 0
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 0cf9cdb..52f1fc0 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1381,7 +1381,12 @@ static struct task_struct *copy_process(unsigned long 
> clone_flags,
>   retval = copy_mm(clone_flags, p);
>   if (retval)
>   goto bad_fork_cleanup_signal;
> - retval = copy_namespaces(clone_flags, p);
> +
> + /*
> +  * CSIGNAL and UNSHARE_FLAGS are intersected, but
> +  * UNSHARE_FLAGS can't be used with clone().
> +  */
> 

Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Eric W. Biederman
Al Viro  writes:

2> On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote:
>> On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote:
>> > Another problem is that rootfs can't be hidden from a container, because
>> > rootfs can't be moved or umounted.
>> 
>> ... which is a bug in mntns_install(), AFAICS.
>
> Ability to get to exposed rootfs, that is.

The container side of this argument is pretty bogus.  It only applies
if user namespaces are not used for the container.

So it is only root (and not root in a container) who can get to the
exposed rootfs.

I have a vague memory someone actually had a real use in miminal systems
for being able to get back to the rootfs and being able to use rootfs as
the rootfs.  There was even a patch at that time that Andrew Morton was
carrying for a time to allow unmounting root and get at rootfs, and to
prevent the oops on rootfs unmount in some way.

So not only do I not think it is a bug to get back too rootfs, I think
it is a feature that some people have expressed at least half-way sane
uses for.

>> > Here is an example how to get access to rootfs:
>> > fd = open("/proc/self/ns/mnt", O_RDONLY)
>> > umount2("/", MNT_DETACH);
>> > setns(fd, CLONE_NEWNS)
>> > 
>> > rootfs may contain data, which should not be avaliable in CT-s.
>> 
>> Indeed.
>
> ... and it looks like the above is what your mangled reproducer in previous
> patch had been made of -
>   fd = open("/proc/self/ns/mnt", O_RDONLY)
>   umount2("/", MNT_DETACH);
>   setns(fd, CLONE_NEWNS)
>   umount2("/", MNT_DETACH);
>
> IMO what it shows is setns() bug.  This "switch root/cwd, no matter what"
> is wrong.

IMO the bug is allowing us to unmount things that should never be unmounted.

In a mount namespace created with just user namespace permissions we
can't get at rootfs because MNT_LOCKED is set on the root directory
and thus it can not be mounted.

Further if anyone has permission to call chroot and chdir on any mount
in a mount namespace (that isn't currently covered) they can get at all
of them that are not currently covered.  A mount namespace where no one
can get at any uncovered filesystem seems to be the definition of
useless and ridiculous.


Now there is a bug in that MNT_DETACH today does not currently enforce
MNT_LOCKED on submounts of the mount point that is detached. I am
currently looking at how to construct the appropriate permission check
to prevent that.  Unfortunately I can not disallow MNT_DETACH with
submounts all together as that breaks too many legitimate uses.

That failure to enforce MNT_LOCKED is my mistake. I had a naive notion
that submounts would remain mounted after a mount detach and I misread
the code when I did the original work.  My mistake. 

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Andrew Vagin
On Tue, Oct 07, 2014 at 02:33:39PM +0100, Al Viro wrote:
> On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote:
> > On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote:
> > > Another problem is that rootfs can't be hidden from a container, because
> > > rootfs can't be moved or umounted.
> > 
> > ... which is a bug in mntns_install(), AFAICS.
> 
> Ability to get to exposed rootfs, that is.
> 
> > > Here is an example how to get access to rootfs:
> > > fd = open("/proc/self/ns/mnt", O_RDONLY)
> > > umount2("/", MNT_DETACH);
> > > setns(fd, CLONE_NEWNS)
> > > 
> > > rootfs may contain data, which should not be avaliable in CT-s.
> > 
> > Indeed.
> 
> ... and it looks like the above is what your mangled reproducer in previous
> patch had been made of -
>   fd = open("/proc/self/ns/mnt", O_RDONLY)
>   umount2("/", MNT_DETACH);
>   setns(fd, CLONE_NEWNS)
>   umount2("/", MNT_DETACH);
> 
> IMO what it shows is setns() bug.  This "switch root/cwd, no matter what"
> is wrong.

How can we move in another namespace without this "switch"? If the task
root and cwd isn't switched, they points to the old tree. In this case
how can we open something from the new tree?

Do you mean that root should not be changed if it points on a detached
mount or on a mount from another mount namespace? This looks reasonable.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Al Viro
On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote:
> On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote:
> > Another problem is that rootfs can't be hidden from a container, because
> > rootfs can't be moved or umounted.
> 
> ... which is a bug in mntns_install(), AFAICS.

Ability to get to exposed rootfs, that is.

> > Here is an example how to get access to rootfs:
> > fd = open("/proc/self/ns/mnt", O_RDONLY)
> > umount2("/", MNT_DETACH);
> > setns(fd, CLONE_NEWNS)
> > 
> > rootfs may contain data, which should not be avaliable in CT-s.
> 
> Indeed.

... and it looks like the above is what your mangled reproducer in previous
patch had been made of -
fd = open("/proc/self/ns/mnt", O_RDONLY)
umount2("/", MNT_DETACH);
setns(fd, CLONE_NEWNS)
umount2("/", MNT_DETACH);

IMO what it shows is setns() bug.  This "switch root/cwd, no matter what"
is wrong.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Al Viro
On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote:
> Another problem is that rootfs can't be hidden from a container, because
> rootfs can't be moved or umounted.

... which is a bug in mntns_install(), AFAICS.

> Here is an example how to get access to rootfs:
> fd = open("/proc/self/ns/mnt", O_RDONLY)
> umount2("/", MNT_DETACH);
> setns(fd, CLONE_NEWNS)
> 
> rootfs may contain data, which should not be avaliable in CT-s.

Indeed.

> I suggest to add ability to create a mount namespace with specified
> mount points. A current task root can be used as a root for the new
> mount namespace.

Yecchh...  Frankly, you are opening a big can of worms - having rootfs
as absolute root of all namespaces simplifies a lot of things in there.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Andrey Vagin
From: Andrey Vagin 

Currently when we create a new container with a separate root,
we need to clone the current mount namespace with all mounts and then
clean up it by using pivot_root(). A big part of mountpoints are cloned
only to be umounted.

Another problem is that rootfs can't be hidden from a container, because
rootfs can't be moved or umounted.

Here is an example how to get access to rootfs:
fd = open("/proc/self/ns/mnt", O_RDONLY)
umount2("/", MNT_DETACH);
setns(fd, CLONE_NEWNS)

rootfs may contain data, which should not be avaliable in CT-s.

I suggest to add ability to create a mount namespace with specified
mount points. A current task root can be used as a root for the new
mount namespace.

With this patch you can call chroot(ct->rootfs) and
unshare(UNSHARE_NEWNS2) to get a clean mount namespace.

UNSHARE_NEWNS2 can be used only with the unshare() syscall. The clone()
syscall doesn't have unused flags.

Here is an example how it looks like:
$ cat ../../unshare.c

int main(int argc, char **argv)
{
if (unshare(UNSHARE_NEWNS2))
return 1;

execl("/bin/bash", "/bin/bash", NULL);
return 1;
}
$ mount --bind test/ubuntu/ test/ubuntu/
$ cd test/ubuntu/
$ chroot .
$ ./unshare2
$ mount -t proc proc proc
$ cat /proc/self/mountinfo
55 55 252:1 /home/avagin/test/ubuntu / rw,relatime - ext4 
/dev/disk/by-uuid/d672b85f-533c-4868-9609-ca80be52d3c6 
rw,errors=remount-ro,data=ordered
56 55 0:3 / /proc rw,relatime - proc proc rw

Cc: Alexander Viro 
Cc: Andrew Morton 
Cc: "Eric W. Biederman" 
Cc: Cyrill Gorcunov 
Cc: Pavel Emelyanov 
Cc: Serge Hallyn 
Cc: Rob Landley 
Signed-off-by: Andrey Vagin 
---
 fs/namespace.c | 16 ++--
 include/uapi/linux/sched.h |  8 
 kernel/fork.c  | 11 ---
 kernel/nsproxy.c   |  2 +-
 4 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 730c50e..f50a848 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2569,12 +2569,24 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, 
struct mnt_namespace *ns,
 
BUG_ON(!ns);
 
-   if (likely(!(flags & CLONE_NEWNS))) {
+   if (likely(!(flags & (CLONE_NEWNS | UNSHARE_NEWNS2 {
get_mnt_ns(ns);
return ns;
}
 
-   old = ns->root;
+   if (flags & CLONE_NEWNS)
+   old = ns->root;
+   else { /* UNSHARE_NEWNS2 */
+   struct path root;
+
+   get_fs_root(current->fs, );
+   if (root.mnt->mnt_root != root.dentry) {
+   path_put();
+   return ERR_PTR(-EINVAL); /* not a mountpoint */
+   }
+   old = real_mount(root.mnt);
+   path_put();
+   }
 
new_ns = alloc_mnt_ns(user_ns);
if (IS_ERR(new_ns))
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 34f9d73..8092e50 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -31,6 +31,14 @@
 #define CLONE_IO   0x8000  /* Clone io context */
 
 /*
+ * Following flags can be used only with unshare(), because
+ * they are intersected with CSIGNAL
+ */
+#define UNSHARE_NEWNS2 0x0001  /* Clone mnt namespace starting 
with the current task root. */
+
+#define UNSHARE_FLAGS  (UNSHARE_NEWNS2)
+
+/*
  * Scheduling policies
  */
 #define SCHED_NORMAL   0
diff --git a/kernel/fork.c b/kernel/fork.c
index 0cf9cdb..52f1fc0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1381,7 +1381,12 @@ static struct task_struct *copy_process(unsigned long 
clone_flags,
retval = copy_mm(clone_flags, p);
if (retval)
goto bad_fork_cleanup_signal;
-   retval = copy_namespaces(clone_flags, p);
+
+   /*
+* CSIGNAL and UNSHARE_FLAGS are intersected, but
+* UNSHARE_FLAGS can't be used with clone().
+*/
+   retval = copy_namespaces(clone_flags & ~UNSHARE_FLAGS, p);
if (retval)
goto bad_fork_cleanup_mm;
retval = copy_io(clone_flags, p);
@@ -1790,7 +1795,7 @@ static int check_unshare_flags(unsigned long 
unshare_flags)
if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
-   CLONE_NEWUSER|CLONE_NEWPID))
+   CLONE_NEWUSER|CLONE_NEWPID|UNSHARE_FLAGS))
return -EINVAL;
/*
 * Not implemented, but pretend it works if there is nothing to
@@ -1880,7 +1885,7 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags)
/*
 * If unsharing namespace, must also unshare filesystem information.
 */
-   if (unshare_flags & CLONE_NEWNS)
+   if (unshare_flags & (CLONE_NEWNS | UNSHARE_NEWNS2))
unshare_flags |= CLONE_FS;
 

[PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Andrey Vagin
From: Andrey Vagin ava...@gmail.com

Currently when we create a new container with a separate root,
we need to clone the current mount namespace with all mounts and then
clean up it by using pivot_root(). A big part of mountpoints are cloned
only to be umounted.

Another problem is that rootfs can't be hidden from a container, because
rootfs can't be moved or umounted.

Here is an example how to get access to rootfs:
fd = open(/proc/self/ns/mnt, O_RDONLY)
umount2(/, MNT_DETACH);
setns(fd, CLONE_NEWNS)

rootfs may contain data, which should not be avaliable in CT-s.

I suggest to add ability to create a mount namespace with specified
mount points. A current task root can be used as a root for the new
mount namespace.

With this patch you can call chroot(ct-rootfs) and
unshare(UNSHARE_NEWNS2) to get a clean mount namespace.

UNSHARE_NEWNS2 can be used only with the unshare() syscall. The clone()
syscall doesn't have unused flags.

Here is an example how it looks like:
$ cat ../../unshare.c

int main(int argc, char **argv)
{
if (unshare(UNSHARE_NEWNS2))
return 1;

execl(/bin/bash, /bin/bash, NULL);
return 1;
}
$ mount --bind test/ubuntu/ test/ubuntu/
$ cd test/ubuntu/
$ chroot .
$ ./unshare2
$ mount -t proc proc proc
$ cat /proc/self/mountinfo
55 55 252:1 /home/avagin/test/ubuntu / rw,relatime - ext4 
/dev/disk/by-uuid/d672b85f-533c-4868-9609-ca80be52d3c6 
rw,errors=remount-ro,data=ordered
56 55 0:3 / /proc rw,relatime - proc proc rw

Cc: Alexander Viro v...@zeniv.linux.org.uk
Cc: Andrew Morton a...@linux-foundation.org
Cc: Eric W. Biederman ebied...@xmission.com
Cc: Cyrill Gorcunov gorcu...@openvz.org
Cc: Pavel Emelyanov xe...@parallels.com
Cc: Serge Hallyn serge.hal...@canonical.com
Cc: Rob Landley r...@landley.net
Signed-off-by: Andrey Vagin ava...@openvz.org
---
 fs/namespace.c | 16 ++--
 include/uapi/linux/sched.h |  8 
 kernel/fork.c  | 11 ---
 kernel/nsproxy.c   |  2 +-
 4 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 730c50e..f50a848 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2569,12 +2569,24 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, 
struct mnt_namespace *ns,
 
BUG_ON(!ns);
 
-   if (likely(!(flags  CLONE_NEWNS))) {
+   if (likely(!(flags  (CLONE_NEWNS | UNSHARE_NEWNS2 {
get_mnt_ns(ns);
return ns;
}
 
-   old = ns-root;
+   if (flags  CLONE_NEWNS)
+   old = ns-root;
+   else { /* UNSHARE_NEWNS2 */
+   struct path root;
+
+   get_fs_root(current-fs, root);
+   if (root.mnt-mnt_root != root.dentry) {
+   path_put(root);
+   return ERR_PTR(-EINVAL); /* not a mountpoint */
+   }
+   old = real_mount(root.mnt);
+   path_put(root);
+   }
 
new_ns = alloc_mnt_ns(user_ns);
if (IS_ERR(new_ns))
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 34f9d73..8092e50 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -31,6 +31,14 @@
 #define CLONE_IO   0x8000  /* Clone io context */
 
 /*
+ * Following flags can be used only with unshare(), because
+ * they are intersected with CSIGNAL
+ */
+#define UNSHARE_NEWNS2 0x0001  /* Clone mnt namespace starting 
with the current task root. */
+
+#define UNSHARE_FLAGS  (UNSHARE_NEWNS2)
+
+/*
  * Scheduling policies
  */
 #define SCHED_NORMAL   0
diff --git a/kernel/fork.c b/kernel/fork.c
index 0cf9cdb..52f1fc0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1381,7 +1381,12 @@ static struct task_struct *copy_process(unsigned long 
clone_flags,
retval = copy_mm(clone_flags, p);
if (retval)
goto bad_fork_cleanup_signal;
-   retval = copy_namespaces(clone_flags, p);
+
+   /*
+* CSIGNAL and UNSHARE_FLAGS are intersected, but
+* UNSHARE_FLAGS can't be used with clone().
+*/
+   retval = copy_namespaces(clone_flags  ~UNSHARE_FLAGS, p);
if (retval)
goto bad_fork_cleanup_mm;
retval = copy_io(clone_flags, p);
@@ -1790,7 +1795,7 @@ static int check_unshare_flags(unsigned long 
unshare_flags)
if (unshare_flags  ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
-   CLONE_NEWUSER|CLONE_NEWPID))
+   CLONE_NEWUSER|CLONE_NEWPID|UNSHARE_FLAGS))
return -EINVAL;
/*
 * Not implemented, but pretend it works if there is nothing to
@@ -1880,7 +1885,7 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags)
/*
 * If unsharing namespace, must also unshare 

Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Al Viro
On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote:
 Another problem is that rootfs can't be hidden from a container, because
 rootfs can't be moved or umounted.

... which is a bug in mntns_install(), AFAICS.

 Here is an example how to get access to rootfs:
 fd = open(/proc/self/ns/mnt, O_RDONLY)
 umount2(/, MNT_DETACH);
 setns(fd, CLONE_NEWNS)
 
 rootfs may contain data, which should not be avaliable in CT-s.

Indeed.

 I suggest to add ability to create a mount namespace with specified
 mount points. A current task root can be used as a root for the new
 mount namespace.

Yecchh...  Frankly, you are opening a big can of worms - having rootfs
as absolute root of all namespaces simplifies a lot of things in there.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Al Viro
On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote:
 On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote:
  Another problem is that rootfs can't be hidden from a container, because
  rootfs can't be moved or umounted.
 
 ... which is a bug in mntns_install(), AFAICS.

Ability to get to exposed rootfs, that is.

  Here is an example how to get access to rootfs:
  fd = open(/proc/self/ns/mnt, O_RDONLY)
  umount2(/, MNT_DETACH);
  setns(fd, CLONE_NEWNS)
  
  rootfs may contain data, which should not be avaliable in CT-s.
 
 Indeed.

... and it looks like the above is what your mangled reproducer in previous
patch had been made of -
fd = open(/proc/self/ns/mnt, O_RDONLY)
umount2(/, MNT_DETACH);
setns(fd, CLONE_NEWNS)
umount2(/, MNT_DETACH);

IMO what it shows is setns() bug.  This switch root/cwd, no matter what
is wrong.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Andrew Vagin
On Tue, Oct 07, 2014 at 02:33:39PM +0100, Al Viro wrote:
 On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote:
  On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote:
   Another problem is that rootfs can't be hidden from a container, because
   rootfs can't be moved or umounted.
  
  ... which is a bug in mntns_install(), AFAICS.
 
 Ability to get to exposed rootfs, that is.
 
   Here is an example how to get access to rootfs:
   fd = open(/proc/self/ns/mnt, O_RDONLY)
   umount2(/, MNT_DETACH);
   setns(fd, CLONE_NEWNS)
   
   rootfs may contain data, which should not be avaliable in CT-s.
  
  Indeed.
 
 ... and it looks like the above is what your mangled reproducer in previous
 patch had been made of -
   fd = open(/proc/self/ns/mnt, O_RDONLY)
   umount2(/, MNT_DETACH);
   setns(fd, CLONE_NEWNS)
   umount2(/, MNT_DETACH);
 
 IMO what it shows is setns() bug.  This switch root/cwd, no matter what
 is wrong.

How can we move in another namespace without this switch? If the task
root and cwd isn't switched, they points to the old tree. In this case
how can we open something from the new tree?

Do you mean that root should not be changed if it points on a detached
mount or on a mount from another mount namespace? This looks reasonable.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Eric W. Biederman
Al Viro v...@zeniv.linux.org.uk writes:

2 On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote:
 On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote:
  Another problem is that rootfs can't be hidden from a container, because
  rootfs can't be moved or umounted.
 
 ... which is a bug in mntns_install(), AFAICS.

 Ability to get to exposed rootfs, that is.

The container side of this argument is pretty bogus.  It only applies
if user namespaces are not used for the container.

So it is only root (and not root in a container) who can get to the
exposed rootfs.

I have a vague memory someone actually had a real use in miminal systems
for being able to get back to the rootfs and being able to use rootfs as
the rootfs.  There was even a patch at that time that Andrew Morton was
carrying for a time to allow unmounting root and get at rootfs, and to
prevent the oops on rootfs unmount in some way.

So not only do I not think it is a bug to get back too rootfs, I think
it is a feature that some people have expressed at least half-way sane
uses for.

  Here is an example how to get access to rootfs:
  fd = open(/proc/self/ns/mnt, O_RDONLY)
  umount2(/, MNT_DETACH);
  setns(fd, CLONE_NEWNS)
  
  rootfs may contain data, which should not be avaliable in CT-s.
 
 Indeed.

 ... and it looks like the above is what your mangled reproducer in previous
 patch had been made of -
   fd = open(/proc/self/ns/mnt, O_RDONLY)
   umount2(/, MNT_DETACH);
   setns(fd, CLONE_NEWNS)
   umount2(/, MNT_DETACH);

 IMO what it shows is setns() bug.  This switch root/cwd, no matter what
 is wrong.

IMO the bug is allowing us to unmount things that should never be unmounted.

In a mount namespace created with just user namespace permissions we
can't get at rootfs because MNT_LOCKED is set on the root directory
and thus it can not be mounted.

Further if anyone has permission to call chroot and chdir on any mount
in a mount namespace (that isn't currently covered) they can get at all
of them that are not currently covered.  A mount namespace where no one
can get at any uncovered filesystem seems to be the definition of
useless and ridiculous.


Now there is a bug in that MNT_DETACH today does not currently enforce
MNT_LOCKED on submounts of the mount point that is detached. I am
currently looking at how to construct the appropriate permission check
to prevent that.  Unfortunately I can not disallow MNT_DETACH with
submounts all together as that breaks too many legitimate uses.

That failure to enforce MNT_LOCKED is my mistake. I had a naive notion
that submounts would remain mounted after a mount detach and I misread
the code when I did the original work.  My mistake. 

Eric
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Serge Hallyn
Quoting Eric W. Biederman (ebied...@xmission.com):
 Al Viro v...@zeniv.linux.org.uk writes:
 
 2 On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote:
  On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote:
   Another problem is that rootfs can't be hidden from a container, because
   rootfs can't be moved or umounted.
  
  ... which is a bug in mntns_install(), AFAICS.
 
  Ability to get to exposed rootfs, that is.
 
 The container side of this argument is pretty bogus.  It only applies
 if user namespaces are not used for the container.

User namespaces are still far too restricted for many container use
cases.  We can't say we have user namespaces so now privileged
containers can be ignored.  Yes you never should have handed the
keys to a privileged container to an untrusted person, but we do
still try to protect the host from accidental damage due to a
privileged container.

 So it is only root (and not root in a container) who can get to the
 exposed rootfs.
 
 I have a vague memory someone actually had a real use in miminal systems
 for being able to get back to the rootfs and being able to use rootfs as
 the rootfs.  There was even a patch at that time that Andrew Morton was
 carrying for a time to allow unmounting root and get at rootfs, and to
 prevent the oops on rootfs unmount in some way.
 
 So not only do I not think it is a bug to get back too rootfs, I think
 it is a feature that some people have expressed at least half-way sane
 uses for.

They can still do that if they want, using chroot :)

   Here is an example how to get access to rootfs:
   fd = open(/proc/self/ns/mnt, O_RDONLY)
   umount2(/, MNT_DETACH);
   setns(fd, CLONE_NEWNS)
   
   rootfs may contain data, which should not be avaliable in CT-s.
  
  Indeed.
 
  ... and it looks like the above is what your mangled reproducer in previous
  patch had been made of -
  fd = open(/proc/self/ns/mnt, O_RDONLY)
  umount2(/, MNT_DETACH);
  setns(fd, CLONE_NEWNS)
  umount2(/, MNT_DETACH);
 
  IMO what it shows is setns() bug.  This switch root/cwd, no matter what
  is wrong.
 
 IMO the bug is allowing us to unmount things that should never be unmounted.
 
 In a mount namespace created with just user namespace permissions we
 can't get at rootfs because MNT_LOCKED is set on the root directory
 and thus it can not be mounted.
 
 Further if anyone has permission to call chroot and chdir on any mount
 in a mount namespace (that isn't currently covered) they can get at all
 of them that are not currently covered.  A mount namespace where no one
 can get at any uncovered filesystem seems to be the definition of
 useless and ridiculous.
 
 
 Now there is a bug in that MNT_DETACH today does not currently enforce
 MNT_LOCKED on submounts of the mount point that is detached. I am
 currently looking at how to construct the appropriate permission check
 to prevent that.  Unfortunately I can not disallow MNT_DETACH with
 submounts all together as that breaks too many legitimate uses.
 
 That failure to enforce MNT_LOCKED is my mistake. I had a naive notion
 that submounts would remain mounted after a mount detach and I misread
 the code when I did the original work.  My mistake. 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Eric W. Biederman
Andrey Vagin ava...@openvz.org writes:

 From: Andrey Vagin ava...@gmail.com

 Currently when we create a new container with a separate root,
 we need to clone the current mount namespace with all mounts and then
 clean up it by using pivot_root(). A big part of mountpoints are cloned
 only to be umounted.

Is the motivation performance?  Because if that is the motivation we
need numbers.

 Another problem is that rootfs can't be hidden from a container, because
 rootfs can't be moved or umounted.

 Here is an example how to get access to rootfs:
 fd = open(/proc/self/ns/mnt, O_RDONLY)
 umount2(/, MNT_DETACH);
 setns(fd, CLONE_NEWNS)

 rootfs may contain data, which should not be avaliable in CT-s.

Well don't give those containers CAP_SYS_ADMIN.  If you aren't using
user namespaces there is no expectation of safety from those kinds of
problems.  Getting at rootfs is perfectly valid for root.

 I suggest to add ability to create a mount namespace with specified
 mount points. A current task root can be used as a root for the new
 mount namespace.

I really don't think you are going to like the result because you will
loose access to /proc and /sys.

 With this patch you can call chroot(ct-rootfs) and
 unshare(UNSHARE_NEWNS2) to get a clean mount namespace.

That is a little bit of an ugly way to smuggle a parameter into the
creation of a mount namespace.  Further I am pretty certain this patch
totally breaks the setting of new_fs-root and new_fs-pwd.


In net my opinion is that the code doesn't work and does not provide
sufficient justification for a new system call.

 UNSHARE_NEWNS2 can be used only with the unshare() syscall. The clone()
 syscall doesn't have unused flags.

 Here is an example how it looks like:
 $ cat ../../unshare.c

 int main(int argc, char **argv)
 {
/* You left out
 * mount --bind /some/root/path /some/root/path
 * chroot /some/root/path
 */

   if (unshare(UNSHARE_NEWNS2))
   return 1;

   execl(/bin/bash, /bin/bash, NULL);
   return 1;
 }
 $ mount --bind test/ubuntu/ test/ubuntu/
 $ cd test/ubuntu/
 $ chroot .
 $ ./unshare2
 $ mount -t proc proc proc
 $ cat /proc/self/mountinfo
 55 55 252:1 /home/avagin/test/ubuntu / rw,relatime - ext4 
 /dev/disk/by-uuid/d672b85f-533c-4868-9609-ca80be52d3c6 
 rw,errors=remount-ro,data=ordered
 56 55 0:3 / /proc rw,relatime - proc proc rw

 Cc: Alexander Viro v...@zeniv.linux.org.uk
 Cc: Andrew Morton a...@linux-foundation.org
 Cc: Eric W. Biederman ebied...@xmission.com
 Cc: Cyrill Gorcunov gorcu...@openvz.org
 Cc: Pavel Emelyanov xe...@parallels.com
 Cc: Serge Hallyn serge.hal...@canonical.com
 Cc: Rob Landley r...@landley.net
 Signed-off-by: Andrey Vagin ava...@openvz.org
 ---
  fs/namespace.c | 16 ++--
  include/uapi/linux/sched.h |  8 
  kernel/fork.c  | 11 ---
  kernel/nsproxy.c   |  2 +-
  4 files changed, 31 insertions(+), 6 deletions(-)

 diff --git a/fs/namespace.c b/fs/namespace.c
 index 730c50e..f50a848 100644
 --- a/fs/namespace.c
 +++ b/fs/namespace.c
 @@ -2569,12 +2569,24 @@ struct mnt_namespace *copy_mnt_ns(unsigned long 
 flags, struct mnt_namespace *ns,
  
   BUG_ON(!ns);
  
 - if (likely(!(flags  CLONE_NEWNS))) {
 + if (likely(!(flags  (CLONE_NEWNS | UNSHARE_NEWNS2 {
   get_mnt_ns(ns);
   return ns;
   }
  
 - old = ns-root;
 + if (flags  CLONE_NEWNS)
 + old = ns-root;
 + else { /* UNSHARE_NEWNS2 */
 + struct path root;
 +
 + get_fs_root(current-fs, root);
 + if (root.mnt-mnt_root != root.dentry) {
 + path_put(root);
 + return ERR_PTR(-EINVAL); /* not a mountpoint */
 + }
 + old = real_mount(root.mnt);
 + path_put(root);
 + }
  
   new_ns = alloc_mnt_ns(user_ns);
   if (IS_ERR(new_ns))
 diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
 index 34f9d73..8092e50 100644
 --- a/include/uapi/linux/sched.h
 +++ b/include/uapi/linux/sched.h
 @@ -31,6 +31,14 @@
  #define CLONE_IO 0x8000  /* Clone io context */
  
  /*
 + * Following flags can be used only with unshare(), because
 + * they are intersected with CSIGNAL
 + */
 +#define UNSHARE_NEWNS2   0x0001  /* Clone mnt namespace 
 starting with the current task root. */
 +
 +#define UNSHARE_FLAGS(UNSHARE_NEWNS2)
 +
 +/*
   * Scheduling policies
   */
  #define SCHED_NORMAL 0
 diff --git a/kernel/fork.c b/kernel/fork.c
 index 0cf9cdb..52f1fc0 100644
 --- a/kernel/fork.c
 +++ b/kernel/fork.c
 @@ -1381,7 +1381,12 @@ static struct task_struct *copy_process(unsigned long 
 clone_flags,
   retval = copy_mm(clone_flags, p);
   if (retval)
   goto bad_fork_cleanup_signal;
 - retval = copy_namespaces(clone_flags, p);
 +
 + /*
 +  * CSIGNAL and UNSHARE_FLAGS are intersected, but
 

Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Eric W. Biederman
Serge Hallyn serge.hal...@ubuntu.com writes:

 Quoting Eric W. Biederman (ebied...@xmission.com):
 Al Viro v...@zeniv.linux.org.uk writes:
 
 2 On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote:
  On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote:
   Another problem is that rootfs can't be hidden from a container, because
   rootfs can't be moved or umounted.
  
  ... which is a bug in mntns_install(), AFAICS.
 
  Ability to get to exposed rootfs, that is.
 
 The container side of this argument is pretty bogus.  It only applies
 if user namespaces are not used for the container.

 User namespaces are still far too restricted for many container use
 cases.  We can't say we have user namespaces so now privileged
 containers can be ignored.  Yes you never should have handed the
 keys to a privileged container to an untrusted person, but we do
 still try to protect the host from accidental damage due to a
 privileged container.

What I meant is that it isn't about containers.   It is about something
root can do.  So this is not a container problem.

 So it is only root (and not root in a container) who can get to the
 exposed rootfs.
 
 I have a vague memory someone actually had a real use in miminal systems
 for being able to get back to the rootfs and being able to use rootfs as
 the rootfs.  There was even a patch at that time that Andrew Morton was
 carrying for a time to allow unmounting root and get at rootfs, and to
 prevent the oops on rootfs unmount in some way.
 
 So not only do I not think it is a bug to get back too rootfs, I think
 it is a feature that some people have expressed at least half-way sane
 uses for.

 They can still do that if they want, using chroot :)

It would take fchdir or fchroot and a directory file descriptor open on
rootfs.  Frequently there is no appropriate directory file descriptor.

Eric
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Andy Lutomirski
On Tue, Oct 7, 2014 at 1:30 PM, Eric W. Biederman ebied...@xmission.com wrote:
 Al Viro v...@zeniv.linux.org.uk writes:

 2 On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote:
 On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote:
  Another problem is that rootfs can't be hidden from a container, because
  rootfs can't be moved or umounted.

 ... which is a bug in mntns_install(), AFAICS.

 Ability to get to exposed rootfs, that is.

 The container side of this argument is pretty bogus.  It only applies
 if user namespaces are not used for the container.

 So it is only root (and not root in a container) who can get to the
 exposed rootfs.

 I have a vague memory someone actually had a real use in miminal systems
 for being able to get back to the rootfs and being able to use rootfs as
 the rootfs.  There was even a patch at that time that Andrew Morton was
 carrying for a time to allow unmounting root and get at rootfs, and to
 prevent the oops on rootfs unmount in some way.

 So not only do I not think it is a bug to get back too rootfs, I think
 it is a feature that some people have expressed at least half-way sane
 uses for.

  Here is an example how to get access to rootfs:
  fd = open(/proc/self/ns/mnt, O_RDONLY)
  umount2(/, MNT_DETACH);
  setns(fd, CLONE_NEWNS)
 
  rootfs may contain data, which should not be avaliable in CT-s.

 Indeed.

 ... and it looks like the above is what your mangled reproducer in previous
 patch had been made of -
   fd = open(/proc/self/ns/mnt, O_RDONLY)
   umount2(/, MNT_DETACH);
   setns(fd, CLONE_NEWNS)
   umount2(/, MNT_DETACH);

 IMO what it shows is setns() bug.  This switch root/cwd, no matter what
 is wrong.

 IMO the bug is allowing us to unmount things that should never be unmounted.

 In a mount namespace created with just user namespace permissions we
 can't get at rootfs because MNT_LOCKED is set on the root directory
 and thus it can not be mounted.

 Further if anyone has permission to call chroot and chdir on any mount
 in a mount namespace (that isn't currently covered) they can get at all
 of them that are not currently covered.  A mount namespace where no one
 can get at any uncovered filesystem seems to be the definition of
 useless and ridiculous.


 Now there is a bug in that MNT_DETACH today does not currently enforce
 MNT_LOCKED on submounts of the mount point that is detached. I am
 currently looking at how to construct the appropriate permission check
 to prevent that.  Unfortunately I can not disallow MNT_DETACH with
 submounts all together as that breaks too many legitimate uses.

Why should MNT_LOCKED on submounts be enforced?

Is it because, if you retain a reference to the detached tree, then
you can see under the submounts?  If so, let's fix *that*.  Because
otherwise the whole model of pivot_root + detach will break.

Also, damn it, we need change_the_ns_root instead of pivot_root.  I
doubt that any container programs actually want to keep the old root
attached after pivot_root.

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Eric W. Biederman
Andy Lutomirski l...@amacapital.net writes:

 Why should MNT_LOCKED on submounts be enforced?

 Is it because, if you retain a reference to the detached tree, then
 you can see under the submounts?  

Yes. MNT_DETACH is a recursive operation that detaches all of the mount
and all of it's submounts.  Which means you can see under the submounts
if you have a reference to a detached mount.

 If so, let's fix *that*.  Because
 otherwise the whole model of pivot_root + detach will break.

I am not certain what you are referring to.  pivot_root doesn't
manipulate the mount tree so you can see under anything.

What I believe is the appropriate fix is to fail umount2(...,MNT_DETACH)
if there are any referenced mount points being detached that have a
locked submount.

 Also, damn it, we need change_the_ns_root instead of pivot_root.  I
 doubt that any container programs actually want to keep the old root
 attached after pivot_root.

Shrug.  Except for chroot_fs_refs() pivot_root is a cheap.

I'm not particularly in favor of merging pivot_root and umount2.  The
number of weird cases in the current api are high.  A merged piece of
code would just make them higher.

I am hoping that one more round of bug fixing will at least get the bugs
for having unprivileged mounts fixed in the current API.

Eric
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Serge Hallyn
Quoting Eric W. Biederman (ebied...@xmission.com):
 Serge Hallyn serge.hal...@ubuntu.com writes:
 
  Quoting Eric W. Biederman (ebied...@xmission.com):
  Al Viro v...@zeniv.linux.org.uk writes:
  
  2 On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote:
   On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote:
Another problem is that rootfs can't be hidden from a container, 
because
rootfs can't be moved or umounted.
   
   ... which is a bug in mntns_install(), AFAICS.
  
   Ability to get to exposed rootfs, that is.
  
  The container side of this argument is pretty bogus.  It only applies
  if user namespaces are not used for the container.
 
  User namespaces are still far too restricted for many container use
  cases.  We can't say we have user namespaces so now privileged
  containers can be ignored.  Yes you never should have handed the
  keys to a privileged container to an untrusted person, but we do
  still try to protect the host from accidental damage due to a
  privileged container.
 
 What I meant is that it isn't about containers.   It is about something
 root can do.  So this is not a container problem.

Oh, ok.  

Sorry, I'm getting the two thread confused anyway.  I'm going to bow out
here until I can pay proper attention.

  So it is only root (and not root in a container) who can get to the
  exposed rootfs.
  
  I have a vague memory someone actually had a real use in miminal systems
  for being able to get back to the rootfs and being able to use rootfs as
  the rootfs.  There was even a patch at that time that Andrew Morton was
  carrying for a time to allow unmounting root and get at rootfs, and to
  prevent the oops on rootfs unmount in some way.
  
  So not only do I not think it is a bug to get back too rootfs, I think
  it is a feature that some people have expressed at least half-way sane
  uses for.
 
  They can still do that if they want, using chroot :)
 
 It would take fchdir or fchroot and a directory file descriptor open on
 rootfs.  Frequently there is no appropriate directory file descriptor.

?  you can always escape if you're simply chrooted.  waterbuffalo :)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Serge Hallyn
Quoting Andy Lutomirski (l...@amacapital.net):
 On Tue, Oct 7, 2014 at 1:30 PM, Eric W. Biederman ebied...@xmission.com 
 wrote:
  Al Viro v...@zeniv.linux.org.uk writes:
 
  2 On Tue, Oct 07, 2014 at 02:30:40PM +0100, Al Viro wrote:
  On Tue, Oct 07, 2014 at 04:12:57PM +0400, Andrey Vagin wrote:
   Another problem is that rootfs can't be hidden from a container, because
   rootfs can't be moved or umounted.
 
  ... which is a bug in mntns_install(), AFAICS.
 
  Ability to get to exposed rootfs, that is.
 
  The container side of this argument is pretty bogus.  It only applies
  if user namespaces are not used for the container.
 
  So it is only root (and not root in a container) who can get to the
  exposed rootfs.
 
  I have a vague memory someone actually had a real use in miminal systems
  for being able to get back to the rootfs and being able to use rootfs as
  the rootfs.  There was even a patch at that time that Andrew Morton was
  carrying for a time to allow unmounting root and get at rootfs, and to
  prevent the oops on rootfs unmount in some way.
 
  So not only do I not think it is a bug to get back too rootfs, I think
  it is a feature that some people have expressed at least half-way sane
  uses for.
 
   Here is an example how to get access to rootfs:
   fd = open(/proc/self/ns/mnt, O_RDONLY)
   umount2(/, MNT_DETACH);
   setns(fd, CLONE_NEWNS)
  
   rootfs may contain data, which should not be avaliable in CT-s.
 
  Indeed.
 
  ... and it looks like the above is what your mangled reproducer in previous
  patch had been made of -
fd = open(/proc/self/ns/mnt, O_RDONLY)
umount2(/, MNT_DETACH);
setns(fd, CLONE_NEWNS)
umount2(/, MNT_DETACH);
 
  IMO what it shows is setns() bug.  This switch root/cwd, no matter what
  is wrong.
 
  IMO the bug is allowing us to unmount things that should never be unmounted.
 
  In a mount namespace created with just user namespace permissions we
  can't get at rootfs because MNT_LOCKED is set on the root directory
  and thus it can not be mounted.
 
  Further if anyone has permission to call chroot and chdir on any mount
  in a mount namespace (that isn't currently covered) they can get at all
  of them that are not currently covered.  A mount namespace where no one
  can get at any uncovered filesystem seems to be the definition of
  useless and ridiculous.
 
 
  Now there is a bug in that MNT_DETACH today does not currently enforce
  MNT_LOCKED on submounts of the mount point that is detached. I am
  currently looking at how to construct the appropriate permission check
  to prevent that.  Unfortunately I can not disallow MNT_DETACH with
  submounts all together as that breaks too many legitimate uses.
 
 Why should MNT_LOCKED on submounts be enforced?
 
 Is it because, if you retain a reference to the detached tree, then
 you can see under the submounts?  If so, let's fix *that*.  Because
 otherwise the whole model of pivot_root + detach will break.
 
 Also, damn it, we need change_the_ns_root instead of pivot_root.  I
 doubt that any container programs actually want to keep the old root
 attached after pivot_root.

Right I think that'll fix the problem we were having, and I think
Andrey said the same thing in another list a few days ago.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Andy Lutomirski
On Tue, Oct 7, 2014 at 2:26 PM, Eric W. Biederman ebied...@xmission.com wrote:
 Andy Lutomirski l...@amacapital.net writes:

 Why should MNT_LOCKED on submounts be enforced?

 Is it because, if you retain a reference to the detached tree, then
 you can see under the submounts?

 Yes. MNT_DETACH is a recursive operation that detaches all of the mount
 and all of it's submounts.  Which means you can see under the submounts
 if you have a reference to a detached mount.

 If so, let's fix *that*.  Because
 otherwise the whole model of pivot_root + detach will break.

 I am not certain what you are referring to.  pivot_root doesn't
 manipulate the mount tree so you can see under anything.

 What I believe is the appropriate fix is to fail umount2(...,MNT_DETACH)
 if there are any referenced mount points being detached that have a
 locked submount.

Most of the container-using things do, roughly:

Unshare userns and mountns
Mount some new stuff
pivot_root to the new stuff
MNT_DETACH the old.

That last step will almost always fail if you make this change.

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Eric W. Biederman
Serge Hallyn serge.hal...@ubuntu.com writes:

 Quoting Eric W. Biederman (ebied...@xmission.com):

 What I meant is that it isn't about containers.   It is about something
 root can do.  So this is not a container problem.

 Oh, ok.  

 Sorry, I'm getting the two thread confused anyway.  I'm going to bow out
 here until I can pay proper attention.

I think there is half a point here that may be legit, when you are using
mount namespaces to jail applications, there may be a problem
with umounting / and making it to the underlying rootfs filesystem.

I am squinting and looking this way and that but while I can imagine
someone more clever than I can think up some unique property of rootfs
that makes it a little more exploitable than just mounting a ramfs,
but since you have to be root to exploit those properties I think the
game is pretty much lost.

  So it is only root (and not root in a container) who can get to the
  exposed rootfs.
  
  I have a vague memory someone actually had a real use in miminal systems
  for being able to get back to the rootfs and being able to use rootfs as
  the rootfs.  There was even a patch at that time that Andrew Morton was
  carrying for a time to allow unmounting root and get at rootfs, and to
  prevent the oops on rootfs unmount in some way.
  
  So not only do I not think it is a bug to get back too rootfs, I think
  it is a feature that some people have expressed at least half-way sane
  uses for.
 
  They can still do that if they want, using chroot :)
 
 It would take fchdir or fchroot and a directory file descriptor open on
 rootfs.  Frequently there is no appropriate directory file descriptor.

 ?  you can always escape if you're simply chrooted.  waterbuffalo :)

filesystem type rootfs.

Eric

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Eric W. Biederman
Andy Lutomirski l...@amacapital.net writes:

 On Tue, Oct 7, 2014 at 2:26 PM, Eric W. Biederman ebied...@xmission.com 
 wrote:
 Andy Lutomirski l...@amacapital.net writes:

 Why should MNT_LOCKED on submounts be enforced?

 Is it because, if you retain a reference to the detached tree, then
 you can see under the submounts?

 Yes. MNT_DETACH is a recursive operation that detaches all of the mount
 and all of it's submounts.  Which means you can see under the submounts
 if you have a reference to a detached mount.

 If so, let's fix *that*.  Because
 otherwise the whole model of pivot_root + detach will break.

 I am not certain what you are referring to.  pivot_root doesn't
 manipulate the mount tree so you can see under anything.

 What I believe is the appropriate fix is to fail umount2(...,MNT_DETACH)
 if there are any referenced mount points being detached that have a
 locked submount.

 Most of the container-using things do, roughly:

 Unshare userns and mountns
 Mount some new stuff
 pivot_root to the new stuff
 MNT_DETACH the old.

 That last step will almost always fail if you make this change.

I don't think so.

I expect I could add full busy detection of normal umounts and those
applications would not fail.

What I am proposing is a more targeted version of busy detection that
looks at each mount in the set that detach will unmount.  For each mount
if it is busy with non-submount references and it has at least one
locked submount fail the detach with -EBUSY.

Do you really think we have userspace references to the one or more of the
mounts under old?

Eric
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Andy Lutomirski
On Tue, Oct 7, 2014 at 2:50 PM, Eric W. Biederman ebied...@xmission.com wrote:
 Andy Lutomirski l...@amacapital.net writes:

 On Tue, Oct 7, 2014 at 2:26 PM, Eric W. Biederman ebied...@xmission.com 
 wrote:
 Andy Lutomirski l...@amacapital.net writes:

 Why should MNT_LOCKED on submounts be enforced?

 Is it because, if you retain a reference to the detached tree, then
 you can see under the submounts?

 Yes. MNT_DETACH is a recursive operation that detaches all of the mount
 and all of it's submounts.  Which means you can see under the submounts
 if you have a reference to a detached mount.

 If so, let's fix *that*.  Because
 otherwise the whole model of pivot_root + detach will break.

 I am not certain what you are referring to.  pivot_root doesn't
 manipulate the mount tree so you can see under anything.

 What I believe is the appropriate fix is to fail umount2(...,MNT_DETACH)
 if there are any referenced mount points being detached that have a
 locked submount.

 Most of the container-using things do, roughly:

 Unshare userns and mountns
 Mount some new stuff
 pivot_root to the new stuff
 MNT_DETACH the old.

 That last step will almost always fail if you make this change.

 I don't think so.

 I expect I could add full busy detection of normal umounts and those
 applications would not fail.

 What I am proposing is a more targeted version of busy detection that
 looks at each mount in the set that detach will unmount.  For each mount
 if it is busy with non-submount references and it has at least one
 locked submount fail the detach with -EBUSY.

 Do you really think we have userspace references to the one or more of the
 mounts under old?


I suspect that we have a userspace reference to old itself.

--Andy

 Eric



-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Andy Lutomirski
On Tue, Oct 7, 2014 at 2:42 PM, Eric W. Biederman ebied...@xmission.com wrote:

 I am squinting and looking this way and that but while I can imagine
 someone more clever than I can think up some unique property of rootfs
 that makes it a little more exploitable than just mounting a ramfs,
 but since you have to be root to exploit those properties I think the
 game is pretty much lost.

Yes.  rootfs might not be empty, it might have totally insane
permissions, and it's globally shared, which makes it into a wonderful
channel to pass things around that shouldn't be passed around.

Can non-root do this?  You'd need to be in a userns with a / that
isn't MNT_LOCKED.  Can this happen on any normal setup?

FWIW, I think we should unconditionally MNT_LOCKED the root on userns
unshare, even if it's the only mount.



  So it is only root (and not root in a container) who can get to the
  exposed rootfs.
 
  I have a vague memory someone actually had a real use in miminal systems
  for being able to get back to the rootfs and being able to use rootfs as
  the rootfs.  There was even a patch at that time that Andrew Morton was
  carrying for a time to allow unmounting root and get at rootfs, and to
  prevent the oops on rootfs unmount in some way.
 
  So not only do I not think it is a bug to get back too rootfs, I think
  it is a feature that some people have expressed at least half-way sane
  uses for.
 
  They can still do that if they want, using chroot :)

 It would take fchdir or fchroot and a directory file descriptor open on
 rootfs.  Frequently there is no appropriate directory file descriptor.

 ?  you can always escape if you're simply chrooted.  waterbuffalo :)

 filesystem type rootfs.

 Eric

 --
 To unsubscribe from this list: send the line unsubscribe linux-api in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Eric W. Biederman
Andy Lutomirski l...@amacapital.net writes:

 On Tue, Oct 7, 2014 at 2:42 PM, Eric W. Biederman ebied...@xmission.com 
 wrote:

 I am squinting and looking this way and that but while I can imagine
 someone more clever than I can think up some unique property of rootfs
 that makes it a little more exploitable than just mounting a ramfs,
 but since you have to be root to exploit those properties I think the
 game is pretty much lost.

 Yes.  rootfs might not be empty, it might have totally insane
 permissions, and it's globally shared, which makes it into a wonderful
 channel to pass things around that shouldn't be passed around.

But if only root with proc mounted can reach it...  I don't know.
There might be a case for setting MNT_LOCKED when we overmount /
as root but I don't yet see it.

 Can non-root do this?  You'd need to be in a userns with a / that
 isn't MNT_LOCKED.  Can this happen on any normal setup?

 FWIW, I think we should unconditionally MNT_LOCKED the root on userns
 unshare, even if it's the only mount.

To the best of my knowledge MNT_LOCKED is set uncondintially on userns
unshare.

Eric

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Andy Lutomirski
On Tue, Oct 7, 2014 at 3:42 PM, Eric W. Biederman ebied...@xmission.com wrote:
 Andy Lutomirski l...@amacapital.net writes:

 On Tue, Oct 7, 2014 at 2:42 PM, Eric W. Biederman ebied...@xmission.com 
 wrote:

 I am squinting and looking this way and that but while I can imagine
 someone more clever than I can think up some unique property of rootfs
 that makes it a little more exploitable than just mounting a ramfs,
 but since you have to be root to exploit those properties I think the
 game is pretty much lost.

 Yes.  rootfs might not be empty, it might have totally insane
 permissions, and it's globally shared, which makes it into a wonderful
 channel to pass things around that shouldn't be passed around.

 But if only root with proc mounted can reach it...  I don't know.

It doesn't have to be global root.  It could be userns root.

 There might be a case for setting MNT_LOCKED when we overmount /
 as root but I don't yet see it.

 Can non-root do this?  You'd need to be in a userns with a / that
 isn't MNT_LOCKED.  Can this happen on any normal setup?

 FWIW, I think we should unconditionally MNT_LOCKED the root on userns
 unshare, even if it's the only mount.

 To the best of my knowledge MNT_LOCKED is set uncondintially on userns
 unshare.

Only if list_empty(old-mnt_expire), whatever that means, I think.

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Eric W. Biederman
Andy Lutomirski l...@amacapital.net writes:

 On Tue, Oct 7, 2014 at 3:42 PM, Eric W. Biederman ebied...@xmission.com 
 wrote:
 Andy Lutomirski l...@amacapital.net writes:

 On Tue, Oct 7, 2014 at 2:42 PM, Eric W. Biederman ebied...@xmission.com 
 wrote:

 I am squinting and looking this way and that but while I can imagine
 someone more clever than I can think up some unique property of rootfs
 that makes it a little more exploitable than just mounting a ramfs,
 but since you have to be root to exploit those properties I think the
 game is pretty much lost.

 Yes.  rootfs might not be empty, it might have totally insane
 permissions, and it's globally shared, which makes it into a wonderful
 channel to pass things around that shouldn't be passed around.

 But if only root with proc mounted can reach it...  I don't know.

 It doesn't have to be global root.  It could be userns root.

 There might be a case for setting MNT_LOCKED when we overmount /
 as root but I don't yet see it.

 Can non-root do this?  You'd need to be in a userns with a / that
 isn't MNT_LOCKED.  Can this happen on any normal setup?

 FWIW, I think we should unconditionally MNT_LOCKED the root on userns
 unshare, even if it's the only mount.

 To the best of my knowledge MNT_LOCKED is set uncondintially on userns
 unshare.

 Only if list_empty(old-mnt_expire), whatever that means, I think.

An autofs or nfs automounted mount.  Can those ever become root?

Eric
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Andy Lutomirski
On Tue, Oct 7, 2014 at 4:42 PM, Eric W. Biederman ebied...@xmission.com wrote:
 Andy Lutomirski l...@amacapital.net writes:

 On Tue, Oct 7, 2014 at 3:42 PM, Eric W. Biederman ebied...@xmission.com 
 wrote:
 Andy Lutomirski l...@amacapital.net writes:

 On Tue, Oct 7, 2014 at 2:42 PM, Eric W. Biederman ebied...@xmission.com 
 wrote:

 I am squinting and looking this way and that but while I can imagine
 someone more clever than I can think up some unique property of rootfs
 that makes it a little more exploitable than just mounting a ramfs,
 but since you have to be root to exploit those properties I think the
 game is pretty much lost.

 Yes.  rootfs might not be empty, it might have totally insane
 permissions, and it's globally shared, which makes it into a wonderful
 channel to pass things around that shouldn't be passed around.

 But if only root with proc mounted can reach it...  I don't know.

 It doesn't have to be global root.  It could be userns root.

 There might be a case for setting MNT_LOCKED when we overmount /
 as root but I don't yet see it.

 Can non-root do this?  You'd need to be in a userns with a / that
 isn't MNT_LOCKED.  Can this happen on any normal setup?

 FWIW, I think we should unconditionally MNT_LOCKED the root on userns
 unshare, even if it's the only mount.

 To the best of my knowledge MNT_LOCKED is set uncondintially on userns
 unshare.

 Only if list_empty(old-mnt_expire), whatever that means, I think.

 An autofs or nfs automounted mount.  Can those ever become root?

Dunno.

I thought that this code was what set MNT_LOCKED for things that
should be locked:

/* Don't allow unprivileged users to reveal what is under a mount */
if ((flag  CL_UNPRIVILEGED)  list_empty(old-mnt_expire))
mnt-mnt.mnt_flags |= MNT_LOCKED;

Now I'm confused.  Shouldn't that be checking for submounts?  Is that
what it's doing?

Anyway, I think that this should unconditionally set MNT_LOCKED on the
thing that mounted on rootfs.

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Eric W. Biederman
Andy Lutomirski l...@amacapital.net writes:

 On Tue, Oct 7, 2014 at 4:42 PM, Eric W. Biederman ebied...@xmission.com 
 wrote:
 Andy Lutomirski l...@amacapital.net writes:

 On Tue, Oct 7, 2014 at 3:42 PM, Eric W. Biederman ebied...@xmission.com 
 wrote:
 Andy Lutomirski l...@amacapital.net writes:

 On Tue, Oct 7, 2014 at 2:42 PM, Eric W. Biederman ebied...@xmission.com 
 wrote:

 I am squinting and looking this way and that but while I can imagine
 someone more clever than I can think up some unique property of rootfs
 that makes it a little more exploitable than just mounting a ramfs,
 but since you have to be root to exploit those properties I think the
 game is pretty much lost.

 Yes.  rootfs might not be empty, it might have totally insane
 permissions, and it's globally shared, which makes it into a wonderful
 channel to pass things around that shouldn't be passed around.

 But if only root with proc mounted can reach it...  I don't know.

 It doesn't have to be global root.  It could be userns root.

 There might be a case for setting MNT_LOCKED when we overmount /
 as root but I don't yet see it.

 Can non-root do this?  You'd need to be in a userns with a / that
 isn't MNT_LOCKED.  Can this happen on any normal setup?

 FWIW, I think we should unconditionally MNT_LOCKED the root on userns
 unshare, even if it's the only mount.

 To the best of my knowledge MNT_LOCKED is set uncondintially on userns
 unshare.

 Only if list_empty(old-mnt_expire), whatever that means, I think.

 An autofs or nfs automounted mount.  Can those ever become root?

 Dunno.

 I thought that this code was what set MNT_LOCKED for things that
 should be locked:

It does.

 /* Don't allow unprivileged users to reveal what is under a mount */
 if ((flag  CL_UNPRIVILEGED)  list_empty(old-mnt_expire))
 mnt-mnt.mnt_flags |= MNT_LOCKED;

 Now I'm confused.  Shouldn't that be checking for submounts?  Is that
 what it's doing?

As it copies each mount (mostly submounts) it sets MNT_LOCKED.

 Anyway, I think that this should unconditionally set MNT_LOCKED on the
 thing that mounted on rootfs.

As I read the code mnt_set_expiry is only for nfs, cifs, and afs
submounts (and perhaps proc should use them). So as they are generated
mnt_expiry should never start on the root of filesystem of the mount tree.

Furthermore mnt_expiry is cleared when a mount is moved, and when
it is bind mounted, or propagated.

pivot_root does look as though it may be missing the proper mnt_expiry
handling list_del_init(old-mnt_expire), but pivot_root does have
proper MNT_LOCKED handling.

Looking that test was slightly off and it should be:
if ((flag  CL_UNPRIVILEGED) 
(!(flag  CL_EXPIRE) || list_empty(old-mnt_expire))
mnt-mnt.mnt_flags |= MNT_LOCKED;

So on that note we might clear CL_EXPIRE when CL_UNPRIVILEGED is set
in copy_tree but I don't see that as being really needed.

Eric
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

2014-10-07 Thread Andy Lutomirski
On Tue, Oct 7, 2014 at 5:20 PM, Eric W. Biederman ebied...@xmission.com wrote:
 Andy Lutomirski l...@amacapital.net writes:

 On Tue, Oct 7, 2014 at 4:42 PM, Eric W. Biederman ebied...@xmission.com 
 wrote:
 Andy Lutomirski l...@amacapital.net writes:

 On Tue, Oct 7, 2014 at 3:42 PM, Eric W. Biederman ebied...@xmission.com 
 wrote:
 Andy Lutomirski l...@amacapital.net writes:

 On Tue, Oct 7, 2014 at 2:42 PM, Eric W. Biederman 
 ebied...@xmission.com wrote:

 I am squinting and looking this way and that but while I can imagine
 someone more clever than I can think up some unique property of rootfs
 that makes it a little more exploitable than just mounting a ramfs,
 but since you have to be root to exploit those properties I think the
 game is pretty much lost.

 Yes.  rootfs might not be empty, it might have totally insane
 permissions, and it's globally shared, which makes it into a wonderful
 channel to pass things around that shouldn't be passed around.

 But if only root with proc mounted can reach it...  I don't know.

 It doesn't have to be global root.  It could be userns root.

 There might be a case for setting MNT_LOCKED when we overmount /
 as root but I don't yet see it.

 Can non-root do this?  You'd need to be in a userns with a / that
 isn't MNT_LOCKED.  Can this happen on any normal setup?

 FWIW, I think we should unconditionally MNT_LOCKED the root on userns
 unshare, even if it's the only mount.

 To the best of my knowledge MNT_LOCKED is set uncondintially on userns
 unshare.

 Only if list_empty(old-mnt_expire), whatever that means, I think.

 An autofs or nfs automounted mount.  Can those ever become root?

 Dunno.

 I thought that this code was what set MNT_LOCKED for things that
 should be locked:

 It does.

 /* Don't allow unprivileged users to reveal what is under a mount */
 if ((flag  CL_UNPRIVILEGED)  list_empty(old-mnt_expire))
 mnt-mnt.mnt_flags |= MNT_LOCKED;

 Now I'm confused.  Shouldn't that be checking for submounts?  Is that
 what it's doing?

 As it copies each mount (mostly submounts) it sets MNT_LOCKED.

Oh.  They're *all* MNT_LOCKED.  Duh.


 Anyway, I think that this should unconditionally set MNT_LOCKED on the
 thing that mounted on rootfs.

 As I read the code mnt_set_expiry is only for nfs, cifs, and afs
 submounts (and perhaps proc should use them). So as they are generated
 mnt_expiry should never start on the root of filesystem of the mount tree.

 Furthermore mnt_expiry is cleared when a mount is moved, and when
 it is bind mounted, or propagated.

 pivot_root does look as though it may be missing the proper mnt_expiry
 handling list_del_init(old-mnt_expire), but pivot_root does have
 proper MNT_LOCKED handling.

pivot_root is quite broken, as noted in my other email.  It's just not
broken like this, I think. :)


 Looking that test was slightly off and it should be:
 if ((flag  CL_UNPRIVILEGED) 
 (!(flag  CL_EXPIRE) || list_empty(old-mnt_expire))
 mnt-mnt.mnt_flags |= MNT_LOCKED;

 So on that note we might clear CL_EXPIRE when CL_UNPRIVILEGED is set
 in copy_tree but I don't see that as being really needed.

 Eric



-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/