Re: [CRIU] Introspecting userns relationships to other namespaces?

2016-07-12 Thread W. Trevor King
On Tue, Jul 12, 2016 at 05:08:43PM -0700, Andrew Vagin wrote:
> Here is a patch to get an owning user namespace:
> https://github.com/avagin/linux-task-diag/commit/7fad8ff3fc4110bebf0920cec2388390b3bd2238
> https://github.com/avagin/linux-task-diag/commit/2663bc803d324785e328261f3c07a0fef37d2088
>
> Here is an example of how it looks from user space:
> https://github.com/avagin/linux-task-diag/blob/namespaces/tools/testing/selftests/nsfs/owner.c#L49

Overall this looks good to me (I left a handful of uninformed comments
inline ;).

It doesn't make it easy to walk leafward, but it doesn't look like the
kernel has a convenient way to list child namespaces either.
Something like /proc/<pid>/task/<tid>/children (with
CONFIG_PROC_CHILDREN) for namespaces would make it easier to get a
complete system overview (as far as your credentials and position in
the namespace hierarchies allow).  But looking at the
CONFIG_PROC_CHILDREN implementation doesn't make me all that excited
about mimicking it for namespaces ;).

You can still brute-force it in userspace by walking the root-most
procfs mounts you can find and peeking at all the /proc/<pid>/ns/… entries
(but yuck ;).  With mount and other namespaces not being hierarchical,
the “leafward” idea may not be all that useful anyway, but having a
more compact collection of mount namespaces (say) that you know about
would be nice.  Here “know about” should probably mean “know it
exists” but not necessarily “have permission to enter”.  Still,
getting that figured out can happen independently of this parent/owner
work.
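
A minimal sketch of the brute-force walk described above, assuming a
single procfs mount and one namespace type at a time.  The function
name collect_namespaces and its parameters are made up for
illustration; namespaces are told apart by the inode number that
stat() reports for the /proc/<pid>/ns/<type> link.

```c
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Collect the distinct namespace inode numbers found behind
 * <procfs>/<pid>/ns/<ns_name> for every process visible in procfs.
 * Returns the number of distinct namespaces stored in out[] (at most
 * max_out), or -1 if procfs cannot be opened.
 * Hypothetical helper, not part of any existing tool.
 */
int collect_namespaces(const char *procfs, const char *ns_name,
                       ino_t *out, int max_out)
{
    DIR *dir = opendir(procfs);
    struct dirent *entry;
    int count = 0;

    if (dir == NULL)
        return -1;

    while ((entry = readdir(dir)) != NULL) {
        char path[512];
        struct stat st;
        int i, n, seen = 0;

        /* Only numeric directory names are processes. */
        if (!isdigit((unsigned char)entry->d_name[0]))
            continue;

        n = snprintf(path, sizeof(path), "%s/%s/ns/%s",
                     procfs, entry->d_name, ns_name);
        if (n < 0 || (size_t)n >= sizeof(path))
            continue;

        /*
         * stat() follows the link to the namespace file.  It fails for
         * processes that have exited or that we may not inspect; skip them.
         */
        if (stat(path, &st) != 0)
            continue;

        for (i = 0; i < count; i++) {
            if (out[i] == st.st_ino) {
                seen = 1;
                break;
            }
        }
        if (!seen && count < max_out)
            out[count++] = st.st_ino;
    }

    closedir(dir);
    return count;
}
```

Calling collect_namespaces("/proc", "mnt", inodes, 64) gives the mount
namespaces you can see from that procfs, which covers the “know it
exists” case but says nothing about whether you may enter them.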

Cheers,
Trevor

-- 
This email may be signed or encrypted with GnuPG (http://www.gnupg.org).
For more information, see http://en.wikipedia.org/wiki/Pretty_Good_Privacy



Re: [CRIU] Introspecting userns relationships to other namespaces?

2016-07-12 Thread Andrew Vagin
On Sat, Jul 09, 2016 at 01:29:20PM -0500, Eric W. Biederman wrote:
> ebied...@xmission.com (Eric W. Biederman) writes:
> 
> > Andrew Vagin  writes:
> >
> >> All these thoughts about security make me think that kcmp is what we
> >> should use here. Maybe something like this:
> >>
> >> kcmp(pid1, pid2, KCMP_NS_USERNS, fd1, fd2)
> >>
> >> - to check if the userns of the fd1 namespace is equal to the fd2 userns
> >>
> >> kcmp(pid1, pid2, KCMP_NS_PARENT, fd1, fd2)
> >>
> >> - to check if the parent namespace of the fd1 pidns is equal to the fd2 pidns.
> >>
> >> fd1 and fd2 are file descriptors for namespace files.
> >>
> >> So if we want to build a hierarchy, we need to collect all namespaces
> >> and then enumerate them to check dependencies with the help of kcmp.
> >
> > That is certainly one way to go.
> >
> > There is a funny case where we would want to compare a user namespace
> > file descriptor to a parent user namespace file descriptor.
> >
> >
> > Grumble, Grumble.  I think this may actually be a case for creating ioctls
> > for these two cases.  Now that random nsfs file descriptors are bind
> > mountable the original reason for using proc files is not as pressing.
> >
> > One ioctl for the user namespace that owns a file descriptor.
> > One ioctl for the parent namespace of a namespace file descriptor.
> >
> > We also need some way to get a command file descriptor for a file system
> > super block.  Al Viro has a pet project for cleaning up the mount API
> > and this might be the ideal excuse to start looking at that.
> >
> > (In principle we might be able to run commands through the namespace
> >  file descriptor and using an ioctl feels dirty.  But an ioctl that
> >  only uses the fd and request argument does not suffer from the same
> >  problems that ioctls that have to pass additional arguments suffer
> >  from.)
> 
> Of course it should be an error, perhaps -EINVAL, to get a user
> namespace owner or parent namespace that is outside of a process's
> current user namespace or pid namespace.  That way things stay bounded
> within the current namespaces the process is in, which prevents any
> leak possibilities and keeps CRIU working.

I prepared patches with ioctls to see what this would look like.

Here is the whole series:
https://github.com/avagin/linux-task-diag/commits/namespaces

Here is a patch to get an owning user namespace:
https://github.com/avagin/linux-task-diag/commit/7fad8ff3fc4110bebf0920cec2388390b3bd2238
https://github.com/avagin/linux-task-diag/commit/2663bc803d324785e328261f3c07a0fef37d2088

Here is an example of how it looks from user space:
https://github.com/avagin/linux-task-diag/blob/namespaces/tools/testing/selftests/nsfs/owner.c#L49

I like the ioctl idea.  James, Michael, Trevor, what is your
opinion about this?
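
For readers who do not want to open the selftest, here is a rough
user-space sketch of the same idea.  It assumes the request names and
numbers that were later merged into mainline <linux/nsfs.h>
(NS_GET_USERNS and NS_GET_PARENT); the series linked above may use
different names or values.  The helper open_related_ns is made up for
illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Assumption: request numbers as later merged in mainline <linux/nsfs.h>.
 * They are defined here so the sketch builds without that header.
 */
#ifndef NS_GET_USERNS
#define NSIO          0xb7
#define NS_GET_USERNS _IO(NSIO, 0x1)  /* owning user namespace */
#define NS_GET_PARENT _IO(NSIO, 0x2)  /* parent of a pid or user namespace */
#endif

/*
 * Open the namespace file at ns_path and ask the kernel for a file
 * descriptor referring to a related namespace (owner or parent).
 * Returns the new descriptor, or -1 with errno set.
 * Hypothetical helper, not taken from the patches.
 */
int open_related_ns(const char *ns_path, unsigned long request)
{
    int ns_fd, related_fd, saved_errno;

    ns_fd = open(ns_path, O_RDONLY);
    if (ns_fd < 0)
        return -1;

    /* Only the fd and the request are passed; there is no third argument. */
    related_fd = ioctl(ns_fd, request);

    /* Preserve the ioctl's errno across close(). */
    saved_errno = errno;
    close(ns_fd);
    errno = saved_errno;

    return related_fd < 0 ? -1 : related_fd;
}
```

For example, open_related_ns("/proc/self/ns/mnt", NS_GET_USERNS) would
return a descriptor for the user namespace that owns the caller's
mount namespace, on a kernel that implements the request.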

> 
> Eric


Re: [CRIU] Introspecting userns relationships to other namespaces?

2016-07-11 Thread Andrew Vagin
On Mon, Jul 11, 2016 at 06:06:48AM +0900, James Bottomley wrote:
> On Sun, 2016-07-10 at 15:29 -0500, Eric W. Biederman wrote:
> > Andrew Vagin  writes:
> > 
> > > On Fri, Jul 08, 2016 at 10:13:08PM -0500, Eric W. Biederman wrote:
> > > > "W. Trevor King"  writes:
> > > > 
> > > > > On Thu, Jul 07, 2016 at 08:01:52AM -0700, James Bottomley wrote:
> > > > > > In theory, we could get nsfs to show this information as an option
> > > > > > (just add a show_options entry to the superblock ops), but the
> > > > > > problem is that although each namespace has a parent user_ns,
> > > > > > there's no way to get it without digging in the namespace specific
> > > > > > structure.  Probably we should restructure to move it into
> > > > > > ns_common, then we could display it (and enforce all namespaces
> > > > > > having owning user_ns) but it would be a reasonably large (but
> > > > > > mechanical) change.
> > > > >
> > > > > It sounds like everyone is either positive or neutral on this
> > > > > groundwork, even if we haven't decided if/how to expose the
> > > > > information to userspace.  I'm happy to work up a patch while the rest
> > > > > of the discussion continues.  I'm also happy to let someone else work
> > > > > up the patch, if anyone else is chomping at the bit ;).
> > > >
> > > > I am dubious on moving all of the user namespace members into ns_common.
> > > >
> > > > I would be happy to be proved wrong but I suspect in the cases where we
> > > > actually use that user namespace the code will become uglier.  Making
> > > > the ordinary uses uglier to make a rare corner case nicer is the wrong
> > > > trade off.
> > > >
> > > > But feel free to try; it is certainly worth doing if it doesn't make the
> > > > code that uses the user namespaces uglier.
> > >
> > > If it's interesting for someone, I have this patch in my tree
> > > https://github.com/avagin/linux-task-diag/commit/63b32df68ae8d3a3842bae42bbcae3468db76d85
> > >
> > > I can't say that it makes something uglier.
> > 
> > I have only skimmed things but overall it looks better than I had
> > feared.
> 
> It looks about as messy as I feared, but since someone else has done
> all the hard work, I'm happy.
> 
> > At the same time I really really don't like losing the parent pointer 
> > in the user namespace case.  That is seriously obfuscating.

We can do something like this:

@@ -27,11 +27,13 @@ struct user_namespace {
...
-       struct ns_common        ns;
+       union {
+               struct user_namespace   *parent;
+               struct ns_common        ns;
+       };
        unsigned long           flags;
...
@@ -97,6 +97,7 @@ int create_user_ns(struct cred *new)
...
        atomic_set(&ns->count, 1);
        /* Leave the new->user_ns reference with the new user namespace. */
+       BUILD_BUG_ON(&ns->ns.user_ns != &ns->parent);
        ns->parent = parent_ns;
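
To make the aliasing explicit, here is a stand-alone user-space sketch
of the same layout trick.  The struct definitions are simplified
stand-ins, not the kernel's, and the sketch assumes the owner pointer
is the first member of ns_common.  The helper owner_of is made up for
illustration.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel types; only the layout matters. */
struct user_namespace;

struct ns_common {
    struct user_namespace *user_ns;  /* assumed to be the first member */
    unsigned int inum;
};

struct user_namespace {
    union {
        struct user_namespace *parent;  /* alias for ns.user_ns */
        struct ns_common ns;
    };
    unsigned long flags;
};

/* Compile-time check that both names refer to the same storage. */
_Static_assert(offsetof(struct user_namespace, parent) ==
               offsetof(struct user_namespace, ns) +
               offsetof(struct ns_common, user_ns),
               "parent must alias ns.user_ns");

/* Read the owner through the ns_common view of the same storage. */
struct user_namespace *owner_of(struct user_namespace *ns)
{
    return ns->ns.user_ns;
}
```

With this layout, code that sets ns->parent keeps working unchanged,
while generic code can read the same pointer through ns_common.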

> 
> Because it has a slightly different meaning from all other namespaces?
> If I assume that's what you mean, I think looking at it in a different
> way can solve the problem:  The pointer in ns_common is always to the
> owning user_ns, so we can label it as such.  Even for a child user_ns,
> the owning user_ns is simply the parent.  I think it makes logical
> sense to think of all user_ns to namespace relationships as
> owning/owned rather than most as owning/owned and some as parent/child.

I think we can rename ns.user_ns to ns.owner or ns.owner_ns.


Thanks,
Andrew

> 
> James
> 
> > Eric
> > 
> > ___
> > Containers mailing list
> > contain...@lists.linux-foundation.org
> > https://lists.linuxfoundation.org/mailman/listinfo/containers
> > 
> 


Re: [CRIU] Introspecting userns relationships to other namespaces?

2016-07-10 Thread James Bottomley
On Sun, 2016-07-10 at 15:29 -0500, Eric W. Biederman wrote:
> Andrew Vagin  writes:
> 
> > On Fri, Jul 08, 2016 at 10:13:08PM -0500, Eric W. Biederman wrote:
> > > "W. Trevor King"  writes:
> > > 
> > > > On Thu, Jul 07, 2016 at 08:01:52AM -0700, James Bottomley wrote:
> > > > > In theory, we could get nsfs to show this information as an option
> > > > > (just add a show_options entry to the superblock ops), but the
> > > > > problem is that although each namespace has a parent user_ns,
> > > > > there's no way to get it without digging in the namespace specific
> > > > > structure.  Probably we should restructure to move it into
> > > > > ns_common, then we could display it (and enforce all namespaces
> > > > > having owning user_ns) but it would be a reasonably large (but
> > > > > mechanical) change.
> > > >
> > > > It sounds like everyone is either positive or neutral on this
> > > > groundwork, even if we haven't decided if/how to expose the
> > > > information to userspace.  I'm happy to work up a patch while the rest
> > > > of the discussion continues.  I'm also happy to let someone else work
> > > > up the patch, if anyone else is chomping at the bit ;).
> > >
> > > I am dubious on moving all of the user namespace members into ns_common.
> > >
> > > I would be happy to be proved wrong but I suspect in the cases where we
> > > actually use that user namespace the code will become uglier.  Making
> > > the ordinary uses uglier to make a rare corner case nicer is the wrong
> > > trade off.
> > >
> > > But feel free to try; it is certainly worth doing if it doesn't make the
> > > code that uses the user namespaces uglier.
> >
> > If it's interesting for someone, I have this patch in my tree
> > https://github.com/avagin/linux-task-diag/commit/63b32df68ae8d3a3842bae42bbcae3468db76d85
> >
> > I can't say that it makes something uglier.
> 
> I have only skimmed things but overall it looks better than I had
> feared.

It looks about as messy as I feared, but since someone else has done
all the hard work, I'm happy.

> At the same time I really really don't like losing the parent pointer 
> in the user namespace case.  That is seriously obfuscating.

Because it has a slightly different meaning from all other namespaces?
If I assume that's what you mean, I think looking at it in a different
way can solve the problem:  The pointer in ns_common is always to the
owning user_ns, so we can label it as such.  Even for a child user_ns,
the owning user_ns is simply the parent.  I think it makes logical
sense to think of all user_ns to namespace relationships as
owning/owned rather than most as owning/owned and some as parent/child.

James

> Eric



Re: [CRIU] Introspecting userns relationships to other namespaces?

2016-07-10 Thread Eric W. Biederman
Andrew Vagin  writes:

> On Fri, Jul 08, 2016 at 10:13:08PM -0500, Eric W. Biederman wrote:
>> "W. Trevor King"  writes:
>> 
>> > On Thu, Jul 07, 2016 at 08:01:52AM -0700, James Bottomley wrote:
>> >> In theory, we could get nsfs to show this information as an option
>> >> (just add a show_options entry to the superblock ops), but the
>> >> problem is that although each namespace has a parent user_ns,
>> >> there's no way to get it without digging in the namespace specific
>> >> structure.  Probably we should restructure to move it into
>> >> ns_common, then we could display it (and enforce all namespaces
>> >> having owning user_ns) but it would be a reasonably large (but
>> >> mechanical) change.
>> >
>> > It sounds like everyone is either positive or neutral on this
>> > groundwork, even if we haven't decided if/how to expose the
>> > information to userspace.  I'm happy to work up a patch while the rest
>> > of the discussion continues.  I'm also happy to let someone else work
>> > up the patch, if anyone else is chomping at the bit ;).
>> 
>> I am dubious on moving all of the user namespace members into ns_common.
>> 
>> I would be happy to be proved wrong but I suspect in the cases where we
>> actually use that user namespace the code will become uglier.  Making
>> the ordinary uses uglier to make a rare corner case nicer is the wrong
>> trade off.
>> 
>> But feel free to try; it is certainly worth doing if it doesn't make the
>> code that uses the user namespaces uglier.
>
> If it's interesting for someone, I have this patch in my tree
> https://github.com/avagin/linux-task-diag/commit/63b32df68ae8d3a3842bae42bbcae3468db76d85
>
> I can't say that it makes something uglier.

I have only skimmed things but overall it looks better than I had
feared.

At the same time I really really don't like losing the parent pointer in
the user namespace case.  That is seriously obfuscating.

Eric



Re: [CRIU] Introspecting userns relationships to other namespaces?

2016-07-09 Thread Andrew Vagin
On Fri, Jul 08, 2016 at 10:13:08PM -0500, Eric W. Biederman wrote:
> "W. Trevor King"  writes:
> 
> > On Thu, Jul 07, 2016 at 08:01:52AM -0700, James Bottomley wrote:
> >> In theory, we could get nsfs to show this information as an option
> >> (just add a show_options entry to the superblock ops), but the
> >> problem is that although each namespace has a parent user_ns,
> >> there's no way to get it without digging in the namespace specific
> >> structure.  Probably we should restructure to move it into
> >> ns_common, then we could display it (and enforce all namespaces
> >> having owning user_ns) but it would be a reasonably large (but
> >> mechanical) change.
> >
> > It sounds like everyone is either positive or neutral on this
> > groundwork, even if we haven't decided if/how to expose the
> > information to userspace.  I'm happy to work up a patch while the rest
> > of the discussion continues.  I'm also happy to let someone else work
> > up the patch, if anyone else is chomping at the bit ;).
> 
> I am dubious on moving all of the user namespace members into ns_common.
> 
> I would be happy to be proved wrong but I suspect in the cases where we
> actually use that user namespace the code will become uglier.  Making
> the ordinary uses uglier to make a rare corner case nicer is the wrong
> trade off.
> 
> But feel free to try; it is certainly worth doing if it doesn't make the
> code that uses the user namespaces uglier.

If it's interesting for someone, I have this patch in my tree
https://github.com/avagin/linux-task-diag/commit/63b32df68ae8d3a3842bae42bbcae3468db76d85

I can't say that it makes something uglier.

> 
> Eric
> 
> ___
> CRIU mailing list
> c...@openvz.org
> https://lists.openvz.org/mailman/listinfo/criu


Re: [CRIU] Introspecting userns relationships to other namespaces?

2016-07-09 Thread Eric W. Biederman
ebied...@xmission.com (Eric W. Biederman) writes:

> Andrew Vagin  writes:
>
>> All these thoughts about security make me think that kcmp is what we
>> should use here. Maybe something like this:
>>
>> kcmp(pid1, pid2, KCMP_NS_USERNS, fd1, fd2)
>>
>> - to check if the userns of the fd1 namespace is equal to the fd2 userns
>>
>> kcmp(pid1, pid2, KCMP_NS_PARENT, fd1, fd2)
>>
>> - to check if the parent namespace of the fd1 pidns is equal to the fd2 pidns.
>>
>> fd1 and fd2 are file descriptors for namespace files.
>>
>> So if we want to build a hierarchy, we need to collect all namespaces
>> and then enumerate them to check dependencies with the help of kcmp.
>
> That is certainly one way to go.
>
> There is a funny case where we would want to compare a user namespace
> file descriptor to a parent user namespace file descriptor.
>
>
> Grumble, Grumble.  I think this may actually be a case for creating ioctls
> for these two cases.  Now that random nsfs file descriptors are bind
> mountable the original reason for using proc files is not as pressing.
>
> One ioctl for the user namespace that owns a file descriptor.
> One ioctl for the parent namespace of a namespace file descriptor.
>
> We also need some way to get a command file descriptor for a file system
> super block.  Al Viro has a pet project for cleaning up the mount API
> and this might be the ideal excuse to start looking at that.
>
> (In principle we might be able to run commands through the namespace
>  file descriptor and using an ioctl feels dirty.  But an ioctl that
>  only uses the fd and request argument does not suffer from the same
>  problems that ioctls that have to pass additional arguments suffer
>  from.)

Of course it should be an error, perhaps -EINVAL, to get a user
namespace owner or parent namespace that is outside of a process's
current user namespace or pid namespace.  That way things stay bounded
within the current namespaces the process is in, which prevents any
leak possibilities and keeps CRIU working.

Eric


Re: [CRIU] Introspecting userns relationships to other namespaces?

2016-07-09 Thread Eric W. Biederman
Andrew Vagin  writes:

> All these thoughts about security make me think that kcmp is what we
> should use here. Maybe something like this:
>
> kcmp(pid1, pid2, KCMP_NS_USERNS, fd1, fd2)
>
> - to check whether the userns of the fd1 namespace is equal to the fd2 userns
>
> kcmp(pid1, pid2, KCMP_NS_PARENT, fd1, fd2)
>
> - to check whether the parent namespace of the fd1 pidns is equal to the fd2 pidns.
>
> fd1 and fd2 are file descriptors for namespace files.
>
> So if we want to build a hierarchy, we need to collect all namespaces
> and then enumerate them to check dependencies with the help of kcmp.

That is certainly one way to go.

There is a funny case where we would want to compare a user namespace
file descriptor to a parent user namespace file descriptor.


Grumble, Grumble.  I think this may actually be a case for creating ioctls
for these two cases.  Now that random nsfs file descriptors are bind
mountable the original reason for using proc files is not as pressing.

One ioctl for the user namespace that owns a file descriptor.
One ioctl for the parent namespace of a namespace file descriptor.

We also need some way to get a command file descriptor for a file system
super block.  Al Viro has a pet project for cleaning up the mount API
and this might be the ideal excuse to start looking at that.

(In principle we might be able to run commands through the namespace
 file descriptor and using an ioctl feels dirty.  But an ioctl that
 only uses the fd and request argument does not suffer from the same
 problems that ioctls that have to pass additional arguments suffer
 from.)
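
For readers of the archive, the two ioctls described above can be sketched from userspace roughly as follows.  This is a minimal Python sketch, not code from this thread.  It assumes a kernel that provides NS_GET_USERNS and NS_GET_PARENT as documented in ioctl_ns(2) (Linux 4.9 and later), and the numeric request values assume an architecture where _IOC_NONE is 0 (such as x86 or arm).  The helper names related_ns_fd and describe are hypothetical.

```python
import fcntl
import os

# _IO(0xb7, 0x1) and _IO(0xb7, 0x2) from <linux/nsfs.h>.  The numeric values
# below assume _IOC_NONE == 0 (x86, arm); other architectures may differ.
NS_GET_USERNS = 0xB701
NS_GET_PARENT = 0xB702


def related_ns_fd(ns_fd, request):
    """Return a new fd for the owning userns or the parent namespace.

    Returns None when the kernel refuses, for example because the target
    lies outside the caller's own namespaces, the namespace type is not
    hierarchical, or the kernel does not implement the ioctl.
    """
    try:
        return fcntl.ioctl(ns_fd, request)
    except OSError:
        return None


def describe(path):
    """Report the inode of a namespace file plus its owner and parent inodes."""
    fd = os.open(path, os.O_RDONLY)
    try:
        info = {"inode": os.fstat(fd).st_ino}
        for key, request in (("owner_inode", NS_GET_USERNS),
                             ("parent_inode", NS_GET_PARENT)):
            rel = related_ns_fd(fd, request)
            if rel is None:
                info[key] = None
            else:
                info[key] = os.fstat(rel).st_ino
                os.close(rel)
        return info
    finally:
        os.close(fd)


if __name__ == "__main__":
    print(describe("/proc/self/ns/pid"))
```

Only the fd and the request number are passed, which matches the "no additional arguments" shape argued for above; a refusal simply shows up as None in the result.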

Eric


Re: [CRIU] Introspecting userns relationships to other namespaces?

2016-07-09 Thread James Bottomley
On July 9, 2016 4:26:28 PM GMT+09:00, Andrew Vagin  wrote:
>On Fri, Jul 08, 2016 at 10:05:18PM -0500, Eric W. Biederman wrote:
>> James Bottomley  writes:
>> 
>> > On Fri, 2016-07-08 at 18:52 -0500, Eric W. Biederman wrote:
>> >> James Bottomley  writes:
>> >> 
>> >> > On July 8, 2016 1:38:19 PM PDT, Andrew Vagin
>> >> > wrote:
>> >>
>> >> > > What do you think about the idea to mount nsfs and be able to
>> >> > > look up any alive namespace by inum:
>> >> >
>> >> > I think I like it.  It will give us a way to enter any extant
>> >> > namespace.  It will work for Eric's fs namespaces as well.  Perhaps
>> >> > a /process/ns/ Directory?
>> >
>> > As you understood, I meant /proc/ns/ (damn mobile phone
>> > completions).
>> >
>> >> *Shivers*
>> >>
>> >> That makes it very easy to bypass any existing controls that exist
>> >> for getting at namespaces.  It is true that everything of that kind
>> >> is directory based but still.
>> >>
>> >> Plus I think it would serve as information leak to information
>> >> outside of the container.
>> >>
>> >> An operation to get a user namespace file descriptor from some kernel
>> >> object sounds reasonably sane.
>> >>
>> >> A great big list of things sounds about as scary as it can get.  This
>> >> is not the time to be making it easier to escape from containers.
>> >
>> > To be honest, I think this argument is rubbish.  If we're afraid of
>> > giving out a list of all the namespaces, it means we're afraid there's
>> > some security bug and we're trying to obscure it by making the list
>> > hard to get.  All we've done is allayed fears about the bug but the
>> > hackers still know the portals to get through.
>> >
>> > If such a bug exists, it will be possible to exploit it by simply
>> > reconstructing the information from the individual process directories,
>> > so obscurity doesn't protect us and all it does is give us a false
>> > sense of security.   If such a bug doesn't exist, then all the security
>> > mechanisms currently in place (like no re-entry to prior namespace)
>> > should protect us and we can give out the list.
>> >
>> > Let's deal with the world as we'd like it to be (no obscure namespace
>> > bugs) and accept the consequences and the responsibility for fixing
>> > them if we turn out to be slightly incorrect.  We'll end up in a far
>> > better place than security by obscurity would land us.
>>
>> No.  That is not the fear.  The permission checks on /proc/self/ns/xxx
>> are different than if the namespace is bind mounted somewhere.
>>
>> That was done deliberately and with a reasonable amount of forethought.
>> You are asking to throw those permission checks out.   The answer is no.
>>
>> Furthermore there is a much clearer reason not to go with a list of all
>> namespaces. A list of all namespaces breaks CRIU.  As you have described
>> it the list will change depending upon which machine you restore a
>> checkpoint on.  I honestly don't know what kind of havoc that will cause
>> but it is certainly something we won't be able to checkpoint no matter
>> how hard we try.
>
>It's right. I hadn't thought about this.

Me neither.  Sorry for the prior outburst.

I think this means we're back to exposing owning userns in the /proc //ns 
directory. 

>>
>> A global list of namespaces especially of the kind that you can open
>> and get a handle to the namespace is just not appropriate.
>>
>> I know inode numbers come darn close to names but they aren't really
>> names and if it comes to it we can figure out how to preserve an
>> application's view of it all across a checkpoint/restart.  So far it
>> hasn't proven necessary to preserve any inode numbers across
>> checkpoint/restart but again it is theoretically possible if it becomes
>> necessary.
>>
>> Throwing away checkpoint/restart support for the sake of
>> checkpoint/restart is a no-go.
>>
>> Containers fundamentally imply you don't have global visibility,
>> and that is a good thing.
>
>All these thoughts about security make me think that kcmp is what we
>should use here. Maybe something like this:
>
>kcmp(pid1, pid2, KCMP_NS_USERNS, fd1, fd2)
>
>- to check whether the userns of the fd1 namespace is equal to the fd2 userns
>
>kcmp(pid1, pid2, KCMP_NS_PARENT, fd1, fd2)
>
>- to check whether the parent namespace of the fd1 pidns is equal to the fd2 pidns.
>
>fd1 and fd2 are file descriptors for namespace files.
>
>So if we want to build a hierarchy, we need to collect all namespaces
>and then enumerate them to check dependencies with the help of kcmp.

Sure, but we need a method for opening the file handles first...

James 

>> 
>> Eric


Re: [CRIU] Introspecting userns relationships to other namespaces?

2016-07-09 Thread Andrew Vagin
On Fri, Jul 08, 2016 at 10:05:18PM -0500, Eric W. Biederman wrote:
> James Bottomley  writes:
> 
> > On Fri, 2016-07-08 at 18:52 -0500, Eric W. Biederman wrote:
> >> James Bottomley  writes:
> >> 
> >> > On July 8, 2016 1:38:19 PM PDT, Andrew Vagin 
> >> > wrote:
> >> 
> >> > > What do you think about the idea to mount nsfs and be able to 
> >> > > look up any alive namespace by inum:
> >> > 
> >> > I think I like it.  It will give us a way to enter any extant
> >> > namespace.  It will work for Eric's fs namespaces as well.  Perhaps 
> >> > a /process/ns/ Directory?
> >
> > As you understood, I meant /proc/ns/ (damn mobile phone
> > completions).
> >
> >> *Shivers*
> >> 
> >> That makes it very easy to bypass any existing controls that exist 
> >> for getting at namespaces.  It is true that everything of that kind 
> >> is directory based but still.
> >> 
> >> Plus I think it would serve as information leak to information 
> >> outside of the container.
> >> 
> >> An operation to get a user namespace file descriptor from some kernel
> >> object sounds reasonably sane.
> >> 
> >> A great big list of things sounds about as scary as it can get.  This 
> >> is not the time to be making it easier to escape from containers.
> >
> > To be honest, I think this argument is rubbish.  If we're afraid of
> > giving out a list of all the namespaces, it means we're afraid there's
> > some security bug and we're trying to obscure it by making the list
> > hard to get.  All we've done is allayed fears about the bug but the
> > hackers still know the portals to get through.
> >
> > If such a bug exists, it will be possible to exploit it by simply
> > reconstructing the information from the individual process directories,
> > so obscurity doesn't protect us and all it does is give us a false
> > sense of security.   If such a bug doesn't exist, then all the security
> > mechanisms currently in place (like no re-entry to prior namespace)
> > should protect us and we can give out the list.
> >
> > Let's deal with the world as we'd like it to be (no obscure namespace
> > bugs) and accept the consequences and the responsibility for fixing
> > them if we turn out to be slightly incorrect.  We'll end up in a far
> > better place than security by obscurity would land us.
> 
> No.  That is not the fear.  The permission checks on /proc/self/ns/xxx
> are different than if the namespace is bind mounted somewhere.
> 
> That was done deliberately and with a reasonable amount of forethought.
> You are asking to throw those permission checks out.   The answer is no.
> 
> Furthermore there is a much clearer reason not to go with a list of all
> namespaces. A list of all namespaces breaks CRIU.  As you have described
> it the list will change depending upon which machine you restore a
> checkpoint on.  I honestly don't know what kind of havoc that will cause
> but it is certainly something we won't be able to checkpoint no matter
> how hard we try.

That's right. I hadn't thought about this.

> 
> A global list of namespaces especially of the kind that you can open
> and get a handle to the namespace is just not appropriate.
> 
> I know inode numbers come darn close to names but they aren't really
> names and if it comes to it we can figure out how to preserve an
> application's view of it all across a checkpoint/restart.  So far it
> hasn't proven necessary to preserve any inode numbers across
> checkpoint/restart but again it is theoretically possible if it becomes
> necessary.
> 
> Throwing away checkpoint/restart support for the sake of
> checkpoint/restart is a no-go.
> 
> Containers fundamentally imply you don't have global visibility,
> and that is a good thing.

All these thoughts about security make me think that kcmp is what we
should use here. Maybe something like this:

kcmp(pid1, pid2, KCMP_NS_USERNS, fd1, fd2)

- to check whether the userns of the fd1 namespace is equal to the fd2 userns

kcmp(pid1, pid2, KCMP_NS_PARENT, fd1, fd2)

- to check whether the parent namespace of the fd1 pidns is equal to the fd2 pidns.

fd1 and fd2 are file descriptors for namespace files.

So if we want to build a hierarchy, we need to collect all namespaces
and then enumerate them to check dependencies with the help of kcmp.
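
The "collect all namespaces" step can already be sketched in userspace.  KCMP_NS_USERNS and KCMP_NS_PARENT are only proposed here and do not exist, so this minimal Python sketch does not call kcmp; it assumes instead that a namespace can be identified by the (device, inode) pair of its /proc/<pid>/ns/<type> file.  The helper names ns_id and collect_namespaces are hypothetical, and the sketch only groups processes by namespace; it says nothing about owner or parent relationships.

```python
import os
from collections import defaultdict


def ns_id(path):
    """Identify a namespace by the (device, inode) pair of its nsfs file."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


def collect_namespaces(ns_type="user", proc="/proc"):
    """Map each visible namespace of one type to the PIDs that live in it."""
    groups = defaultdict(list)
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            key = ns_id(os.path.join(proc, entry, "ns", ns_type))
        except OSError:
            continue  # process exited, or we lack permission to look
        groups[key].append(int(entry))
    return {key: sorted(pids) for key, pids in groups.items()}


if __name__ == "__main__":
    for key, pids in collect_namespaces("user").items():
        print(key, pids)
```

Only namespaces that have at least one process visible through this procfs are found, so the result stays bounded by the caller's own view of the system.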

> 
> Eric


Re: [CRIU] Introspecting userns relationships to other namespaces?

2016-07-08 Thread Andrew Vagin
On Fri, Jul 08, 2016 at 07:35:33AM -0700, James Bottomley wrote:
> On Fri, 2016-07-08 at 02:44 -0500, Eric W. Biederman wrote:
> > Andrew Vagin  writes:
> > 
> > > On Wed, Jul 06, 2016 at 10:46:33AM -0500, Eric W. Biederman wrote:
> > > > "Serge E. Hallyn"  writes:
> > > > 
> > > > > > On Wed, Jul 06, 2016 at 10:41:48AM +0200, Michael Kerrisk
> > > > > > (man-pages) wrote:
> > > > > > > [Rats! Doing now what I should have done to start with.
> > > > > > Looping some
> > > > > > lists and CRIU and other possibly relevant people into this
> > > > > > conversation]
> > > > > > 
> > > > > > Hi Eric,
> > > > > > 
> > > > > > On 5 July 2016 at 23:47, Eric W. Biederman <
> > > > > > ebied...@xmission.com> wrote:
> > > > > > > "Michael Kerrisk (man-pages)" 
> > > > > > > writes:
> > > > > > > 
> > > > > > > > Hi Eric,
> > > > > > > > 
> > > > > > > > I have a question. Is there any way currently to discover 
> > > > > > > > which user namespace a particular nonuser namespace is 
> > > > > > > > governed by? Maybe I am missing something, but there does 
> > > > > > > > not seem to be a way to do this. Also, can one discover 
> > > > > > > > which userns is the parent of a given userns? Again, I 
> > > > > > > > can't see a way to do this.
> > > > > > > > 
> > > > > > > > The point here is introspecting so that a process might 
> > > > > > > > determine what its capabilities are when operating on 
> > > > > > > > some resource governed by a (nonuser) namespace.
> > > > > > > 
> > > > > > > > To the best of my knowledge there is not an interface 
> > > > > > > to get that information.  It would be good to have such an 
> > > > > > > interface for no other reason than the CRIU folks are going 
> > > > > > > to need it at some point.  I am a bit surprised they have
> > > > > > > not complained yet.
> > > > > 
> > > > > I don't think they need it.  They do in fact have what they 
> > > > > need.  Assume you have tasks T1, T2, T1_1 and T2_1;  T1 and T2 
> > > > > are in init_user_ns;  T1 spawned T1_1 in a new userns;  T2 
> > > > > spawned T2_1 which setns()d to T1_1's ns. There's some
> > > > > {handwave} uid mapping, does not matter.
> > > > > 
> > > > > At restart, it doesn't matter which task originally created the 
> > > > > new userns. criu knows T1_1 and T2_1 are in the same userns; 
> > > > >  it creates the userns, sets up the mapping, and T1_1 and T2_1
> > > > > setns() to it.
> > > > 
> > > > Given that the simple cases are so easy it probably doesn't 
> > > > matter in that sense.
> > > > 
> > > > However we now have the case where user namespaces own pid 
> > > > namespaces, and uts namespaces, and network namespaces, and ipc 
> > > > namespaces, and filesystems.  Throw in some mount propagation and 
> > > > use of setns and things could get confusing.   It is something 
> > > > that will need to be figured out if CRIU is going to properly 
> > > > checkpoint containers containing containers containing containers 
> > > > containing containers.
> > > 
> > > It isn't a joke:). We have a few requests to support CR of 
> > > containers with Docker containers inside. And we are going to start 
> > > this task in a near future, so we would like to have interface to 
> > > get dependencies between namespaces too.
> > > 
> > > BTW: CRIU already supports nested mount namespaces, because systemd
> > > creates them for services.
> > 
> > The tricky part about this and what messes up James proposed plan is
> > that the interface needs to be something that returns a namespace 
> > file descriptor.  So we can't print something out in a simple text
> > file.
> 
> I actually described two problems: the first was how we get the
> information in the first place.  Currently the owning or parent user_ns
> is tucked inside an opaque structure.  I think we need to move that to
> ns_common where it would be the owning userns for all non-user
> namespaces and the parent for the userns.

I agree with this.

> 
> Once we actually have the information, we can also add a set of proc
> links, say either
> 
> /proc//ns/X-userns
> 
> Which might be a bit messy since it doubles the number of files, or
> perhaps in a simple directory.

In this case we will need to enter each namespace to build a full
chain of dependencies.

It's tricky, because once we enter a child userns, we can't enter
a parent userns from the same process, so to get the next branch
we will need to create a new process.

process A
|
init_user_ns->child_user_ns_1->child_userns_2

fork() -> B
  B: setns(/proc/A/ns/userns-parent)
     readlink(/proc/B/ns/userns)

fork() -> C
  C: setns(/proc/B/ns/userns-parent)
     readlink(/proc/C/ns/userns)
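
To make the cost of the scheme above concrete, here is a small model of the walk in Python. It does not call fork() or setns(); the userns-parent entry is only a proposal in this thread, so a plain dict (parent_of, hypothetical) stands in for it, and the function merely counts how many helper processes the fork-per-hop approach would need.

```python
from typing import Dict, List, Optional, Tuple


def walk_to_root(
    start: str, parent_of: Dict[str, Optional[str]]
) -> Tuple[List[str], int]:
    """Return (chain, helpers) for a walk from start up to the root userns.

    chain lists the namespaces visited, starting with start.
    helpers counts the processes that would have to be forked, assuming
    one fresh process per hop as in the diagram above.
    """
    chain = [start]
    helpers = 0
    seen = {start}
    current = start
    while parent_of.get(current) is not None:
        current = parent_of[current]
        if current in seen:
            # A real hierarchy has no cycles; guard against bad input anyway.
            raise ValueError("cycle in parent map")
        seen.add(current)
        chain.append(current)
        helpers += 1  # models one fork() + setns() per hop
    return chain, helpers
```

For the three-level example in the diagram this gives two helper processes (B and C), and in general one per level between the starting namespace and the root.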


> 
> > Well I suppose we could print an device number and inode number pair.
> > But then someone would still have to scour processes looking for a 
> > user namespace so that is likely less than ideal.
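
The device-and-inode idea quoted above, and the scouring it would still require, can be sketched as follows. This is a minimal sketch, assuming a Linux procfs layout with ns/ entries per process; the helper names (ns_id, pids_in_namespace) are made up for illustration. It only shows how a (device, inode) pair identifies a namespace and how one would search processes for it. It does not yield a file descriptor for a namespace that no visible process is in, which is the limitation being discussed.

```python
import os
from typing import List, Tuple


def ns_id(path: str) -> Tuple[int, int]:
    """Identify a namespace file by its (device, inode) pair."""
    st = os.stat(path)  # follows the link in ns/ to the namespace inode
    return (st.st_dev, st.st_ino)


def pids_in_namespace(
    target: Tuple[int, int], ns_name: str = "user", proc_root: str = "/proc"
) -> List[int]:
    """Scan proc_root for processes whose ns/<ns_name> matches target."""
    matches = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self"
        try:
            if ns_id(os.path.join(proc_root, entry, "ns", ns_name)) == target:
                matches.append(int(entry))
        except OSError:
            continue  # process exited, or we lack permission to look
    return sorted(matches)
```

A caller would typically start from ns_id("/proc/self/ns/user") or from a pair printed by some other interface, then open /proc/<pid>/ns/user for one of the returned pids to obtain a descriptor.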
> 
> There's no reason any of the proposed 

Re: Introspecting userns relationships to other namespaces?

2016-07-08 Thread Eric W. Biederman
"W. Trevor King"  writes:

> On Thu, Jul 07, 2016 at 08:01:52AM -0700, James Bottomley wrote:
>> In theory, we could get nsfs to show this information as an option
>> (just add a show_options entry to the superblock ops), but the
>> problem is that although each namespace has a parent user_ns,
>> there's no way to get it without digging in the namespace specific
>> structure.  Probably we should restructure to move it into
>> ns_common, then we could display it (and enforce all namespaces
>> having owning user_ns) but it would be a reasonably large (but
>> mechanical) change.
>
> It sounds like everyone is either positive or neutral on this
> groundwork, even if we haven't decided if/how to expose the
> information to userspace.  I'm happy to work up a patch while the rest
> of the discussion continues.  I'm also happy to let someone else work
> up the patch, if anyone else is chomping at the bit ;).

I am dubious about moving all of the user namespace members into ns_common.

I would be happy to be proved wrong, but I suspect that in the cases where we
actually use that user namespace the code will become uglier.  Making
the ordinary uses uglier to make a rare corner case nicer is the wrong
trade-off.

But feel free to try; it is certainly worth doing if it doesn't make the
code that uses the user namespaces uglier.
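
For readers who want the trade-off spelled out, here is a toy model of the restructuring under discussion, written in Python and not taken from any kernel patch. NsCommon loosely mirrors struct ns_common; the owner field is the hypothetical addition (the owning userns for non-user namespaces, the parent for a userns). The class and field names are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NsCommon:
    """Toy stand-in for the kernel's struct ns_common."""
    inum: int
    # Hypothetical new field: owning userns, or the parent for a userns.
    owner: Optional["NsCommon"] = None


@dataclass
class UserNamespace:
    ns: NsCommon


@dataclass
class PidNamespace:
    ns: NsCommon


def owner_chain(ns: NsCommon) -> List[int]:
    """Follow owner links generically, without knowing the namespace type."""
    chain = []
    while ns.owner is not None:
        ns = ns.owner
        chain.append(ns.inum)
    return chain
```

The generic walk in owner_chain is the gain. The cost is the extra hop at every ordinary use site: code that today reads a per-type field would instead go through the embedded common structure (pid_ns.ns.owner in this model), which is the kind of ugliness the reply above is worried about.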

Eric



Re: [CRIU] Introspecting userns relationships to other namespaces?

2016-07-08 Thread Eric W. Biederman
James Bottomley  writes:

> On Fri, 2016-07-08 at 18:52 -0500, Eric W. Biederman wrote:
>> James Bottomley  writes:
>> 
>> > On July 8, 2016 1:38:19 PM PDT, Andrew Vagin 
>> > wrote:
>> 
>> > > What do you think about the idea to mount nsfs and be able to 
>> > > look up any alive namespace by inum:
>> > 
>> > I think I like it.  It will give us a way to enter any extant
>> > namespace.  It will work for Eric's fs namespaces as well.  Perhaps 
>> > a /process/ns/ Directory?
>
> As you understood, I meant /proc/ns/ (damn mobile phone
> completions).
>
>> *Shivers*
>> 
>> That makes it very easy to bypass any existing controls that exist 
>> for getting at namespaces.  It is true that everything of that kind 
>> is directory based but still.
>> 
>> Plus I think it would serve as information leak to information 
>> outside of the container.
>> 
>> An operation to get a user namespace file descriptor from some kernel
>> object sounds reasonably sane.
>> 
>> A great big list of things sounds about as scary as it can get.  This 
>> is not the time to be making it easier to escape from containers.
>
> To be honest, I think this argument is rubbish.  If we're afraid of
> giving out a list of all the namespaces, it means we're afraid there's
> some security bug and we're trying to obscure it by making the list
> hard to get.  All we've done is allayed fears about the bug but the
> hackers still know the portals to get through.
>
> If such a bug exists, it will be possible to exploit it by simply
> reconstructing the information from the individual process directories,
> so obscurity doesn't protect us and all it does is give us a false
> sense of security.   If such a bug doesn't exist, then all the security
> mechanisms currently in place (like no re-entry to prior namespace)
> should protect us and we can give out the list.
>
> Let's deal with the world as we'd like it to be (no obscure namespace
> bugs) and accept the consequences and the responsibility for fixing
> them if we turn out to be slightly incorrect.  We'll end up in a far
> better place than security by obscurity would land us.

No.  That is not the fear.  The permission checks on /proc/self/ns/xxx
are different from those applied when the namespace is bind mounted
somewhere.

That was done deliberately and with a reasonable amount of forethought.
You are asking to throw those permission checks out.   The answer is no.

Furthermore, there is a much clearer reason not to go with a list of all
namespaces: a list of all namespaces breaks CRIU.  As you have described
it, the list will change depending upon which machine you restore a
checkpoint on.  I honestly don't know what kind of havoc that will cause,
but it is certainly something we won't be able to checkpoint no matter
how hard we try.

A global list of namespaces, especially of the kind that you can open
to get a handle to the namespace, is just not appropriate.

I know inode numbers come darn close to names, but they aren't really
names, and if it comes to it we can figure out how to preserve an
application's view of them across a checkpoint/restart.  So far it
hasn't proven necessary to preserve any inode numbers across
checkpoint/restart, but again it is theoretically possible if it becomes
necessary.

Throwing away checkpoint/restart support for the sake of
checkpoint/restart is a no-go.

Containers fundamentally imply you don't have global visibility,
and that is a good thing.

Eric


Re: Introspecting userns relationships to other namespaces?

2016-07-08 Thread W. Trevor King
On Thu, Jul 07, 2016 at 08:01:52AM -0700, James Bottomley wrote:
> In theory, we could get nsfs to show this information as an option
> (just add a show_options entry to the superblock ops), but the
> problem is that although each namespace has a parent user_ns,
> there's no way to get it without digging in the namespace specific
> structure.  Probably we should restructure to move it into
> ns_common, then we could display it (and enforce all namespaces
> having owning user_ns) but it would be a reasonably large (but
> mechanical) change.

It sounds like everyone is either positive or neutral on this
groundwork, even if we haven't decided if/how to expose the
information to userspace.  I'm happy to work up a patch while the rest
of the discussion continues.  I'm also happy to let someone else work
up the patch, if anyone else is chomping at the bit ;).

Cheers,
Trevor

-- 
This email may be signed or encrypted with GnuPG (http://www.gnupg.org).
For more information, see http://en.wikipedia.org/wiki/Pretty_Good_Privacy



Re: [CRIU] Introspecting userns relationships to other namespaces?

2016-07-08 Thread James Bottomley
On Fri, 2016-07-08 at 18:52 -0500, Eric W. Biederman wrote:
> James Bottomley  writes:
> 
> > On July 8, 2016 1:38:19 PM PDT, Andrew Vagin 
> > wrote:
> 
> > > What do you think about the idea to mount nsfs and be able to 
> > > look up any alive namespace by inum:
> > 
> > I think I like it.  It will give us a way to enter any extant
> > namespace.  It will work for Eric's fs namespaces as well.  Perhaps 
> > a /process/ns/ Directory?

As you understood, I meant /proc/ns/ (damn mobile phone
completions).

> *Shivers*
> 
> That makes it very easy to bypass any existing controls that exist 
> for getting at namespaces.  It is true that everything of that kind 
> is directory based but still.
> 
> Plus I think it would serve as information leak to information 
> outside of the container.
> 
> An operation to get a user namespace file descriptor from some kernel
> object sounds reasonably sane.
> 
> A great big list of things sounds about as scary as it can get.  This 
> is not the time to be making it easier to escape from containers.

To be honest, I think this argument is rubbish.  If we're afraid of
giving out a list of all the namespaces, it means we're afraid there's
some security bug and we're trying to obscure it by making the list
hard to get.  All we've done is allayed fears about the bug but the
hackers still know the portals to get through.

If such a bug exists, it will be possible to exploit it by simply
reconstructing the information from the individual process directories,
so obscurity doesn't protect us and all it does is give us a false
sense of security.   If such a bug doesn't exist, then all the security
mechanisms currently in place (like no re-entry to prior namespace)
should protect us and we can give out the list.

Let's deal with the world as we'd like it to be (no obscure namespace
bugs) and accept the consequences and the responsibility for fixing
them if we turn out to be slightly incorrect.  We'll end up in a far
better place than security by obscurity would land us.

James



Re: [CRIU] Introspecting userns relationships to other namespaces?

2016-07-08 Thread Eric W. Biederman
James Bottomley  writes:

> On July 8, 2016 1:38:19 PM PDT, Andrew Vagin  wrote:

>>What do you think about the idea to mount nsfs and be able to look up
>>any alive namespace by inum:
>
> I think I like it.  It will give us a way to enter any extant
> namespace.  It will work for Eric's fs namespaces as well.  Perhaps a
> /process/ns/ Directory?

*Shivers*

That makes it very easy to bypass any existing controls that exist for
getting at namespaces.  It is true that everything of that kind is
directory based but still.

Plus I think it would serve as an information leak about things outside
of the container.

An operation to get a user namespace file descriptor from some kernel
object sounds reasonably sane.

A great big list of things sounds about as scary as it can get.  This is
not the time to be making it easier to escape from containers.

Eric


Re: [CRIU] Introspecting userns relationships to other namespaces?

2016-07-08 Thread James Bottomley
On July 8, 2016 1:38:19 PM PDT, Andrew Vagin  wrote:
>On Fri, Jul 08, 2016 at 07:35:33AM -0700, James Bottomley wrote:
>> On Fri, 2016-07-08 at 02:44 -0500, Eric W. Biederman wrote:
>> > Andrew Vagin  writes:
>> > 
>> > > On Wed, Jul 06, 2016 at 10:46:33AM -0500, Eric W. Biederman wrote:
>> > > > "Serge E. Hallyn"  writes:
>> > > > 
>> > > > > On Wed, Jul 06, 2016 at 10:41:48AM +0200, Michael Kerrisk
>> > > > > (man-pages) wrote:
>> > > > > > [Rats! Doing now what I should have done to start with.
>> > > > > > Looping some
>> > > > > > lists and CRIU and other possibly relevant people into this
>> > > > > > conversation]
>> > > > > > 
>> > > > > > Hi Eric,
>> > > > > > 
>> > > > > > On 5 July 2016 at 23:47, Eric W. Biederman <
>> > > > > > ebied...@xmission.com> wrote:
>> > > > > > > "Michael Kerrisk (man-pages)" 
>> > > > > > > writes:
>> > > > > > > 
>> > > > > > > > Hi Eric,
>> > > > > > > > 
>> > > > > > > > I have a question. Is there any way currently to
>discover 
>> > > > > > > > which user namespace a particular nonuser namespace is 
>> > > > > > > > governed by? Maybe I am missing something, but there
>does 
>> > > > > > > > not seem to be a way to do this. Also, can one discover
>
>> > > > > > > > which userns is the parent of a given userns? Again, I 
>> > > > > > > > can't see a way to do this.
>> > > > > > > > 
>> > > > > > > > The point here is introspecting so that a process might
>
>> > > > > > > > determine what its capabilities are when operating on 
>> > > > > > > > some resource governed by a (nonuser) namespace.
>> > > > > > > 
>> > > > > > > To the best of my knowledge that there is not an
>interface 
>> > > > > > > to get that information.  It would be good to have such
>an 
>> > > > > > > interface for no other reason than the CRIU folks are
>going 
>> > > > > > > to need it at some point.  I am a bit surprised they have
>> > > > > > > not complained yet.
>> > > > > 
>> > > > > I don't think they need it.  They do in fact have what they 
>> > > > > need.  Assume you have tasks T1, T2, T1_1 and T2_1;  T1 and
>T2 
>> > > > > are in init_user_ns;  T1 spawned T1_1 in a new userns;  T2 
>> > > > > spawned T2_1 which setns()d to T1_1's ns. There's some
>> > > > > {handwave} uid mapping, does not matter.
>> > > > > 
>> > > > > At restart, it doesn't matter which task originally created
>the 
>> > > > > new userns. criu knows T1_1 and T2_1 are in the same userns; 
>> > > > >  it creates the userns, sets up the mapping, and T1_1 and
>T2_1
>> > > > > setns() to it.
>> > > > 
>> > > > Given that the simple cases are so easy it probably doesn't 
>> > > > matter in that sense.
>> > > > 
>> > > > However we now have the case where user namespaces own pid 
>> > > > namespaces, and uts namespaces, and network namespaces, and ipc
>
>> > > > namespaces, and filesystems.  Throw in some mount propagation
>and 
>> > > > use of setns and things could get confusing.   It is something 
>> > > > that will need to be figured out if CRIU is going to properly 
>> > > > checkpoint containers containing containers containing
>containers 
>> > > > containing containers.
>> > > 
>> > > It isn't a joke:). We have a few requests to support CR of 
>> > > containers with Docker containers inside. And we are going to
>start 
>> > > this task in a near future, so we would like to have interface to
>
>> > > get dependencies between namespaces too.
>> > > 
>> > > BTW: CRIU already supports nested mount namespaces, because
>systemd
>> > > creates them for services.
>> > 
>> > The tricky part about this and what messes up James proposed plan
>is
>> > that the interface needs to be something that returns a namespace 
>> > file descriptor.  So we can't print something out in a simple text
>> > file.
>> 
>> I actually described two problems: the first was how we get the
>> information in the first place.  Currently the owning or parent
>user_ns
>> is tucked inside an opaque structure.  I think we need to move that
>to
>> ns_common where it would be the owning userns for all non-user
>> namespaces and the parent for the userns.
>
>I'm agree with this.
>
>> 
>> Once we actually have the information, we can also add a set of proc
>> links, say either
>> 
>> /proc//ns/X-userns
>> 
>> Which might be a bit messy since it doubles the number of files, or
>> perhaps in a simple directory.
>
>In this case we will need to enter into each namespace to build a full
>chain of dependencies.
>
>It's tricky, because if we enter into a child userns, we can't to enter
>into a parent userns from the same process, so to get the next branch,
>we will need to create a new process.
>
>   process A
>   |
>init_user_ns->child_user_ns_1->child_userns_2
>
>fork() -> B
>  B: setns(/proc/A/ns/userns-parent)
>readlink(/proc/B/ns/userns)
>
>fork() -> C
>  C: setns(/proc/B/ns/userns-parent)

Re: [CRIU] Introspecting userns relationships to other namespaces?

2016-07-08 Thread James Bottomley
On July 8, 2016 1:38:19 PM PDT, Andrew Vagin  wrote:
>On Fri, Jul 08, 2016 at 07:35:33AM -0700, James Bottomley wrote:
>> On Fri, 2016-07-08 at 02:44 -0500, Eric W. Biederman wrote:
>> > Andrew Vagin  writes:
>> > 
>> > > On Wed, Jul 06, 2016 at 10:46:33AM -0500, Eric W. Biederman wrote:
>> > > > "Serge E. Hallyn"  writes:
>> > > > 
>> > > > > On Wed, Jul 06, 2016 at 10:41:48AM +0200, Michael Kerrisk
>> > > > > (man-pages) wrote:
>> > > > > > [Rats! Doing now what I should have done to start with.
>> > > > > > Looping some lists and CRIU and other possibly relevant
>> > > > > > people into this conversation]
>> > > > > > 
>> > > > > > Hi Eric,
>> > > > > > 
>> > > > > > On 5 July 2016 at 23:47, Eric W. Biederman <
>> > > > > > ebied...@xmission.com> wrote:
>> > > > > > > "Michael Kerrisk (man-pages)" 
>> > > > > > > writes:
>> > > > > > > 
>> > > > > > > > Hi Eric,
>> > > > > > > > 
>> > > > > > > > I have a question. Is there any way currently to discover
>> > > > > > > > which user namespace a particular nonuser namespace is
>> > > > > > > > governed by? Maybe I am missing something, but there does
>> > > > > > > > not seem to be a way to do this. Also, can one discover
>> > > > > > > > which userns is the parent of a given userns? Again, I
>> > > > > > > > can't see a way to do this.
>> > > > > > > > 
>> > > > > > > > The point here is introspecting so that a process might
>> > > > > > > > determine what its capabilities are when operating on
>> > > > > > > > some resource governed by a (nonuser) namespace.
>> > > > > > > 
>> > > > > > > To the best of my knowledge there is not an interface
>> > > > > > > to get that information.  It would be good to have such an
>> > > > > > > interface for no other reason than the CRIU folks are going
>> > > > > > > to need it at some point.  I am a bit surprised they have
>> > > > > > > not complained yet.
>> > > > > 
>> > > > > I don't think they need it.  They do in fact have what they
>> > > > > need.  Assume you have tasks T1, T2, T1_1 and T2_1;  T1 and T2
>> > > > > are in init_user_ns;  T1 spawned T1_1 in a new userns;  T2
>> > > > > spawned T2_1 which setns()d to T1_1's ns. There's some
>> > > > > {handwave} uid mapping, does not matter.
>> > > > > 
>> > > > > At restart, it doesn't matter which task originally created the
>> > > > > new userns. criu knows T1_1 and T2_1 are in the same userns;
>> > > > > it creates the userns, sets up the mapping, and T1_1 and T2_1
>> > > > > setns() to it.
>> > > > 
>> > > > Given that the simple cases are so easy it probably doesn't
>> > > > matter in that sense.
>> > > > 
>> > > > However we now have the case where user namespaces own pid
>> > > > namespaces, and uts namespaces, and network namespaces, and ipc
>> > > > namespaces, and filesystems.  Throw in some mount propagation and
>> > > > use of setns and things could get confusing.   It is something
>> > > > that will need to be figured out if CRIU is going to properly
>> > > > checkpoint containers containing containers containing containers
>> > > > containing containers.
>> > > 
>> > > It isn't a joke :). We have a few requests to support CR of
>> > > containers with Docker containers inside. And we are going to start
>> > > this task in the near future, so we would like to have an interface
>> > > to get dependencies between namespaces too.
>> > > 
>> > > BTW: CRIU already supports nested mount namespaces, because systemd
>> > > creates them for services.
>> > 
>> > The tricky part about this, and what messes up James's proposed plan,
>> > is that the interface needs to be something that returns a namespace
>> > file descriptor.  So we can't print something out in a simple text
>> > file.
>> 
>> I actually described two problems: the first was how we get the
>> information in the first place.  Currently the owning or parent user_ns
>> is tucked inside an opaque structure.  I think we need to move that to
>> ns_common where it would be the owning userns for all non-user
>> namespaces and the parent for the userns.
>
>I agree with this.
>
>> 
>> Once we actually have the information, we can also add a set of proc
>> links, say either
>> 
>> /proc//ns/X-userns
>> 
>> Which might be a bit messy since it doubles the number of files, or
>> perhaps in a simple directory.
>
>In this case we will need to enter into each namespace to build a full
>chain of dependencies.
>
>It's tricky, because if we enter a child userns, we can't enter
>a parent userns from the same process, so to get the next branch,
>we will need to create a new process.
>
>   process A
>   |
>init_user_ns->child_user_ns_1->child_userns_2
>
>fork() -> B
>  B: setns(/proc/A/ns/userns-parent)
>readlink(/proc/B/ns/userns)
>
>fork() -> C
>  C: setns(/proc/B/ns/userns-parent)
>readlink(/proc/C/ns/userns)
>
>
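
The fork-per-level walk above comes from having to setns() into each namespace before reading the next link. As a contrast, here is a rough sketch that assumes an fd-returning parent lookup instead of the userns-parent proc files: the NS_GET_PARENT ioctl as later documented in ioctl_ns(2) (Linux 4.9 and newer). The request number is hard-coded for the x86/arm encoding, and userns_chain and max_depth are hypothetical names.

```python
import fcntl
import os

# Assumption: _IO(0xb7, 0x2) per ioctl_ns(2) on Linux 4.9+ (x86/arm encoding).
NS_GET_PARENT = (0xB7 << 8) | 0x2


def userns_chain(start="/proc/self/ns/user", max_depth=33):
    """Return inode numbers from the starting user namespace up to the
    outermost ancestor visible to the caller, without fork() or setns()."""
    chain = []
    try:
        fd = os.open(start, os.O_RDONLY)
    except OSError:
        return chain
    try:
        # max_depth is only a safety bound; the kernel caps userns nesting.
        for _ in range(max_depth):
            chain.append(os.fstat(fd).st_ino)
            try:
                parent = fcntl.ioctl(fd, NS_GET_PARENT)
            except OSError:
                # Typically EPERM: the parent is outside the caller's scope.
                break
            os.close(fd)
            fd = parent
    finally:
        os.close(fd)
    return chain
```

Because each step yields a descriptor rather than a path, the walk never changes the caller's own namespace membership and needs no helper processes.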
>> 
>> > Well I suppose we could print an device number and inode 

Re: [CRIU] Introspecting userns relationships to other namespaces?

2016-07-08 Thread W. Trevor King
On Fri, Jul 08, 2016 at 01:38:19PM -0700, Andrew Vagin wrote:
> What do you think about the idea to mount nsfs and be able to look up
> any alive namespace by inum:
>
>   $ tree .
>   .
>   ├── mnt{inum}
>   │   └── user -> ../user{inum}
>   ├── pid{inum}
>   │   ├── pid{inum}
>   │   │   └── user -> ../../user{inum}/user{inum}
>   │   └── user -> ../user{inum}
>   └── user{inum}
>       └── user{inum}
>
> https://lkml.org/lkml/2016/7/8/59
>
> I think it solves all requirements which were mentioned in this thread.

It may need an additional entry per directory for the bit you setns.
Maybe ‘handle’?

  $ tree .
  .
  ├── mnt{inum}
  │   ├── handle -> mnt:[{inum}]
  │   └── user -> ../user{inum}
  …

but that's not a major revision.

> On Fri, Jul 08, 2016 at 07:35:33AM -0700, James Bottomley wrote:
> > On Fri, 2016-07-08 at 02:44 -0500, Eric W. Biederman wrote:
> > > Starting with 4.8 we are also going to need to be able to
> > > retrieve the user namespace owner of filesystems.  That will be
> > > an interesting mix.
> >
> > This is per mount point, isn't it? so it can't be in /proc/fs/ and
> > it would have to be per local mount tree.  Yes, that is a bit
> > nasty.  Sounds like we might need to unfold mount or mountinfo
> > into something that has one directory per entry?
>
> If we are able to look up namespaces in nsfs by inum, we can
> print a userns inum in mountinfo.

With the tree view you can find a namespace by inum (if it's one of
your descendants), but it's not going to be particularly efficient
(you'll have to walk the tree).  Folks that need to do that quickly
can index the tree (which would be fairly straightforward if the nsfs
mount supports inotify), but it would be nice to have a more elegant
solution for this use-case.
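
As a rough sketch of the "index the tree" idea, assuming the proposed layout quoted above (directories named "<type><inum>", nested by hierarchy, with "user" symlinks pointing at owners); the layout is only a proposal and index_by_inum is a made-up helper:

```python
import os
import re

# Assumption: entries are directories named "<type><inum>", as in the
# proposed tree quoted above; the "user" owner symlinks are not indexed.
ENTRY = re.compile(r"^(cgroup|ipc|mnt|net|pid|user|uts)(\d+)$")


def index_by_inum(root):
    """Walk the tree once and map (type, inum) -> path relative to root."""
    index = {}
    for dirpath, dirnames, _ in os.walk(root, followlinks=False):
        for name in dirnames:
            match = ENTRY.match(name)
            full = os.path.join(dirpath, name)
            if match and not os.path.islink(full):
                key = (match.group(1), int(match.group(2)))
                index[key] = os.path.relpath(full, root)
    return index
```

The walk is still linear in the number of visible namespaces; the index only moves that cost out of each lookup, and it would need refreshing (via inotify or a rescan) as namespaces come and go.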

Cheers,
Trevor

-- 
This email may be signed or encrypted with GnuPG (http://www.gnupg.org).
For more information, see http://en.wikipedia.org/wiki/Pretty_Good_Privacy


Re: [CRIU] Introspecting userns relationships to other namespaces?

2016-07-08 Thread Andrew Vagin
On Wed, Jul 06, 2016 at 10:46:33AM -0500, Eric W. Biederman wrote:
> "Serge E. Hallyn"  writes:
> 
> > On Wed, Jul 06, 2016 at 10:41:48AM +0200, Michael Kerrisk (man-pages) wrote:
> >> [Rats! Doing now what I should have done to start with. Looping some
> >> lists and CRIU and other possibly relevant people into this
> >> conversation]
> >> 
> >> Hi Eric,
> >> 
> >> On 5 July 2016 at 23:47, Eric W. Biederman  wrote:
> >> > "Michael Kerrisk (man-pages)"  writes:
> >> >
> >> >> Hi Eric,
> >> >>
> >> >> I have a question. Is there any way currently to discover which
> >> >> user namespace a particular nonuser namespace is governed by?
> >> >> Maybe I am missing something, but there does not seem to be a
> >> >> way to do this. Also, can one discover which userns is the
> >> >> parent of a given userns? Again, I can't see a way to do this.
> >> >>
> >> >> The point here is introspecting so that a process might determine
> >> >> what its capabilities are when operating on some resource governed
> >> >> by a (nonuser) namespace.
> >> >
> >> > To the best of my knowledge there is not an interface to get that
> >> > information.  It would be good to have such an interface for no other
> >> > reason than the CRIU folks are going to need it at some point.  I am a
> >> > bit surprised they have not complained yet.
> >
> > I don't think they need it.  They do in fact have what they need.  Assume
> > you have tasks T1, T2, T1_1 and T2_1;  T1 and T2 are in init_user_ns;  T1
> > spawned T1_1 in a new userns;  T2 spawned T2_1 which setns()d to T1_1's ns.
> > There's some {handwave} uid mapping, does not matter.
> >
> > At restart, it doesn't matter which task originally created the new userns.
> > criu knows T1_1 and T2_1 are in the same userns;  it creates the userns,
> > sets up the mapping, and T1_1 and T2_1 setns() to it.
> 
> Given that the simple cases are so easy it probably doesn't matter in
> that sense.
> 
> However we now have the case where user namespaces own pid namespaces,
> and uts namespaces, and network namespaces, and ipc namespaces, and
> filesystems.  Throw in some mount propagation and use of setns and
> things could get confusing.   It is something that will need to be
> figured out if CRIU is going to properly checkpoint containers
> containing containers containing containers containing containers.

It isn't a joke :). We have a few requests to support CR of containers with
Docker containers inside. And we are going to start this task in the near
future, so we would like to have an interface to get dependencies between
namespaces too.

BTW: CRIU already supports nested mount namespaces, because systemd
creates them for services.
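
On the simple case quoted above, where it is enough to know which tasks share a userns: a minimal sketch of that grouping, comparing the /proc/<pid>/ns/user link text. This is only an illustration and not necessarily how CRIU collects it; the helper names are made up.

```python
import os
from collections import defaultdict


def userns_of(pid):
    """Return the link text of /proc/<pid>/ns/user, e.g. 'user:[4026531837]',
    or None if the task is gone or we may not inspect it."""
    try:
        return os.readlink(f"/proc/{pid}/ns/user")
    except OSError:
        return None


def group_by_userns(pids, lookup=userns_of):
    """Group PIDs that share a user namespace, skipping unreadable tasks."""
    groups = defaultdict(list)
    for pid in pids:
        ns = lookup(pid)
        if ns is not None:
            groups[ns].append(pid)
    return dict(groups)
```

This says which tasks sit in the same userns, but nothing about which userns owns a given pid, net or mount namespace, which is the part that needs a new interface.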

> 
> Did I mention I like recursion?
> 
> >> > That said in a normal use scenario I don't think that information is
> >> > needed.
> >> >
> >> > Do you have a particular use case besides checkpoint/restart where this
> >> > is useful?  That might help in coming up with a good userspace interface
> >> > for this information.
> >> 
> >> So, I spend a moderate amount of time working with people to introduce
> >> them to the namespaces infrastructure, and one topic that comes up now
> >> and then is introspection/visualization tools. For example,
> >> nowadays--thanks to the (bizarrely misnamed) NStgid and NSpid fields
> >> in /proc/PID/status--it's possible to (and someone I was working with
> >> did) write tools that introspect the PID namespace hierarchy to show
> >> all of the processes and their PIDs in each namespace instance. It's a
> >> natural enough thing to want to do, when confronted with the
> >> complexity of the namespaces.
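
A small sketch of the kind of tool described here, assuming the NSpid line of /proc/<pid>/status (Linux 4.1 and newer), which lists a task's PID in each nested PID namespace starting from the namespace of the procfs mount being read. The function name nspid_chain is hypothetical.

```python
def nspid_chain(status_text):
    """Return the PIDs listed on the NSpid line of /proc/<pid>/status,
    outermost PID namespace first, or [] if the field is absent."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            # Fields after the label are whitespace-separated integers.
            return [int(field) for field in line.split()[1:]]
    return []
```

The length of the returned list is the nesting depth of the task's PID namespace relative to the reader, which is enough to draw the PID hierarchy but says nothing about user namespace ownership.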
> >> 
> >> Someone else then asked me a question that led me to wonder about
> >> generally introspecting on the parental relationships between user
> >> namespaces and the association of other namespaces types with user
> >> namespaces. One use would be visualization, in order to understand the
> >> running system. Another would be to answer the question I already
> >> mentioned: what capability does process X have to perform operations
> >> on a resource governed by namespace Y?
> >
> > I agree they'll probably want it, but if we wait for a real need and
> > use case we can do a better job of providing what's needed.
> 
> That too, which is why I mentioned CRIU.  But yeah it will probably take
> a little while to get there.
> 
> Eric


Re: [CRIU] Introspecting userns relationships to other namespaces?

2016-07-08 Thread Andrew Vagin
On Wed, Jul 06, 2016 at 10:46:33AM -0500, Eric W. Biederman wrote:
> "Serge E. Hallyn"  writes:
> 
> > On Wed, Jul 06, 2016 at 10:41:48AM +0200, Michael Kerrisk (man-pages) wrote:
> >> [Rats! Doing now what I should have down to start with. Looping some
> >> lists and CRIU and other possibly relevant people into this
> >> conversation]
> >> 
> >> Hi Eric,
> >> 
> >> On 5 July 2016 at 23:47, Eric W. Biederman  wrote:
> >> > "Michael Kerrisk (man-pages)"  writes:
> >> >
> >> >> Hi Eric,
> >> >>
> >> >> I have a question. Is there any way currently to discover which
> >> >> user namespace a particular nonuser namespace is governed by?
> >> >> Maybe I am missing something, but there does not seem to be a
> >> >> way to do this. Also, can one discover which userns is the
> >> >> parent of a given userns? Again, I can't see a way to do this.
> >> >>
> >> >> The point here is introspecting so that a process might determine
> >> >> what its capabilities are when operating on some resource governed
> >> >> by a (nonuser) namespace.
> >> >
> >> > To the best of my knowledge that there is not an interface to get that
> >> > information.  It would be good to have such an interface for no other
> >> > reason than the CRIU folks are going to need it at some point.  I am a
> >> > bit surprised they have not complained yet.
> >
> > I don't think they need it.  They do in fact have what they need.  Assume
> > you have tasks T1, T2, T1_1 and T2_1;  T1 and T2 are in init_user_ns;  T1
> > spawned T1_1 in a new userns;  T2 spawned T2_1 which setns()d to T1_1's ns.
> > There's some {handwave} uid mapping, does not matter.
> >
> > At restart, it doesn't matter which task originally created the new userns.
> > criu knows T1_1 and T2_1 are in the same userns;  it creates the userns, 
> > sets
> > up the mapping, and T1_1 and T2_1 setns() to it.
> 
> Given that the simple cases are so easy it probably doesn't matter in
> that sense.
> 
> However we now have the case where user namespaces own pid namespaces,
> and uts namespaces, and network namespaces, and ipc namespaces, and
> filesystems.  Throw in some mount propagation and use of setns and
> things could get confusing.   It is something that will need to be
> figured out if CRIU is going to properly checkpoint containers
> containing containers containing containers containing containers.

It isn't a joke:). We have a few requests to support CR of containers with
Docker containers inside. And we are going to start this task in a near
future, so we would like to have interface to get dependencies between
namespaces too.

BTW: CRIU already supports nested mount namespaces, because systemd
creates them for services.

> 
> Did I mention I like recursion?
> 
> >> > That said in a normal use scenario I don't think that information is
> >> > needed.
> >> >
> >> > Do you have a particular use case besides checkpoint/restart where this
> >> > is useful?  That might help in coming up with a good userspace interface
> >> > for this information.
> >> 
> >> So, I spend a moderate amount of time working with people to introduce
> >> them to the namespaces infrastructure, and one topic that comes up now
> >> and this introspection/visualization tools. For example,
> >> nowadays--thanks to the (bizarrely misnamed) NStgid and NSpid fields
> >> in /proc/PID--it's possible to (and someone I was working with did)
> >> write tools that introspect the PID namespace hierarchy to show all of
> >> process's and their PIDs in the various namespace instance. It's a
> >> natural enough thing to want to do, when confronted with the
> >> complexity of the namespaces.
> >> 
> >> Someone else then asked me a question that led me to wonder about
> >> generally introspecting on the parental relationships between user
> >> namespaces and the association of other namespaces types with user
> >> namespaces. One use would be visualization, in order to understand the
> >> running system. Another would be to answer the question I already
> >> mentioned: what capability does process X have to perform operations
> >> on a resource governed by namespace Y?
> >
> > I agree they'll probably want it, but if we wait for a real need and
> > use case we can do a better job of providing what's needed.
> 
> That too, which is why I mentioned CRIU.  But yeah it will probably take
> a little while to get there.
> 
> Eric


Re: [CRIU] Introspecting userns relationships to other namespaces?

2016-07-08 Thread James Bottomley
On Fri, 2016-07-08 at 02:44 -0500, Eric W. Biederman wrote:
> Andrew Vagin  writes:
> 
> > On Wed, Jul 06, 2016 at 10:46:33AM -0500, Eric W. Biederman wrote:
> > > "Serge E. Hallyn"  writes:
> > > 
> > > > On Wed, Jul 06, 2016 at 10:41:48AM +0200, Michael Kerrisk (man
> > > > -pages) wrote:
> > > > > > [Rats! Doing now what I should have done to start with.
> > > > > Looping some
> > > > > lists and CRIU and other possibly relevant people into this
> > > > > conversation]
> > > > > 
> > > > > Hi Eric,
> > > > > 
> > > > > On 5 July 2016 at 23:47, Eric W. Biederman <
> > > > > ebied...@xmission.com> wrote:
> > > > > > "Michael Kerrisk (man-pages)" 
> > > > > > writes:
> > > > > > 
> > > > > > > Hi Eric,
> > > > > > > 
> > > > > > > I have a question. Is there any way currently to discover 
> > > > > > > which user namespace a particular nonuser namespace is 
> > > > > > > governed by? Maybe I am missing something, but there does 
> > > > > > > not seem to be a way to do this. Also, can one discover 
> > > > > > > which userns is the parent of a given userns? Again, I 
> > > > > > > can't see a way to do this.
> > > > > > > 
> > > > > > > The point here is introspecting so that a process might 
> > > > > > > determine what its capabilities are when operating on 
> > > > > > > some resource governed by a (nonuser) namespace.
> > > > > > 
> > > > > > > To the best of my knowledge there is not an interface
> > > > > > to get that information.  It would be good to have such an 
> > > > > > interface for no other reason than the CRIU folks are going 
> > > > > > to need it at some point.  I am a bit surprised they have
> > > > > > not complained yet.
> > > > 
> > > > I don't think they need it.  They do in fact have what they 
> > > > need.  Assume you have tasks T1, T2, T1_1 and T2_1;  T1 and T2 
> > > > are in init_user_ns;  T1 spawned T1_1 in a new userns;  T2 
> > > > spawned T2_1 which setns()d to T1_1's ns. There's some
> > > > {handwave} uid mapping, does not matter.
> > > > 
> > > > At restart, it doesn't matter which task originally created the 
> > > > new userns. criu knows T1_1 and T2_1 are in the same userns; 
> > > >  it creates the userns, sets up the mapping, and T1_1 and T2_1
> > > > setns() to it.
> > > 
> > > Given that the simple cases are so easy it probably doesn't 
> > > matter in that sense.
> > > 
> > > However we now have the case where user namespaces own pid 
> > > namespaces, and uts namespaces, and network namespaces, and ipc 
> > > namespaces, and filesystems.  Throw in some mount propagation and 
> > > use of setns and things could get confusing.   It is something 
> > > that will need to be figured out if CRIU is going to properly 
> > > checkpoint containers containing containers containing containers 
> > > containing containers.
> > 
> > It isn't a joke:). We have a few requests to support CR of 
> > containers with Docker containers inside. And we are going to start 
> > this task in a near future, so we would like to have interface to 
> > get dependencies between namespaces too.
> > 
> > BTW: CRIU already supports nested mount namespaces, because systemd
> > creates them for services.
> 
> The tricky part about this and what messes up James's proposed plan is
> that the interface needs to be something that returns a namespace 
> file descriptor.  So we can't print something out in a simple text
> file.

I actually described two problems: the first was how we get the
information in the first place.  Currently the owning or parent user_ns
is tucked inside an opaque structure.  I think we need to move that to
ns_common where it would be the owning userns for all non-user
namespaces and the parent for the userns.

Once we actually have the information, we can also add a set of proc
links, say either

/proc//ns/X-userns

Which might be a bit messy since it doubles the number of files, or
perhaps in a simple directory.

> Well I suppose we could print a device number and inode number pair.
> But then someone would still have to scour processes looking for a 
> user namespace so that is likely less than ideal.

There's no reason any of the proposed methods so far have to be
exclusive: nsfs.c has a lot of flexibility.

> Starting with 4.8 we are also going to need to be able to retrieve 
> the user namespace owner of filesystems.  That will be an interesting
> mix.

This is per mount point, isn't it?  So it can't be in /proc/fs/ and it
would have to be per local mount tree.  Yes, that is a bit nasty.
Sounds like we might need to unfold mount or mountinfo into something
that has one directory per entry?

James

> Eric



Re: [CRIU] Introspecting userns relationships to other namespaces?

2016-07-08 Thread Michael Kerrisk (man-pages)

On 07/08/2016 05:26 AM, James Bottomley wrote:

On Thu, 2016-07-07 at 20:00 -0700, Andrew Vagin wrote:

On Thu, Jul 07, 2016 at 07:16:18PM -0700, Andrew Vagin wrote:

On Thu, Jul 07, 2016 at 12:17:35PM -0700, James Bottomley wrote:

On Thu, 2016-07-07 at 20:21 +0200, Michael Kerrisk (man-pages) wrote:

On 7 July 2016 at 17:01, James Bottomley wrote:

[Serge already answered the parenting issue]

On Thu, 2016-07-07 at 08:36 -0500, Serge E. Hallyn wrote:

Hm.  Probably best-effort based on the process hierarchy.  So
yeah you could probably get a tree into a state that would be
wrongly recreated. Create a new netns, bind mount it, exit;  Have
another task create a new user_ns, bind mount it, exit;  Third
task setns()s first to the new netns then to the new user_ns.  I
suspect criu will recreate that wrongly.


This is a bit pathological, and you have to be root to do it: so
root can set up a nesting hierarchy, bind it and destroy the pids
but I know of no current orchestration system which does this.

Actually, I have to back pedal a bit: the way I currently set up
architecture emulation containers does precisely this: I set up the
namespaces unprivileged with child mount namespaces, but then I ask
root to bind the userns and kill the process that created it so I
have a permanent handle to enter the namespace by, so I suspect
that when our current orchestration systems get more sophisticated,
they might eventually want to do something like this as well.

In theory, we could get nsfs to show this information as an option
(just add a show_options entry to the superblock ops), but the
problem is that although each namespace has a parent user_ns,
there's no way to get it without digging in the namespace specific
structure.  Probably we should restructure to move it into
ns_common, then we could display it (and enforce all namespaces
having owning user_ns) but it would be a


I'm missing something here. Is it not already the case that all
namespaces have an owning user_ns?


Um, yes, I don't believe I said they don't.  The problem I thought you
were having is that there's no way of seeing what it is.

nsfs is the Namespace filesystem where bound namespaces appear to a cat
of /proc/self/mounts.  It can display any information that's in
ns_common (the common core of namespaces) but the owning user_ns
pointer currently isn't in this structure.  Every user namespace has a
pointer to it, but they're all privately embedded in the individual
namespace specific structures.  What I was proposing was that since
every current namespace has a pointer somewhere to the owning user
namespace, we could abstract this out into ns_common so it's now
accessible to be displayed by nsfs, probably as a mount option.


James, I am not sure that I understood you correctly. We have one
file system for all namespace files, so how can we show per-file
properties in mount options? I think we can show all required
information in fdinfo. We open a namespace file (/proc/pid/ns/N) and
then read /proc/pid/fdinfo/X for it.


Here is a proof-of-concept patch.

How it works:

In [1]: import os

In [2]: fd = os.open("/proc/self/ns/pid", os.O_RDONLY)

In [3]: print open("/proc/self/fdinfo/%d" % fd).read()
pos:0
flags:  010
mnt_id: 2
userns: 4026531837

In [4]: print "/proc/self/ns/user -> %s" % os.readlink("/proc/self/ns/user")
/proc/self/ns/user -> user:[4026531837]
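
For anyone trying this on Python 3, the session above could be sketched roughly as below. The `userns:` line only shows up with the proof-of-concept patch applied; on a kernel without it the field is simply absent. The helper names `parse_fdinfo` and `ns_fdinfo` are made up for this sketch and are not part of any existing tool.

```python
import os


def parse_fdinfo(text):
    """Turn the 'key: value' lines of an fdinfo file into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def ns_fdinfo(ns_path="/proc/self/ns/pid"):
    """Open a namespace file and return the parsed fdinfo for that descriptor."""
    fd = os.open(ns_path, os.O_RDONLY)
    try:
        with open("/proc/self/fdinfo/%d" % fd) as f:
            return parse_fdinfo(f.read())
    finally:
        os.close(fd)


# Sample text copied from the session above; "userns" is the field that the
# proof-of-concept patch adds (an assumption on any other kernel).
sample = "pos:\t0\nflags:\t010\nmnt_id:\t2\nuserns:\t4026531837\n"
print(parse_fdinfo(sample).get("userns"))
```

Calling `ns_fdinfo()` on a live system returns the same kind of dict for the caller's PID namespace; use `.get("userns")` so that a missing field is not an error.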


can't you just do

readlink /proc/self/ns/user | sed 's/.*\[\(.*\)\]/\1/'

?

But what Michael was asking about was the parent user_ns of all the
other namespaces ...
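
The readlink-and-sed one-liner can be written in Python along these lines. As noted above, it only yields the caller's own user namespace, not the owner of some other namespace. This is a small sketch; `parse_ns_link` and `ns_inum` are illustrative names.

```python
import os
import re

# Link targets under /proc/<pid>/ns/ look like "user:[4026531837]".
_NS_LINK = re.compile(r"^(?P<type>[a-z_]+):\[(?P<inum>\d+)\]$")


def parse_ns_link(target):
    """Split a link target such as 'user:[4026531837]' into (type, inode)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError("not a namespace link target: %r" % target)
    return match.group("type"), int(match.group("inum"))


def ns_inum(pid="self", ns="user"):
    """Return the inode number that /proc/<pid>/ns/<ns> points at."""
    return parse_ns_link(os.readlink("/proc/%s/ns/%s" % (pid, ns)))[1]


print(parse_ns_link("user:[4026531837]"))
```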


Just to reiterate, what I'm interested in is the introspection use
case (but there are clearly several other interesting use cases here).
The idea is to be able to answer these questions:

1. For each userns, what is the parent of that userns?

2. For each non-user namespace, what is the owning userns?

This enables us to understand the userns hierarchy, which
matters in terms of answering the question: what capabilities
does process X have in namespace Y?
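
Both questions can be answered on later kernels with the nsfs ioctls documented in ioctl_ns(2), NS_GET_USERNS and NS_GET_PARENT (Linux 4.9 and later). They are not available on the kernels discussed in this thread. The following is a minimal sketch only: the request numbers are assumed to be the x86/arm encodings of `_IO(0xb7, 1)` and `_IO(0xb7, 2)`, and `related_ns_inum` and `describe` are illustrative names.

```python
import fcntl
import os

# Assumption: values from linux/nsfs.h as encoded on x86 and arm.
# Architectures that encode _IO() differently need different numbers.
NS_GET_USERNS = 0xB701  # owning user namespace of any namespace
NS_GET_PARENT = 0xB702  # parent of a hierarchical (user or PID) namespace


def related_ns_inum(ns_path, request):
    """Return the inode number of the namespace that `request` yields for
    ns_path, or None if the file cannot be opened or the kernel refuses."""
    try:
        fd = os.open(ns_path, os.O_RDONLY)
    except OSError:
        return None
    try:
        # On success the ioctl returns a new file descriptor.
        rel_fd = fcntl.ioctl(fd, request)
    except OSError:
        # For example: no such ioctl, or the answer is outside our scope.
        return None
    finally:
        os.close(fd)
    try:
        return os.fstat(rel_fd).st_ino
    finally:
        os.close(rel_fd)


def describe(pid="self"):
    """Map each namespace of `pid` to the inode of its parent/owning userns."""
    result = {}
    ns_dir = "/proc/%s/ns" % pid
    for name in sorted(os.listdir(ns_dir)):
        # Question 1 applies to user namespaces, question 2 to all the others.
        request = NS_GET_PARENT if name == "user" else NS_GET_USERNS
        result[name] = related_ns_inum(os.path.join(ns_dir, name), request)
    return result
```

A `None` entry means the kernel would not say, which is expected for a process in the initial user namespace asking for that namespace's parent.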
   
Cheers,


Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: Introspecting userns relationships to other namespaces?

2016-07-08 Thread Michael Kerrisk (man-pages)

On 07/07/2016 09:17 PM, James Bottomley wrote:

On Thu, 2016-07-07 at 20:21 +0200, Michael Kerrisk (man-pages) wrote:

On 7 July 2016 at 17:01, James Bottomley wrote:

[Serge already answered the parenting issue]

On Thu, 2016-07-07 at 08:36 -0500, Serge E. Hallyn wrote:

Hm.  Probably best-effort based on the process hierarchy.  So
yeah you could probably get a tree into a state that would be
wrongly recreated. Create a new netns, bind mount it, exit;  Have
another task create a new user_ns, bind mount it, exit;  Third
task setns()s first to the new netns then to the new user_ns.  I
suspect criu will recreate that wrongly.


This is a bit pathological, and you have to be root to do it: so
root can set up a nesting hierarchy, bind it and destroy the pids
but I know of no current orchestration system which does this.

Actually, I have to back pedal a bit: the way I currently set up
architecture emulation containers does precisely this: I set up the
namespaces unprivileged with child mount namespaces, but then I ask
root to bind the userns and kill the process that created it so I
have a permanent handle to enter the namespace by, so I suspect
that when our current orchestration systems get more sophisticated,
they might eventually want to do something like this as well.

In theory, we could get nsfs to show this information as an option
(just add a show_options entry to the superblock ops), but the
problem is that although each namespace has a parent user_ns,
there's no way to get it without digging in the namespace specific
structure.  Probably we should restructure to move it into
ns_common, then we could display it (and enforce all namespaces
having owning user_ns) but it would be a


I'm missing something here. Is it not already the case that all
namespaces have an owning user_ns?


Um, yes, I don't believe I said they don't.  The problem I thought you
were having is that there's no way of seeing what it is.


Your words "and enforce all namespaces having owning user_ns" were
what left me puzzled--it sounded to me that the implication was
that this is not "enforced" right now.

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [CRIU] Introspecting userns relationships to other namespaces?

2016-07-08 Thread Eric W. Biederman
Andrew Vagin  writes:

> On Wed, Jul 06, 2016 at 10:46:33AM -0500, Eric W. Biederman wrote:
>> "Serge E. Hallyn"  writes:
>> 
>> > On Wed, Jul 06, 2016 at 10:41:48AM +0200, Michael Kerrisk (man-pages) wrote:
>> >> [Rats! Doing now what I should have done to start with. Looping some
>> >> lists and CRIU and other possibly relevant people into this
>> >> conversation]
>> >> 
>> >> Hi Eric,
>> >> 
>> >> On 5 July 2016 at 23:47, Eric W. Biederman  wrote:
>> >> > "Michael Kerrisk (man-pages)"  writes:
>> >> >
>> >> >> Hi Eric,
>> >> >>
>> >> >> I have a question. Is there any way currently to discover which
>> >> >> user namespace a particular nonuser namespace is governed by?
>> >> >> Maybe I am missing something, but there does not seem to be a
>> >> >> way to do this. Also, can one discover which userns is the
>> >> >> parent of a given userns? Again, I can't see a way to do this.
>> >> >>
>> >> >> The point here is introspecting so that a process might determine
>> >> >> what its capabilities are when operating on some resource governed
>> >> >> by a (nonuser) namespace.
>> >> >
>> >> > To the best of my knowledge there is not an interface to get that
>> >> > information.  It would be good to have such an interface for no other
>> >> > reason than the CRIU folks are going to need it at some point.  I am a
>> >> > bit surprised they have not complained yet.
>> >
>> > I don't think they need it.  They do in fact have what they need.  Assume
>> > you have tasks T1, T2, T1_1 and T2_1;  T1 and T2 are in init_user_ns;  T1
>> > spawned T1_1 in a new userns;  T2 spawned T2_1 which setns()d to T1_1's ns.
>> > There's some {handwave} uid mapping, does not matter.
>> >
>> > At restart, it doesn't matter which task originally created the new userns.
>> > criu knows T1_1 and T2_1 are in the same userns;  it creates the userns,
>> > sets up the mapping, and T1_1 and T2_1 setns() to it.
>> 
>> Given that the simple cases are so easy it probably doesn't matter in
>> that sense.
>> 
>> However we now have the case where user namespaces own pid namespaces,
>> and uts namespaces, and network namespaces, and ipc namespaces, and
>> filesystems.  Throw in some mount propagation and use of setns and
>> things could get confusing.   It is something that will need to be
>> figured out if CRIU is going to properly checkpoint containers
>> containing containers containing containers containing containers.
>
> It isn't a joke:). We have a few requests to support CR of containers with
> Docker containers inside. And we are going to start this task in a near
> future, so we would like to have interface to get dependencies between
> namespaces too.
>
> BTW: CRIU already supports nested mount namespaces, because systemd
> creates them for services.

The tricky part about this, and what messes up James's proposed plan, is
that the interface needs to be something that returns a namespace file
descriptor.  So we can't print something out in a simple text file.
Well, I suppose we could print a device number and inode number pair.
But then someone would still have to scour processes looking for a user
namespace, so that is likely less than ideal.
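
To make the "scour processes" cost concrete, here is a rough sketch of what userspace would have to do with only a device/inode pair in hand. The helper names `ns_id` and `find_holders` are invented for illustration, and the scan only sees processes visible in the procfs it is given.

```python
import os


def ns_id(path):
    """Return the (device, inode) pair identifying the namespace behind path."""
    st = os.stat(path)
    return st.st_dev, st.st_ino


def find_holders(target, ns="user", proc="/proc"):
    """Scan every visible process for one whose `ns` namespace matches target."""
    holders = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            if ns_id(os.path.join(proc, entry, "ns", ns)) == target:
                holders.append(int(entry))
        except OSError:
            # Process exited, or we lack permission to inspect it.
            continue
    return sorted(holders)
```

Something like `find_holders(ns_id("/proc/self/ns/user"))` walks every PID directory, and it still finds nothing for a namespace that is kept alive only by a bind mount or an open file descriptor, which is the gap a file-descriptor-returning interface would close.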

Starting with 4.8 we are also going to need to be able to retrieve the
user namespace owner of filesystems.  That will be an interesting mix.

Eric



Re: Introspecting userns relationships to other namespaces?

2016-07-08 Thread W. Trevor King
On Thu, Jul 07, 2016 at 11:54:54PM -0700, Andrew Vagin wrote:
> On Thu, Jul 07, 2016 at 10:26:50PM -0700, W. Trevor King wrote:
> > On Thu, Jul 07, 2016 at 08:26:47PM -0700, James Bottomley wrote:
> > > On Thu, 2016-07-07 at 20:00 -0700, Andrew Vagin wrote:
> > > > On Thu, Jul 07, 2016 at 07:16:18PM -0700, Andrew Vagin wrote:
> > > > > I think we can show all required information in fdinfo. We open
> > > > > a namespaces file (/proc/pid/ns/N) and then read
> > > > > /proc/pid/fdinfo/X for it.
> > > > 
> > > > Here is a proof-of-concept patch.
> > > > …
> > > > In [2]: fd = os.open("/proc/self/ns/pid", os.O_RDONLY)
> > > > 
> > > > In [3]: print open("/proc/self/fdinfo/%d" % fd).read()
> > > > pos:0
> > > > flags:  010
> > > > mnt_id: 2
> > > > userns: 4026531837
> > > > 
> > > > In [4]: print "/proc/self/ns/user -> %s" % os.readlink("/proc/self/ns/user")
> > > > /proc/self/ns/user -> user:[4026531837]
> > > 
> > > can't you just do
> > > 
> > > readlink /proc/self/ns/user | sed 's/.*\[\(.*\)\]/\1/'
> > …
> > If you only put one level in fdinfo, you're stuck if one of the
> > namespaces involved has neither bind mounts nor a PID to give you
> > handle on it [1].  And if you want to put that whole ancestor tree in
> > fdinfo, you have to come up with some way to handle the two-parent
> > branching.
> 
> I think it's a bad idea to draw a tree in fdinfo. Why do we want to know
> this hierarchy? Probably we will want to access these namespaces (setns);
> in that case we need to have a way to open them.
> 
> Maybe we need to extend the functionality of the nsfs filesystem
> (something like /proc/PID for namespaces)?

A similar idea came up during the PID-translation brainstorming [1],
but I'm not sure if anything ever came of that.  Once you're dealing
with a separate pseudo-filesystem, it seems easier to decouple it from
proc and just make a mountable namespace-hierarchy filesystem (like we
have mountable cgroup hierarchy filesystems).  That also gets you an
opt-in playground while the details of the nsfs filesystem view are
worked out.  Are you imagining something like:

  $ tree .
  .
  ├── mnt{inum}
  │   └── user -> ../user{inum}
  ├── pid{inum}
  │   ├── pid{inum}
  │   │   └── user -> ../../user{inum}/user{inum}
  │   └── user -> ../user{inum}
  └── user{inum}
      └── user{inum}

Cheers,
Trevor

[1]: http://thread.gmane.org/gmane.linux.kernel.containers/28105/focus=28164
 Subject: RE: [RFC]Pid conversion between pid namespace
 Date: Fri, 25 Jul 2014 10:01:45 +
 Message-ID: 
<5871495633F38949900D2BF2DC04883E56C7A2@G08CNEXMBPEKD02.g08.fujitsu.local>



Re: Introspecting userns relationships to other namespaces?

2016-07-08 Thread Andrew Vagin
On Thu, Jul 07, 2016 at 10:26:50PM -0700, W. Trevor King wrote:
> On Thu, Jul 07, 2016 at 08:26:47PM -0700, James Bottomley wrote:
> > On Thu, 2016-07-07 at 20:00 -0700, Andrew Vagin wrote:
> > > On Thu, Jul 07, 2016 at 07:16:18PM -0700, Andrew Vagin wrote:
> > > > I think we can show all required information in fdinfo. We open
> > > > a namespaces file (/proc/pid/ns/N) and then read
> > > > /proc/pid/fdinfo/X for it.
> > > 
> > > Here is a proof-of-concept patch.
> > > …
> > > In [2]: fd = os.open("/proc/self/ns/pid", os.O_RDONLY)
> > > 
> > > In [3]: print open("/proc/self/fdinfo/%d" % fd).read()
> > > pos:    0
> > > flags:  010
> > > mnt_id: 2
> > > userns: 4026531837
> > > 
> > > In [4]: print "/proc/self/ns/user -> %s" %
> > > os.readlink("/proc/self/ns/user")
> > > /proc/self/ns/user -> user:[4026531837]
> > 
> > can't you just do
> > 
> > readlink /proc/self/ns/user | sed 's/.*\[\(.*\)\]/\1/'
> 
> With Andrew's fdinfo approach you know the user namespace owning
> /proc/self/ns/pid is 4026531837.  That happens to be
> /proc/self/ns/user in this case, but doesn't have to be in general.
> 
> > But what Michael was asking about was the parent user_ns of all the
> > other namespaces ... I don't think there's any way we can get that
> > out of any information in /proc/self/
> 
> If fdinfo only shows immediate parents, you'd need to walk the tree to
> get back to the root.  And at each layer of the PID namespace tree
> there will be another user-namespace parent branching off.  With a
> tree like:
> 
>   Namespace         | Parent       | Owning userns
>   ------------------+--------------+------------------
>   Root userns       | -            | -
>   Root PID ns       | -            | Root userns
>   Child userns      | Root userns  | Root userns
>   Child PID ns      | Root PID ns  | Root userns
>   Grandchild userns | Child userns | Child userns
>   Grandchild PID ns | Child PID ns | Grandchild userns
> 
> Walking from the grandchild PID namespace would give you:
> 
>   Grandchild PID ns
>   |-- Child PID ns
>   |   |-- Root PID ns
>   |   `-- Root userns
>   `-- Grandchild userns
>       `-- Child userns
>           `-- Root userns
> 
> If you only put one level in fdinfo, you're stuck if one of the
> namespaces involved has neither bind mounts nor a PID to give you a
> handle on it [1].  And if you want to put that whole ancestor tree in
> fdinfo, you have to come up with some way to handle the two-parent
> branching.

I think it's a bad idea to draw a tree in fdinfo. Why do we want to know
this hierarchy? Probably we will want to access these namespaces (setns);
in that case we need to have a way to open them.

Maybe we need to extend the functionality of the nsfs filesystem
(something like /proc/PID for namespaces)?

> 
> I'm also not sure how exposing nsfs information [2] would handle
> namespaces that had neither a surviving bind mount nor a direct
> process.
> 
> If all the information is available (possible after a mechanical patch
> [3] makes it more accessible), then it seems easier to put it in a
> separate /proc or /sys file.  There was a stab at this for PID
> namespaces in [4] (the same series that landed NStgid, etc.) with
> additional background and alternative approaches in [5].  There were
> problems with that patch (and it was trying to do more by also listing
> a process's ID in each PID namespace), but the “let's put the whole
> tree in a new file” approach seems sound to me.
> 
> Cheers,
> Trevor
> 
> [1]: http://thread.gmane.org/gmane.linux.kernel.containers/30456/focus=20536
>  Subject: Re: Introspecting userns relationships to other namespaces?
>  Date: Thu, 7 Jul 2016 13:24:42 -0500
>  Message-ID: <20160707182442.ga6...@mail.hallyn.com>
> [2]: http://thread.gmane.org/gmane.linux.kernel.containers/30456/focus=30499
>  Subject: Re: [CRIU] Introspecting userns relationships to other 
> namespaces?
>  Date: Thu, 07 Jul 2016 20:20:05 -0700
>  Message-ID: <1467948005.2322.84.ca...@hansenpartnership.com>
> [3]: http://thread.gmane.org/gmane.linux.kernel.containers/30456/focus=20537
>  Subject: Re: Introspecting userns relationships to other namespaces?
>  Message-ID: <1467903712.2347.16.ca...@hansenpartnership.com>
>  Date: Thu, 07 Jul 2016 08:01:52 -0700
> [4]: http://thread.gmane.org/gmane.linux.kernel.containers/28925/focus=28928
>  Subject: [resend][PATCH v9 1/3] procfs: show hierarchy of pid namespace
>  Date: Tue, 23 Dec 2014 18:20:37 +0800

Re: Introspecting userns relationships to other namespaces?

2016-07-08 Thread W. Trevor King
On Thu, Jul 07, 2016 at 10:26:50PM -0700, W. Trevor King wrote:
> And if you want to put that whole ancestor tree in fdinfo, you have
> to come up with some way to handle the two-parent branching.

Going towards the roots is nice, because you know a given namespace
will only have two parents, but it leaks information about the system
into the container.  It's probably better to follow the NStgid,
etc. example and only walk toward the leaves.  So a (privileged?)
process in the root namespace could see the whole tree, while a
process in a non-root namespace could only see its own namespaces and
their descendants.  In situations where you were part of a namespace
that belonged to an external user namespace (e.g. you nsenter a child
user namespace but are still in the root PID namespace), you'd want an
“unknown” entry for the parent you couldn't see.
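
As a toy model of that leafward-only view (purely illustrative; the
namespace names, the PARENTS map and both helper functions are
hypothetical and do not correspond to any kernel interface), something
like this captures the intent:

```python
UNKNOWN = "unknown"

# Hypothetical parent map for user namespaces: child -> parent.
PARENTS = {
    "root userns": None,
    "child userns": "root userns",
    "grandchild userns": "child userns",
}


def visible_subtree(parents, viewer):
    """Return the namespaces `viewer` may see: itself and its descendants."""
    visible = {viewer}
    changed = True
    while changed:
        changed = False
        for child, parent in parents.items():
            if parent in visible and child not in visible:
                visible.add(child)
                changed = True
    return visible


def reported_parent(parents, viewer, ns):
    """Parent of `ns` as reported to `viewer`, masking anything rootward."""
    parent = parents[ns]
    if parent is None:
        return None  # a root namespace has no parent to report
    if parent in visible_subtree(parents, viewer):
        return parent
    return UNKNOWN
```

A viewer in "child userns" would see only itself and "grandchild
userns", and asking for the parent of "child userns" would report
“unknown” rather than leaking the root.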

Cheers,
Trevor



Re: [CRIU] Introspecting userns relationships to other namespaces?

2016-07-08 Thread James Bottomley
On Thu, 2016-07-07 at 22:41 -0700, Andrei Vagin wrote:
> On Thu, Jul 7, 2016 at 8:26 PM, James Bottomley
>  wrote:
> > On Thu, 2016-07-07 at 20:00 -0700, Andrew Vagin wrote:
> > > On Thu, Jul 07, 2016 at 07:16:18PM -0700, Andrew Vagin wrote:
> > > > On Thu, Jul 07, 2016 at 12:17:35PM -0700, James Bottomley
> > > > wrote:
> > > > > On Thu, 2016-07-07 at 20:21 +0200, Michael Kerrisk (man
> > > > > -pages)
> > > > > wrote:
> > > > > > On 7 July 2016 at 17:01, James Bottomley
> > > > > >  wrote:
> > > > > [Serge already answered the parenting issue]
> > > > > > > On Thu, 2016-07-07 at 08:36 -0500, Serge E. Hallyn wrote:
> > > > > > > > Hm.  Probably best-effort based on the process
> > > > > > > > hierarchy.
> > > > > > > >  So
> > > > > > > > yeah you could probably get a tree into a state that
> > > > > > > > would
> > > > > > > > be
> > > > > > > > wrongly recreated. Create a new netns, bind mount it,
> > > > > > > > exit;
> > > > > > > >   Have
> > > > > > > > another task create a new user_ns, bind mount it, exit;
> > > > > > > >  Third
> > > > > > > > task setns()s first to the new netns then to the new
> > > > > > > > user_ns.  I
> > > > > > > > suspect criu will recreate that wrongly.
> > > > > > > 
> > > > > > > This is a bit pathological, and you have to be root to do
> > > > > > > it:
> > > > > > > so
> > > > > > > root can set up a nesting hierarchy, bind it and destroy
> > > > > > > the
> > > > > > > pids
> > > > > > > but I know of no current orchestration system which does
> > > > > > > this.
> > > > > > > 
> > > > > > > Actually, I have to back pedal a bit: the way I currently
> > > > > > > set
> > > > > > > up
> > > > > > > architecture emulation containers does precisely this: I
> > > > > > > set
> > > > > > > up the
> > > > > > > namespaces unprivileged with child mount namespaces, but
> > > > > > > then
> > > > > > > I ask
> > > > > > > root to bind the userns and kill the process that created
> > > > > > > it
> > > > > > > so I
> > > > > > > have a permanent handle to enter the namespace by, so I
> > > > > > > suspect
> > > > > > > that when our current orchestration systems get more
> > > > > > > sophisticated,
> > > > > > > they might eventually want to do something like this as
> > > > > > > well.
> > > > > > > 
> > > > > > > In theory, we could get nsfs to show this information as
> > > > > > > an
> > > > > > > option
> > > > > > > (just add a show_options entry to the superblock ops),
> > > > > > > but
> > > > > > > the
> > > > > > > problem is that although each namespace has a parent
> > > > > > > user_ns,
> > > > > > > there's no way to get it without digging in the namespace
> > > > > > > specific
> > > > > > > structure.  Probably we should restructure to move it
> > > > > > > into
> > > > > > > ns_common, then we could display it (and enforce all
> > > > > > > namespaces
> > > > > > > having owning user_ns) but it would be a
> > > > > > 
> > > > > > I'm missing something here. Is it not already the case that
> > > > > > all
> > > > > > namespaces have an owning user_ns?
> > > > > 
> > > > > Um, yes, I don't believe I said they don't.  The problem I
> > > > > thought you
> > > > > were having is that there's no way of seeing what it is.
> > > > > 
> > > > > > nsfs is the Namespace filesystem where bound namespaces appear
> > > > > to
> > > > > a cat
> > > > > of /proc/self/mounts.  It can display any information that's
> > > > > in
> > > > > ns_common (the common core of namespaces) but the owning
> > > > > user_ns
> > > > > pointer currently isn't in this structure.  Every user
> > > > > namespace
> > > > > has a
> > > > > pointer to it, but they're all privately embedded in the
> > > > > individual
> > > > > namespace specific structures.  What I was proposing was that
> > > > > since
> > > > > every current namespace has a pointer somewhere to the owning
> > > > > user
> > > > > namespace, we could abstract this out into ns_common so it's
> > > > > now
> > > > > accessible to be displayed by nsfs, probably as a mount
> > > > > option.
> > > > 
> > > > James, I am not sure that I understood you correctly. We have
> > > > one
> > > > file system for all namespace files, how we can show per-file
> > > > properties
> > > > in mount options. I think we can show all required information
> > > > in
> > > > fdinfo. We open a namespaces file (/proc/pid/ns/N) and then
> > > > read
> > > > /proc/pid/fdinfo/X for it.
> > > 
> > > Here is a proof-of-concept patch.
> > > 
> > > How it works:
> > > 
> > > In [1]: import os
> > > 
> > > In [2]: fd = os.open("/proc/self/ns/pid", os.O_RDONLY)
> > > 
> > > In [3]: print open("/proc/self/fdinfo/%d" % fd).read()
> > > > pos:    0
> > > > flags:  010
> > > > mnt_id: 2
> > > > userns: 4026531837
> > > 
> > > In [4]: print "/proc/self/ns/user -> %s" %
> > > os.readlink("/proc/self/ns/user")
> > > /proc/self/ns/user -> user:[4026531837]
> > 
> > can't you just do
> > 
> > readlink /proc/self/ns/user | sed 's/.*\[\(.*\)\]/\1/'
> 
> We can get 

Re: [CRIU] Introspecting userns relationships to other namespaces?

2016-07-08 Thread Andrew Vagin
On Thu, Jul 07, 2016 at 08:20:05PM -0700, James Bottomley wrote:
> On Thu, 2016-07-07 at 19:16 -0700, Andrew Vagin wrote:
> > On Thu, Jul 07, 2016 at 12:17:35PM -0700, James Bottomley wrote:
> > > On Thu, 2016-07-07 at 20:21 +0200, Michael Kerrisk (man-pages)
> > > wrote:
> > > > On 7 July 2016 at 17:01, James Bottomley
> > > >  wrote:
> > > [Serge already answered the parenting issue]
> > > > > On Thu, 2016-07-07 at 08:36 -0500, Serge E. Hallyn wrote:
> > > > > > Hm.  Probably best-effort based on the process hierarchy.  So
> > > > > > yeah you could probably get a tree into a state that would be
> > > > > > wrongly recreated. Create a new netns, bind mount it, exit; 
> > > > > >  Have 
> > > > > > another task create a new user_ns, bind mount it, exit; 
> > > > > >  Third 
> > > > > > task setns()s first to the new netns then to the new user_ns.
> > > > > >   I 
> > > > > > suspect criu will recreate that wrongly.
> > > > > 
> > > > > This is a bit pathological, and you have to be root to do it:
> > > > > so 
> > > > > root can set up a nesting hierarchy, bind it and destroy the
> > > > > pids 
> > > > > but I know of no current orchestration system which does this.
> > > > > 
> > > > > Actually, I have to back pedal a bit: the way I currently set
> > > > > up
> > > > > architecture emulation containers does precisely this: I set up
> > > > > the
> > > > > namespaces unprivileged with child mount namespaces, but then I
> > > > > ask
> > > > > root to bind the userns and kill the process that created it so
> > > > > I 
> > > > > have a permanent handle to enter the namespace by, so I suspect
> > > > > that when our current orchestration systems get more
> > > > > sophisticated, 
> > > > > they might eventually want to do something like this as well.
> > > > > 
> > > > > In theory, we could get nsfs to show this information as an
> > > > > option
> > > > > (just add a show_options entry to the superblock ops), but the 
> > > > > problem is that although each namespace has a parent user_ns, 
> > > > > there's no way to get it without digging in the namespace
> > > > > specific 
> > > > > structure.  Probably we should restructure to move it into 
> > > > > ns_common, then we could display it (and enforce all namespaces
> > > > > having owning user_ns) but it would be a
> > > > 
> > > > I'm missing something here. Is it not already the case that all
> > > > namespaces have an owning user_ns?
> > > 
> > > Um, yes, I don't believe I said they don't.  The problem I thought
> > > you
> > > were having is that there's no way of seeing what it is.
> > > 
> > > nsfs is the Namespace filesystem where bound namespaces appear to a
> > > cat
> > > of /proc/self/mounts.  It can display any information that's in
> > > ns_common (the common core of namespaces) but the owning user_ns
> > > pointer currently isn't in this structure.  Every user namespace
> > > has a
> > > pointer to it, but they're all privately embedded in the individual
> > > namespace specific structures.  What I was proposing was that since
> > > every current namespace has a pointer somewhere to the owning user
> > > namespace, we could abstract this out into ns_common so it's now
> > > accessible to be displayed by nsfs, probably as a mount option.
> > 
> > James, I am not sure that I understood you correctly. We have one
> > file system for all namespace files, how we can show per-file 
> > properties in mount options.
> 
> We have two ways of getting information.  For a namespace that only
> exists as a bind mount we only have what the mount/mountinfo shows, so
> you see something like this:
> 
> jejb@jarvis:~> mount|grep nsfs
> nsfs on /run/build-container/userns type nsfs (rw)
> nsfs on /run/build-container/ppc64 type nsfs (rw)
> 
> the (rw) are the mount options.  We could add the ability to add other
> mount options to this via the superblock .show_options callback.  We
> could make it show the type and parent user namespace.

Yes, we could. But that only works for bind-mounted ns files, whereas
fdinfo works for any ns file (e.g. /proc/PID/ns/X).
fdinfo shows information about a single namespace, while
/proc/pid/mountinfo shows information about all mounts, so fdinfo is
faster and easier to parse.
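
As a rough illustration of what the mountinfo route involves (a sketch,
not code from this thread; the function name nsfs_mounts is made up),
every line of /proc/pid/mountinfo has to be scanned and filtered for
nsfs entries:

```python
def nsfs_mounts(mountinfo_text):
    """Yield (mount_point, root) for every nsfs entry in mountinfo text.

    Each line has the form
        ID PARENT MAJ:MIN ROOT MOUNTPOINT OPTS [optional...] - FSTYPE SRC SUPEROPTS
    and for nsfs the ROOT field looks like "user:[<inode>]".
    """
    for line in mountinfo_text.splitlines():
        head, sep, tail = line.partition(" - ")
        if not sep:
            continue  # not a well-formed mountinfo line
        fields = head.split()
        tail_fields = tail.split()
        if len(fields) < 5 or not tail_fields:
            continue
        if tail_fields[0] == "nsfs":
            # fields[4] is the mount point, fields[3] the namespace it pins.
            yield fields[4], fields[3]
```

This only ever finds bind-mounted namespaces, and it says nothing about
the owning user namespace unless show_options is extended as proposed.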

> 
> >  I think we can show all required information in fdinfo. We open a
> > namespaces file (/proc/pid/ns/N) and then read /proc/pid/fdinfo/X for
> > it.
> 
> Not if we don't have an extant process in the namespace, we can't use
> these files because they don't exist, plus fdinfo on the
> /proc//ns/X doesn't tell you what the parent user_ns of X is
> (again, we could add this information somewhere ... not sure where
> yet).

We can read fdinfo for any ns file.

For example,

fd = open("/run/build-container/userns", O_PATH);

then read fdinfo for this "fd" (/proc/self/fdinfo/[fd])
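
Spelled out in Python, the same idea might look like the sketch below.
The helper names parse_fdinfo and read_ns_fdinfo are made up for this
example, and the "userns" field only exists with the proof-of-concept
patch applied; on an unpatched kernel the entry would carry just the
generic fields (pos, flags, mnt_id).

```python
import os


def parse_fdinfo(text):
    """Turn "key:<whitespace>value" lines into a dict of strings."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def read_ns_fdinfo(path):
    """Open `path` with O_PATH and return its parsed fdinfo entry.

    `path` may be a /proc/PID/ns/X file or a bind-mounted ns file.
    """
    fd = os.open(path, os.O_PATH)
    try:
        with open("/proc/self/fdinfo/%d" % fd) as f:
            return parse_fdinfo(f.read())
    finally:
        os.close(fd)
```

With the patch, read_ns_fdinfo("/run/build-container/userns") would be
expected to include a "userns" key naming the owning user namespace.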

Thanks,
Andrew

> 
> James
> 


Re: [CRIU] Introspecting userns relationships to other namespaces?

2016-07-07 Thread Andrew Vagin
On Thu, Jul 07, 2016 at 12:17:35PM -0700, James Bottomley wrote:
> On Thu, 2016-07-07 at 20:21 +0200, Michael Kerrisk (man-pages) wrote:
> > On 7 July 2016 at 17:01, James Bottomley
> >  wrote:
> [Serge already answered the parenting issue]
> > > On Thu, 2016-07-07 at 08:36 -0500, Serge E. Hallyn wrote:
> > > > Hm.  Probably best-effort based on the process hierarchy.  So 
> > > > yeah you could probably get a tree into a state that would be 
> > > > wrongly recreated. Create a new netns, bind mount it, exit;  Have 
> > > > another task create a new user_ns, bind mount it, exit;  Third 
> > > > task setns()s first to the new netns then to the new user_ns.  I 
> > > > suspect criu will recreate that wrongly.
> > > 
> > > This is a bit pathological, and you have to be root to do it: so 
> > > root can set up a nesting hierarchy, bind it and destroy the pids 
> > > but I know of no current orchestration system which does this.
> > > 
> > > Actually, I have to back pedal a bit: the way I currently set up
> > > architecture emulation containers does precisely this: I set up the
> > > namespaces unprivileged with child mount namespaces, but then I ask
> > > root to bind the userns and kill the process that created it so I 
> > > have a permanent handle to enter the namespace by, so I suspect 
> > > that when our current orchestration systems get more sophisticated, 
> > > they might eventually want to do something like this as well.
> > > 
> > > In theory, we could get nsfs to show this information as an option
> > > (just add a show_options entry to the superblock ops), but the 
> > > problem is that although each namespace has a parent user_ns, 
> > > there's no way to get it without digging in the namespace specific 
> > > structure.  Probably we should restructure to move it into 
> > > ns_common, then we could display it (and enforce all namespaces 
> > > having owning user_ns) but it would be a
> > 
> > I'm missing something here. Is it not already the case that all
> > namespaces have an owning user_ns?
> 
> Um, yes, I don't believe I said they don't.  The problem I thought you
> were having is that there's no way of seeing what it is.
> 
> nsfs is the Namespace filesystem where bound namespaces appear to a cat
> of /proc/self/mounts.  It can display any information that's in
> ns_common (the common core of namespaces) but the owning user_ns
> pointer currently isn't in this structure.  Every user namespace has a
> pointer to it, but they're all privately embedded in the individual
> namespace specific structures.  What I was proposing was that since
> every current namespace has a pointer somewhere to the owning user
> namespace, we could abstract this out into ns_common so it's now
> accessible to be displayed by nsfs, probably as a mount option.

James, I am not sure that I understood you correctly. We have one
file system for all namespace files, so how can we show per-file
properties in mount options? I think we can show all the required
information in fdinfo: we open a namespace file (/proc/pid/ns/N) and
then read /proc/pid/fdinfo/X for it.
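
The "one file system" point can be checked from userspace. The Python 3
sketch below uses hypothetical helper names and rests on the assumption
that nsfs is a single internal mount, so every namespace file reports
the same st_dev and differs only in st_ino. If that holds, a mount-wide
option has nowhere to carry a per-namespace property.

```python
import os


def ns_identities(pid="self", proc_root="/proc"):
    """Map each namespace name of a process to (st_dev, st_ino)."""
    ns_dir = os.path.join(proc_root, str(pid), "ns")
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        # os.stat() follows the magic link to the nsfs inode.
        st = os.stat(os.path.join(ns_dir, name))
        result[name] = (st.st_dev, st.st_ino)
    return result


def shares_one_filesystem(identities):
    """True if all namespace files sit on the same device."""
    return len({dev for dev, _ino in identities.values()}) <= 1
```

On a live system, shares_one_filesystem(ns_identities()) is expected to
be true under the assumption above, while the inode numbers tell the
individual namespaces apart.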

> 
> James
> 
> 
> ___
> CRIU mailing list
> c...@openvz.org
> https://lists.openvz.org/mailman/listinfo/criu


Re: [CRIU] Introspecting userns relationships to other namespaces?

2016-07-07 Thread Andrei Vagin
On Thu, Jul 7, 2016 at 10:41 PM, Andrei Vagin  wrote:
> On Thu, Jul 7, 2016 at 8:26 PM, James Bottomley
>  wrote:
>> On Thu, 2016-07-07 at 20:00 -0700, Andrew Vagin wrote:
>>> On Thu, Jul 07, 2016 at 07:16:18PM -0700, Andrew Vagin wrote:
>>> > On Thu, Jul 07, 2016 at 12:17:35PM -0700, James Bottomley wrote:
>>> > > On Thu, 2016-07-07 at 20:21 +0200, Michael Kerrisk (man-pages)
>>> > > wrote:
>>> > > > On 7 July 2016 at 17:01, James Bottomley
>>> > > >  wrote:
>>> > > [Serge already answered the parenting issue]
>>> > > > > On Thu, 2016-07-07 at 08:36 -0500, Serge E. Hallyn wrote:
>>> > > > > > Hm.  Probably best-effort based on the process hierarchy.
>>> > > > > >  So
>>> > > > > > yeah you could probably get a tree into a state that would
>>> > > > > > be
>>> > > > > > wrongly recreated. Create a new netns, bind mount it, exit;
>>> > > > > >   Have
>>> > > > > > another task create a new user_ns, bind mount it, exit;
>>> > > > > >  Third
>>> > > > > > task setns()s first to the new netns then to the new
>>> > > > > > user_ns.  I
>>> > > > > > suspect criu will recreate that wrongly.
>>> > > > >
>>> > > > > This is a bit pathological, and you have to be root to do it:
>>> > > > > so
>>> > > > > root can set up a nesting hierarchy, bind it and destroy the
>>> > > > > pids
>>> > > > > but I know of no current orchestration system which does
>>> > > > > this.
>>> > > > >
>>> > > > > Actually, I have to back pedal a bit: the way I currently set
>>> > > > > up
>>> > > > > architecture emulation containers does precisely this: I set
>>> > > > > up the
>>> > > > > namespaces unprivileged with child mount namespaces, but then
>>> > > > > I ask
>>> > > > > root to bind the userns and kill the process that created it
>>> > > > > so I
>>> > > > > have a permanent handle to enter the namespace by, so I
>>> > > > > suspect
>>> > > > > that when our current orchestration systems get more
>>> > > > > sophisticated,
>>> > > > > they might eventually want to do something like this as well.
>>> > > > >
>>> > > > > In theory, we could get nsfs to show this information as an
>>> > > > > option
>>> > > > > (just add a show_options entry to the superblock ops), but
>>> > > > > the
>>> > > > > problem is that although each namespace has a parent user_ns,
>>> > > > > there's no way to get it without digging in the namespace
>>> > > > > specific
>>> > > > > structure.  Probably we should restructure to move it into
>>> > > > > ns_common, then we could display it (and enforce all
>>> > > > > namespaces
>>> > > > > having owning user_ns) but it would be a
>>> > > >
>>> > > > I'm missing something here. Is it not already the case that all
>>> > > > namespaces have an owning user_ns?
>>> > >
>>> > > Um, yes, I don't believe I said they don't.  The problem I
>>> > > thought you
>>> > > were having is that there's no way of seeing what it is.
>>> > >
>>> > > nsfs is the Namespace filesystem where bound namespaces appear to
>>> > > a cat
>>> > > of /proc/self/mounts.  It can display any information that's in
>>> > > ns_common (the common core of namespaces) but the owning user_ns
>>> > > pointer currently isn't in this structure.  Every user namespace
>>> > > has a
>>> > > pointer to it, but they're all privately embedded in the
>>> > > individual
>>> > > namespace specific structures.  What I was proposing was that
>>> > > since
>>> > > every current namespace has a pointer somewhere to the owning
>>> > > user
>>> > > namespace, we could abstract this out into ns_common so it's now
>>> > > accessible to be displayed by nsfs, probably as a mount option.
>>> >
>>> > James, I am not sure that I understood you correctly. We have one
>>> > file system for all namespace files, how we can show per-file
>>> > properties
>>> > in mount options. I think we can show all required information in
>>> > fdinfo. We open a namespaces file (/proc/pid/ns/N) and then read
>>> > /proc/pid/fdinfo/X for it.
>>>
>>> Here is a proof-of-concept patch.
>>>
>>> How it works:
>>>
>>> In [1]: import os
>>>
>>> In [2]: fd = os.open("/proc/self/ns/pid", os.O_RDONLY)
>>>
>>> In [3]: print open("/proc/self/fdinfo/%d" % fd).read()
>>> pos:    0
>>> flags:  010
>>> mnt_id: 2
>>> userns: 4026531837
>>>
>>> In [4]: print "/proc/self/ns/user -> %s" % os.readlink("/proc/self/ns/user")
>>> /proc/self/ns/user -> user:[4026531837]
>>
>> can't you just do
>>
>> readlink /proc/self/ns/user | sed 's/.*\[\(.*\)\]/\1/'
>
> We can get fdinfo for any ns file. I used /proc/self/ns/pid as an example.
>
> Look at another example:
>
> [root@fc22-vm ~]# cat /proc/self/mountinfo | grep pid_ns_file
> 115 38 0:3 pid:[4026532306] /tmp/pid_ns_file rw shared:67 - nsfs nsfs rw
>

Sorry, I forgot to say that fd is a file descriptor for /tmp/pid_ns_file

In [2]: fd = os.open("/tmp/pid_ns_file", os.O_RDONLY)
In [3]: fd
Out[3]: 5

> In [4]: print open("/proc/self/fdinfo/5").read()
> pos: 0
> flags: 010
> mnt_id: 115
> userns: 4026532305
>
>
> In [5]: 

Re: [CRIU] Introspecting userns relationships to other namespaces?

2016-07-07 Thread Andrei Vagin
On Thu, Jul 7, 2016 at 8:26 PM, James Bottomley
 wrote:
> On Thu, 2016-07-07 at 20:00 -0700, Andrew Vagin wrote:
>> On Thu, Jul 07, 2016 at 07:16:18PM -0700, Andrew Vagin wrote:
>> > On Thu, Jul 07, 2016 at 12:17:35PM -0700, James Bottomley wrote:
>> > > On Thu, 2016-07-07 at 20:21 +0200, Michael Kerrisk (man-pages)
>> > > wrote:
>> > > > On 7 July 2016 at 17:01, James Bottomley
>> > > >  wrote:
>> > > [Serge already answered the parenting issue]
>> > > > > On Thu, 2016-07-07 at 08:36 -0500, Serge E. Hallyn wrote:
>> > > > > > Hm.  Probably best-effort based on the process hierarchy.
>> > > > > >  So
>> > > > > > yeah you could probably get a tree into a state that would
>> > > > > > be
>> > > > > > wrongly recreated. Create a new netns, bind mount it, exit;
>> > > > > >   Have
>> > > > > > another task create a new user_ns, bind mount it, exit;
>> > > > > >  Third
>> > > > > > task setns()s first to the new netns then to the new
>> > > > > > user_ns.  I
>> > > > > > suspect criu will recreate that wrongly.
>> > > > >
>> > > > > This is a bit pathological, and you have to be root to do it:
>> > > > > so
>> > > > > root can set up a nesting hierarchy, bind it and destroy the
>> > > > > pids
>> > > > > but I know of no current orchestration system which does
>> > > > > this.
>> > > > >
>> > > > > Actually, I have to back pedal a bit: the way I currently set
>> > > > > up
>> > > > > architecture emulation containers does precisely this: I set
>> > > > > up the
>> > > > > namespaces unprivileged with child mount namespaces, but then
>> > > > > I ask
>> > > > > root to bind the userns and kill the process that created it
>> > > > > so I
>> > > > > have a permanent handle to enter the namespace by, so I
>> > > > > suspect
>> > > > > that when our current orchestration systems get more
>> > > > > sophisticated,
>> > > > > they might eventually want to do something like this as well.
>> > > > >
>> > > > > In theory, we could get nsfs to show this information as an
>> > > > > option
>> > > > > (just add a show_options entry to the superblock ops), but
>> > > > > the
>> > > > > problem is that although each namespace has a parent user_ns,
>> > > > > there's no way to get it without digging in the namespace
>> > > > > specific
>> > > > > structure.  Probably we should restructure to move it into
>> > > > > ns_common, then we could display it (and enforce all
>> > > > > namespaces
>> > > > > having owning user_ns) but it would be a
>> > > >
>> > > > I'm missing something here. Is it not already the case that all
>> > > > namespaces have an owning user_ns?
>> > >
>> > > Um, yes, I don't believe I said they don't.  The problem I
>> > > thought you
>> > > were having is that there's no way of seeing what it is.
>> > >
>> > > nsfs is the Namespace filesystem where bound namespaces appear to
>> > > a cat
>> > > of /proc/self/mounts.  It can display any information that's in
>> > > ns_common (the common core of namespaces) but the owning user_ns
>> > > pointer currently isn't in this structure.  Every user namespace
>> > > has a
>> > > pointer to it, but they're all privately embedded in the
>> > > individual
>> > > namespace specific structures.  What I was proposing was that
>> > > since
>> > > every current namespace has a pointer somewhere to the owning
>> > > user
>> > > namespace, we could abstract this out into ns_common so it's now
>> > > accessible to be displayed by nsfs, probably as a mount option.
>> >
>> > James, I am not sure that I understood you correctly. We have one
>> > file system for all namespace files, how we can show per-file
>> > properties
>> > in mount options. I think we can show all required information in
>> > fdinfo. We open a namespaces file (/proc/pid/ns/N) and then read
>> > /proc/pid/fdinfo/X for it.
>>
>> Here is a proof-of-concept patch.
>>
>> How it works:
>>
>> In [1]: import os
>>
>> In [2]: fd = os.open("/proc/self/ns/pid", os.O_RDONLY)
>>
>> In [3]: print open("/proc/self/fdinfo/%d" % fd).read()
>> pos:    0
>> flags:  010
>> mnt_id: 2
>> userns: 4026531837
>>
>> In [4]: print "/proc/self/ns/user -> %s" % os.readlink("/proc/self/ns/user")
>> /proc/self/ns/user -> user:[4026531837]
>
> can't you just do
>
> readlink /proc/self/ns/user | sed 's/.*\[\(.*\)\]/\1/'

We can get fdinfo for any ns file. I used /proc/self/ns/pid as an example.

Look at another example:

[root@fc22-vm ~]# cat /proc/self/mountinfo | grep pid_ns_file
115 38 0:3 pid:[4026532306] /tmp/pid_ns_file rw shared:67 - nsfs nsfs rw

In [4]: print open("/proc/self/fdinfo/5").read()
pos: 0
flags: 010
mnt_id: 115
userns: 4026532305


In [5]: os.readlink("/proc/self/ns/user")
Out[5]: 'user:[4026531837]'
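
The mnt_id in fdinfo (115) is the mount ID in the first column of the
mountinfo line, and the root field carries the namespace type and
inode. A rough Python 3 sketch for pulling the nsfs entries out of
mountinfo text; the function name is hypothetical and the field layout
is the one documented for /proc/[pid]/mountinfo:

```python
def parse_nsfs_mounts(mountinfo_text):
    """Return one dict per nsfs bind mount found in mountinfo text."""
    mounts = []
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if "-" not in fields:
            continue
        # "-" separates the optional fields from the filesystem type.
        sep = fields.index("-")
        if len(fields) <= sep + 1 or fields[sep + 1] != "nsfs":
            continue
        # Root field looks like "pid:[4026532306]".
        ns_type, _, rest = fields[3].partition(":[")
        if not rest.endswith("]"):
            continue
        mounts.append({
            "mnt_id": int(fields[0]),
            "ns_type": ns_type,
            "ns_inode": int(rest[:-1]),
            "mount_point": fields[4],
        })
    return mounts
```

This only finds bind-mounted namespaces; it says nothing about the
owning user namespace, which is the part fdinfo would add.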


>
> ?
>
> But what Michael was asking about was the parent user_ns of all the
> other namespaces ... I don't think there's any way we can get that out
> of any information in /proc/self/
>
> James
>
>
> 

Re: Introspecting userns relationships to other namespaces?

2016-07-07 Thread W. Trevor King
On Thu, Jul 07, 2016 at 08:26:47PM -0700, James Bottomley wrote:
> On Thu, 2016-07-07 at 20:00 -0700, Andrew Vagin wrote:
> > On Thu, Jul 07, 2016 at 07:16:18PM -0700, Andrew Vagin wrote:
> > > I think we can show all required information in fdinfo. We open
> > > a namespaces file (/proc/pid/ns/N) and then read
> > > /proc/pid/fdinfo/X for it.
> > 
> > Here is a proof-of-concept patch.
> > …
> > In [2]: fd = os.open("/proc/self/ns/pid", os.O_RDONLY)
> > 
> > In [3]: print open("/proc/self/fdinfo/%d" % fd).read()
> > pos:    0
> > flags:  010
> > mnt_id: 2
> > userns: 4026531837
> > 
> > In [4]: print "/proc/self/ns/user -> %s" % os.readlink("/proc/self/ns/user")
> > /proc/self/ns/user -> user:[4026531837]
> 
> can't you just do
> 
> readlink /proc/self/ns/user | sed 's/.*\[\(.*\)\]/\1/'

With Andrew's fdinfo approach you know the user namespace owning
/proc/self/ns/pid is 4026531837.  That happens to be
/proc/self/ns/user in this case, but doesn't have to be in general.
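
A small Python 3 sketch of that comparison, with hypothetical helper
names. It parses a readlink() target the same way the sed expression
does and checks it against a "userns:" value as printed by the
proof-of-concept fdinfo patch (an assumption: that field exists only
with the patch applied).

```python
import re

# Matches link targets such as "user:[4026531837]".
_NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a /proc/<pid>/ns/<name> link target into (type, inode)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError("not a namespace link: %r" % (target,))
    return match.group(1), int(match.group(2))


def owner_is_current_userns(fdinfo_userns, user_link):
    """Compare an fdinfo 'userns:' value with our own ns/user link."""
    ns_type, inode = parse_ns_link(user_link)
    return ns_type == "user" and inode == int(fdinfo_userns)
```

With the numbers from this thread, the check is true for
/proc/self/ns/pid (4026531837) and false for /tmp/pid_ns_file, whose
owner was reported as 4026532305.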

> But what Michael was asking about was the parent user_ns of all the
> other namespaces ... I don't think there's any way we can get that
> out of any information in /proc/self/

If fdinfo only shows immediate parents, you'd need to walk the tree to
get back to the root (and at each layer of the PID namespace tree
there will be another user-namespace parent branching off).  With a
tree like:

  Namespace         | Parent       | Owning userns
 -------------------+--------------+-------------------
  Root userns       | -            | -
  Root PID ns       | -            | Root userns
  Child userns      | Root userns  | Root userns
  Child PID ns      | Root PID ns  | Root userns
  Grandchild userns | Child userns | Child userns
  Grandchild PID ns | Child PID ns | Grandchild userns

Walking from the grandchild PID namespace would give you:

  Grandchild PID ns
  |-- Child PID ns
  |   |-- Root PID ns
  |   `-- Root userns
  `-- Grandchild userns
      `-- Child userns
          `-- Root userns
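
The walk can be sketched in Python 3 with the table hard-coded as data.
Everything here is illustrative: the names and the dict layout are made
up, and a real tool would have to obtain each (parent, owner) pair from
the kernel somehow.

```python
# (parent, owning userns) for each namespace in the table; None is "-".
NAMESPACES = {
    "Root userns":       (None, None),
    "Root PID ns":       (None, "Root userns"),
    "Child userns":      ("Root userns", "Root userns"),
    "Child PID ns":      ("Root PID ns", "Root userns"),
    "Grandchild userns": ("Child userns", "Child userns"),
    "Grandchild PID ns": ("Child PID ns", "Grandchild userns"),
}


def ancestors(name):
    """Distinct namespaces one step up, via parent and via owner."""
    result = []
    for candidate in NAMESPACES[name]:
        if candidate is not None and candidate not in result:
            result.append(candidate)
    return result


def render(name, prefix=""):
    """Return the indented tree lines below the given namespace."""
    lines = []
    ups = ancestors(name)
    for i, up in enumerate(ups):
        last = i == len(ups) - 1
        lines.append(prefix + ("`-- " if last else "|-- ") + up)
        lines.extend(render(up, prefix + ("    " if last else "|   ")))
    return lines


def walk(name):
    """Return the whole ancestor tree for a namespace as text."""
    return "\n".join([name] + render(name))
```

print(walk("Grandchild PID ns")) gives the tree above with one extra
line: this strict version also lists Root userns beneath Root PID ns,
because the table names it as that namespace's owner.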

If you only put one level in fdinfo, you're stuck if one of the
namespaces involved has neither a bind mount nor a PID to give you a
handle on it [1].  And if you want to put that whole ancestor tree in
fdinfo, you have to come up with some way to handle the two-parent
branching.

I'm also not sure how exposing nsfs information [2] would handle
namespaces that had neither a surviving bind mount nor a direct
process.

If all the information is available (possibly after a mechanical patch
[3] makes it more accessible), then it seems easier to put it in a
separate /proc or /sys file.  There was a stab at this for PID
namespaces in [4] (the same series that landed NStgid, etc.) with
additional background and alternative approaches in [5].  There were
problems with that patch (and it was trying to do more by also listing
a process's ID in each PID namespace), but the “let's put the whole
tree in a new file” approach seems sound to me.

Cheers,
Trevor

[1]: http://thread.gmane.org/gmane.linux.kernel.containers/30456/focus=20536
     Subject: Re: Introspecting userns relationships to other namespaces?
     Date: Thu, 7 Jul 2016 13:24:42 -0500
     Message-ID: <20160707182442.ga6...@mail.hallyn.com>
[2]: http://thread.gmane.org/gmane.linux.kernel.containers/30456/focus=30499
     Subject: Re: [CRIU] Introspecting userns relationships to other namespaces?
     Date: Thu, 07 Jul 2016 20:20:05 -0700
     Message-ID: <1467948005.2322.84.ca...@hansenpartnership.com>
[3]: http://thread.gmane.org/gmane.linux.kernel.containers/30456/focus=20537
     Subject: Re: Introspecting userns relationships to other namespaces?
     Date: Thu, 07 Jul 2016 08:01:52 -0700
     Message-ID: <1467903712.2347.16.ca...@hansenpartnership.com>
[4]: http://thread.gmane.org/gmane.linux.kernel.containers/28925/focus=28928
     Subject: [resend][PATCH v9 1/3] procfs: show hierarchy of pid namespace
     Date: Tue, 23 Dec 2014 18:20:37 +0800
     Message-ID: <1419330039-29207-2-git-send-email-chenhanx...@cn.fujitsu.com>
[5]: http://thread.gmane.org/gmane.linux.kernel.containers/28105
     Subject: [RFC]Pid conversion between pid namespace
     Date: Thu, 3 Jul 2014 12:18:33 +
     Message-ID: <5871495633F38949900D2BF2DC04883E55C374@G08CNEXMBPEKD02.g08.fujitsu.local>



Re: [CRIU] Introspecting userns relationships to other namespaces?

2016-07-07 Thread Andrew Vagin
On Thu, Jul 07, 2016 at 07:16:18PM -0700, Andrew Vagin wrote:
> On Thu, Jul 07, 2016 at 12:17:35PM -0700, James Bottomley wrote:
> > On Thu, 2016-07-07 at 20:21 +0200, Michael Kerrisk (man-pages) wrote:
> > > On 7 July 2016 at 17:01, James Bottomley wrote:
> > [Serge already answered the parenting issue]
> > > > On Thu, 2016-07-07 at 08:36 -0500, Serge E. Hallyn wrote:
> > > > > Hm.  Probably best-effort based on the process hierarchy.  So 
> > > > > yeah you could probably get a tree into a state that would be 
> > > > > wrongly recreated. Create a new netns, bind mount it, exit;  Have 
> > > > > another task create a new user_ns, bind mount it, exit;  Third 
> > > > > task setns()s first to the new netns then to the new user_ns.  I 
> > > > > suspect criu will recreate that wrongly.
> > > > 
> > > > This is a bit pathological, and you have to be root to do it: so 
> > > > root can set up a nesting hierarchy, bind it and destroy the pids 
> > > > but I know of no current orchestration system which does this.
> > > > 
> > > > Actually, I have to back pedal a bit: the way I currently set up
> > > > architecture emulation containers does precisely this: I set up the
> > > > namespaces unprivileged with child mount namespaces, but then I ask
> > > > root to bind the userns and kill the process that created it so I 
> > > > have a permanent handle to enter the namespace by, so I suspect 
> > > > that when our current orchestration systems get more sophisticated, 
> > > > they might eventually want to do something like this as well.
> > > > 
> > > > In theory, we could get nsfs to show this information as an option
> > > > (just add a show_options entry to the superblock ops), but the 
> > > > problem is that although each namespace has a parent user_ns, 
> > > > there's no way to get it without digging in the namespace specific 
> > > > structure.  Probably we should restructure to move it into 
> > > > ns_common, then we could display it (and enforce all namespaces 
> > > > having owning user_ns) but it would be a
> > > 
> > > I'm missing something here. Is it not already the case that all
> > > namespaces have an owning user_ns?
> > 
> > Um, yes, I don't believe I said they don't.  The problem I thought you
> > were having is that there's no way of seeing what it is.
> > 
> > nsfs is the Namespace filesystem where bound namespaces appear to a cat
> > of /proc/self/mounts.  It can display any information that's in
> > ns_common (the common core of namespaces) but the owning user_ns
> > pointer currently isn't in this structure.  Every user namespace has a
> > pointer to it, but they're all privately embedded in the individual
> > namespace specific structures.  What I was proposing was that since
> > every current namespace has a pointer somewhere to the owning user
> > namespace, we could abstract this out into ns_common so it's now
> > accessible to be displayed by nsfs, probably as a mount option.
> 
> James, I am not sure that I understood you correctly. We have one
> file system for all namespace files, how we can show per-file properties
> in mount options. I think we can show all required information in
> fdinfo. We open a namespaces file (/proc/pid/ns/N) and then read
> /proc/pid/fdinfo/X for it.

Here is a proof-of-concept patch.

How it works:

In [1]: import os

In [2]: fd = os.open("/proc/self/ns/pid", os.O_RDONLY)

In [3]: print open("/proc/self/fdinfo/%d" % fd).read()
pos:    0
flags:  010
mnt_id: 2
userns: 4026531837

In [4]: print "/proc/self/ns/user -> %s" % os.readlink("/proc/self/ns/user")
/proc/self/ns/user -> user:[4026531837]
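
For anyone who wants to script this rather than type it into IPython,
here is a rough Python 3 sketch of the same steps.  parse_fdinfo() and
owning_userns() are hypothetical helper names, and the "userns" field
only exists with the proof-of-concept patch applied; on an unpatched
kernel owning_userns() is expected to return None.

```python
import os


def parse_fdinfo(text):
    """Split fdinfo's 'key:<whitespace>value' lines into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def owning_userns(ns_path):
    """Open a namespace file and return its 'userns' inode number, or None.

    Assumes the fdinfo layout shown in the session above.
    """
    fd = os.open(ns_path, os.O_RDONLY)
    try:
        with open("/proc/self/fdinfo/%d" % fd) as f:
            fields = parse_fdinfo(f.read())
    finally:
        os.close(fd)
    value = fields.get("userns")  # absent without the proof-of-concept patch
    return int(value) if value is not None else None


if __name__ == "__main__":
    print(owning_userns("/proc/self/ns/pid"))
```

Keeping the parsing separate from the file access makes it easy to
check the parser against captured fdinfo text without needing a
patched kernel.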

diff --git a/fs/nsfs.c b/fs/nsfs.c
index 8f20d60..bfd5bde 100644
--- a/fs/nsfs.c
+++ b/fs/nsfs.c
@@ -8,8 +8,20 @@
 
 static struct vfsmount *nsfs_mnt;
 
+static void show_fdinfo(struct seq_file *m, struct file *f)
+{
+	struct dentry *dentry = f->f_path.dentry;
+	struct inode *inode = d_inode(dentry);
+	const struct proc_ns_operations *ns_ops = dentry->d_fsdata;
+	struct ns_common *ns = inode->i_private;
+
+	if (ns_ops->show_fdinfo)
+		ns_ops->show_fdinfo(m, ns);
+}
+
 static const struct file_operations ns_file_operations = {
 	.llseek		= no_llseek,
+	.show_fdinfo	= show_fdinfo,
 };
 
 static char *ns_dname(struct dentry *dentry, char *buffer, int buflen)
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index de0e771..fed276b 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -18,6 +18,7 @@ struct proc_ns_operations {
 	struct ns_common *(*get)(struct task_struct *task);
 	void (*put)(struct ns_common *ns);
 	int (*install)(struct 

Re: [CRIU] Introspecting userns relationships to other namespaces?

2016-07-07 Thread James Bottomley
On Thu, 2016-07-07 at 20:00 -0700, Andrew Vagin wrote:
> On Thu, Jul 07, 2016 at 07:16:18PM -0700, Andrew Vagin wrote:
> > On Thu, Jul 07, 2016 at 12:17:35PM -0700, James Bottomley wrote:
> > > On Thu, 2016-07-07 at 20:21 +0200, Michael Kerrisk (man-pages) wrote:
> > > > On 7 July 2016 at 17:01, James Bottomley wrote:
> > > [Serge already answered the parenting issue]
> > > > > On Thu, 2016-07-07 at 08:36 -0500, Serge E. Hallyn wrote:
> > > > > > Hm.  Probably best-effort based on the process hierarchy.  So
> > > > > > yeah you could probably get a tree into a state that would be
> > > > > > wrongly recreated. Create a new netns, bind mount it, exit;  Have
> > > > > > another task create a new user_ns, bind mount it, exit;  Third
> > > > > > task setns()s first to the new netns then to the new user_ns.  I
> > > > > > suspect criu will recreate that wrongly.
> > > > >
> > > > > This is a bit pathological, and you have to be root to do it: so
> > > > > root can set up a nesting hierarchy, bind it and destroy the pids
> > > > > but I know of no current orchestration system which does this.
> > > > >
> > > > > Actually, I have to back pedal a bit: the way I currently set up
> > > > > architecture emulation containers does precisely this: I set up the
> > > > > namespaces unprivileged with child mount namespaces, but then I ask
> > > > > root to bind the userns and kill the process that created it so I
> > > > > have a permanent handle to enter the namespace by, so I suspect
> > > > > that when our current orchestration systems get more sophisticated,
> > > > > they might eventually want to do something like this as well.
> > > > >
> > > > > In theory, we could get nsfs to show this information as an option
> > > > > (just add a show_options entry to the superblock ops), but the
> > > > > problem is that although each namespace has a parent user_ns,
> > > > > there's no way to get it without digging in the namespace specific
> > > > > structure.  Probably we should restructure to move it into
> > > > > ns_common, then we could display it (and enforce all namespaces
> > > > > having owning user_ns) but it would be a
> > > >
> > > > I'm missing something here. Is it not already the case that all
> > > > namespaces have an owning user_ns?
> > >
> > > Um, yes, I don't believe I said they don't.  The problem I thought you
> > > were having is that there's no way of seeing what it is.
> > >
> > > nsfs is the Namespace filesystem where bound namespaces appear to a cat
> > > of /proc/self/mounts.  It can display any information that's in
> > > ns_common (the common core of namespaces) but the owning user_ns
> > > pointer currently isn't in this structure.  Every user namespace has a
> > > pointer to it, but they're all privately embedded in the individual
> > > namespace specific structures.  What I was proposing was that since
> > > every current namespace has a pointer somewhere to the owning user
> > > namespace, we could abstract this out into ns_common so it's now
> > > accessible to be displayed by nsfs, probably as a mount option.
> >
> > James, I am not sure that I understood you correctly. We have one
> > file system for all namespace files, how we can show per-file properties
> > in mount options. I think we can show all required information in
> > fdinfo. We open a namespaces file (/proc/pid/ns/N) and then read
> > /proc/pid/fdinfo/X for it.
> 
> Here is a proof-of-concept patch.
> 
> How it works:
> 
> In [1]: import os
> 
> In [2]: fd = os.open("/proc/self/ns/pid", os.O_RDONLY)
> 
> In [3]: print open("/proc/self/fdinfo/%d" % fd).read()
> pos:    0
> flags:  010
> mnt_id: 2
> userns: 4026531837
> 
> In [4]: print "/proc/self/ns/user -> %s" % os.readlink("/proc/self/ns/user")
> /proc/self/ns/user -> user:[4026531837]

can't you just do

readlink /proc/self/ns/user | sed 's/.*\[\(.*\)\]/\1/'

?

But what Michael was asking about was the parent user_ns of all the
other namespaces ... I don't think there's any way we can get that out
of any information in /proc/self/
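
For scripts that would rather not shell out, the sed one-liner above
has a straightforward Python equivalent.  This is just a sketch:
parse_ns_link() and current_ns() are hypothetical helper names, and the
"type:[inode]" link format is taken from the readlink output quoted
above.

```python
import os
import re

# Link targets under /proc/self/ns/ look like "user:[4026531837]".
_NS_LINK = re.compile(r"^(?P<type>[^:]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target):
    """Split a namespace link target into (type, inode number)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError("not a namespace link target: %r" % (target,))
    return match.group("type"), int(match.group("inode"))


def current_ns(name):
    """Return (type, inode) for one of this process's namespaces (Linux only)."""
    return parse_ns_link(os.readlink("/proc/self/ns/" + name))


if __name__ == "__main__":
    print(current_ns("user"))
```

Like the sed version, this only identifies the namespace the caller is
in; it says nothing about which user namespace owns some other
namespace.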

James




Re: [CRIU] Introspecting userns relationships to other namespaces?

2016-07-07 Thread James Bottomley
On Thu, 2016-07-07 at 19:16 -0700, Andrew Vagin wrote:
> On Thu, Jul 07, 2016 at 12:17:35PM -0700, James Bottomley wrote:
> > On Thu, 2016-07-07 at 20:21 +0200, Michael Kerrisk (man-pages) wrote:
> > > On 7 July 2016 at 17:01, James Bottomley wrote:
> > [Serge already answered the parenting issue]
> > > > On Thu, 2016-07-07 at 08:36 -0500, Serge E. Hallyn wrote:
> > > > > Hm.  Probably best-effort based on the process hierarchy.  So
> > > > > yeah you could probably get a tree into a state that would be
> > > > > wrongly recreated. Create a new netns, bind mount it, exit;  Have
> > > > > another task create a new user_ns, bind mount it, exit;  Third
> > > > > task setns()s first to the new netns then to the new user_ns.  I
> > > > > suspect criu will recreate that wrongly.
> > > >
> > > > This is a bit pathological, and you have to be root to do it: so
> > > > root can set up a nesting hierarchy, bind it and destroy the pids
> > > > but I know of no current orchestration system which does this.
> > > >
> > > > Actually, I have to back pedal a bit: the way I currently set up
> > > > architecture emulation containers does precisely this: I set up the
> > > > namespaces unprivileged with child mount namespaces, but then I ask
> > > > root to bind the userns and kill the process that created it so I
> > > > have a permanent handle to enter the namespace by, so I suspect
> > > > that when our current orchestration systems get more sophisticated,
> > > > they might eventually want to do something like this as well.
> > > >
> > > > In theory, we could get nsfs to show this information as an option
> > > > (just add a show_options entry to the superblock ops), but the
> > > > problem is that although each namespace has a parent user_ns,
> > > > there's no way to get it without digging in the namespace specific
> > > > structure.  Probably we should restructure to move it into
> > > > ns_common, then we could display it (and enforce all namespaces
> > > > having owning user_ns) but it would be a
> > >
> > > I'm missing something here. Is it not already the case that all
> > > namespaces have an owning user_ns?
> >
> > Um, yes, I don't believe I said they don't.  The problem I thought you
> > were having is that there's no way of seeing what it is.
> >
> > nsfs is the Namespace filesystem where bound namespaces appear to a cat
> > of /proc/self/mounts.  It can display any information that's in
> > ns_common (the common core of namespaces) but the owning user_ns
> > pointer currently isn't in this structure.  Every user namespace has a
> > pointer to it, but they're all privately embedded in the individual
> > namespace specific structures.  What I was proposing was that since
> > every current namespace has a pointer somewhere to the owning user
> > namespace, we could abstract this out into ns_common so it's now
> > accessible to be displayed by nsfs, probably as a mount option.
> 
> James, I am not sure that I understood you correctly. We have one
> file system for all namespace files, how we can show per-file 
> properties in mount options.

We have two ways of getting information.  For a namespace that only
exists as a bind mount we only have what the mount/mountinfo shows, so
you see something like this:

jejb@jarvis:~> mount|grep nsfs
nsfs on /run/build-container/userns type nsfs (rw)
nsfs on /run/build-container/ppc64 type nsfs (rw)

The (rw) entries are the mount options.  We could extend what is shown
there via the superblock .show_options callback, and make it show the
type and parent user namespace.
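
As a rough illustration of what a consumer of those mount options might
do, here is a small Python sketch that picks the nsfs lines out of
mount(8)-style output and splits the option list.  nsfs_mounts() is a
hypothetical helper name, and no new option names are assumed: anything
a future .show_options callback printed would simply appear as extra
entries in the list.

```python
import re

# Matches lines such as: nsfs on /run/build-container/userns type nsfs (rw)
_MOUNT_LINE = re.compile(
    r"^(?P<source>\S+) on (?P<target>.+) type (?P<fstype>\S+) "
    r"\((?P<options>[^)]*)\)$"
)


def nsfs_mounts(mount_output):
    """Return (mount point, [options]) for each nsfs line in mount output."""
    found = []
    for line in mount_output.splitlines():
        match = _MOUNT_LINE.match(line.strip())
        if match and match.group("fstype") == "nsfs":
            options = match.group("options").split(",")
            found.append((match.group("target"), options))
    return found


if __name__ == "__main__":
    sample = (
        "nsfs on /run/build-container/userns type nsfs (rw)\n"
        "nsfs on /run/build-container/ppc64 type nsfs (rw)\n"
    )
    for target, options in nsfs_mounts(sample):
        print(target, options)
```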

>  I think we can show all required information in fdinfo. We open a
> namespaces file (/proc/pid/ns/N) and then read /proc/pid/fdinfo/X for
> it.

Not if we don't have an extant process in the namespace: then we can't
use these files because they don't exist.  Plus, fdinfo on the
/proc//ns/X doesn't tell you what the parent user_ns of X is
(again, we could add this information somewhere ... not sure where
yet).

James



Re: [CRIU] Introspecting userns relationships to other namespaces?

2016-07-07 Thread James Bottomley
On Thu, 2016-07-07 at 19:16 -0700, Andrew Vagin wrote:
> On Thu, Jul 07, 2016 at 12:17:35PM -0700, James Bottomley wrote:
> > On Thu, 2016-07-07 at 20:21 +0200, Michael Kerrisk (man-pages)
> > wrote:
> > > On 7 July 2016 at 17:01, James Bottomley
> > >  wrote:
> > [Serge already answered the parenting issue]
> > > > On Thu, 2016-07-07 at 08:36 -0500, Serge E. Hallyn wrote:
> > > > > Hm.  Probably best-effort based on the process hierarchy.  So
> > > > > yeah you could probably get a tree into a state that would be
> > > > > wrongly recreated. Create a new netns, bind mount it, exit; 
> > > > >  Have 
> > > > > another task create a new user_ns, bind mount it, exit; 
> > > > >  Third 
> > > > > task setns()s first to the new netns then to the new user_ns.
> > > > >   I 
> > > > > suspect criu will recreate that wrongly.
> > > > 
> > > > This is a bit pathological, and you have to be root to do it:
> > > > so 
> > > > root can set up a nesting hierarchy, bind it and destroy the
> > > > pids 
> > > > but I know of no current orchestration system which does this.
> > > > 
> > > > Actually, I have to back pedal a bit: the way I currently set
> > > > up
> > > > architecture emulation containers does precisely this: I set up
> > > > the
> > > > namespaces unprivileged with child mount namespaces, but then I
> > > > ask
> > > > root to bind the userns and kill the process that created it so
> > > > I 
> > > > have a permanent handle to enter the namespace by, so I suspect
> > > > that when our current orchestration systems get more
> > > > sophisticated, 
> > > > they might eventually want to do something like this as well.
> > > > 
> > > > In theory, we could get nsfs to show this information as an
> > > > option
> > > > (just add a show_options entry to the superblock ops), but the 
> > > > problem is that although each namespace has a parent user_ns, 
> > > > there's no way to get it without digging in the namespace
> > > > specific 
> > > > structure.  Probably we should restructure to move it into 
> > > > ns_common, then we could display it (and enforce all namespaces
> > > > having owning user_ns) but it would be a
> > > 
> > > I'm missing something here. Is it not already the case that all
> > > namespaces have an owning user_ns?
> > 
> > Um, yes, I don't believe I said they don't.  The problem I thought
> > you were having is that there's no way of seeing what it is.
> >
> > nsfs is the Namespace filesystem where bound namespaces appear to a
> > cat of /proc/self/mounts.  It can display any information that's in
> > ns_common (the common core of namespaces) but the owning user_ns
> > pointer currently isn't in this structure.  Every user namespace
> > has a pointer to it, but they're all privately embedded in the
> > individual namespace specific structures.  What I was proposing was
> > that since every current namespace has a pointer somewhere to the
> > owning user namespace, we could abstract this out into ns_common so
> > it's now accessible to be displayed by nsfs, probably as a mount
> > option.
> 
> James, I am not sure that I understood you correctly. We have one
> file system for all namespace files, so how can we show per-file
> properties in mount options?

We have two ways of getting information.  For a namespace that only
exists as a bind mount we only have what the mount/mountinfo shows, so
you see something like this:

jejb@jarvis:~> mount|grep nsfs
nsfs on /run/build-container/userns type nsfs (rw)
nsfs on /run/build-container/ppc64 type nsfs (rw)

The (rw) parts are the mount options.  We could add the ability to add other
mount options to this via the superblock .show_options callback.  We
could make it show the type and parent user namespace.
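
As a rough illustration of what a tool sees today, here is a minimal
Python sketch that picks the nsfs entries out of text in the
/proc/self/mounts layout.  The SAMPLE string is an assumption: it is the
listing above rewritten by hand into that layout, and the helper names
are hypothetical.  It only shows that the options column currently
carries nothing beyond "rw", which is where a type or owning user_ns
would have to appear.

```python
from typing import List, NamedTuple


class NsfsMount(NamedTuple):
    mountpoint: str
    options: List[str]


def parse_nsfs_mounts(mounts_text: str) -> List[NsfsMount]:
    """Return the nsfs entries found in /proc/self/mounts-style text."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete mount entry
        _device, mountpoint, fstype, options = fields[:4]
        if fstype == "nsfs":
            found.append(NsfsMount(mountpoint, options.split(",")))
    return found


# Assumed sample: the two bind mounts shown above, in /proc/self/mounts form.
SAMPLE = """\
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
nsfs /run/build-container/userns nsfs rw 0 0
nsfs /run/build-container/ppc64 nsfs rw 0 0
"""

if __name__ == "__main__":
    for mount in parse_nsfs_mounts(SAMPLE):
        print(mount.mountpoint, mount.options)
```

Nothing in either entry says which kind of namespace is bound there or
which user_ns owns it.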

>  I think we can show all required information in fdinfo. We open a
> namespaces file (/proc/pid/ns/N) and then read /proc/pid/fdinfo/X for
> it.

Not if we don't have an extant process in the namespace: we can't use
these files because they don't exist.  Plus, fdinfo on the
/proc//ns/X doesn't tell you what the parent user_ns of X is
(again, we could add this information somewhere ... not sure where
yet).
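
For reference, a minimal Python sketch of what the per-process ns links
do give you.  The only fact relied on is that reading such a link
yields text of the form "type:[inode]"; the function names are
hypothetical and the inode number in the comment is just an example
value.

```python
import os
import re
from typing import Dict, Tuple

# A namespace link target looks like "user:[4026531837]".
_NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target: str) -> Tuple[str, int]:
    """Split a namespace link target into (type, inode number)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError("not a namespace link target: %r" % (target,))
    return match.group("kind"), int(match.group("inode"))


def namespaces_of(pid: str = "self") -> Dict[str, Tuple[str, int]]:
    """Map each entry under /proc/<pid>/ns to its (type, inode)."""
    ns_dir = "/proc/%s/ns" % pid
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        try:
            result[name] = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        except (OSError, ValueError):
            continue  # unreadable or unexpected entry; skip it
    return result


if __name__ == "__main__":
    if os.path.isdir("/proc/self/ns"):
        for name, (kind, inode) in namespaces_of().items():
            print(name, kind, inode)
```

The result identifies each namespace, but carries no field for the
owning or parent user_ns, and it needs a live pid to read from.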

James



Re: Introspecting userns relationships to other namespaces?

2016-07-07 Thread James Bottomley
On Thu, 2016-07-07 at 20:21 +0200, Michael Kerrisk (man-pages) wrote:
> On 7 July 2016 at 17:01, James Bottomley
>  wrote:
[Serge already answered the parenting issue]
> > On Thu, 2016-07-07 at 08:36 -0500, Serge E. Hallyn wrote:
> > > Hm.  Probably best-effort based on the process hierarchy.  So 
> > > yeah you could probably get a tree into a state that would be 
> > > wrongly recreated. Create a new netns, bind mount it, exit;  Have 
> > > another task create a new user_ns, bind mount it, exit;  Third 
> > > task setns()s first to the new netns then to the new user_ns.  I 
> > > suspect criu will recreate that wrongly.
> > 
> > This is a bit pathological, and you have to be root to do it: so 
> > root can set up a nesting hierarchy, bind it and destroy the pids 
> > but I know of no current orchestration system which does this.
> > 
> > Actually, I have to back pedal a bit: the way I currently set up
> > architecture emulation containers does precisely this: I set up the
> > namespaces unprivileged with child mount namespaces, but then I ask
> > root to bind the userns and kill the process that created it so I 
> > have a permanent handle to enter the namespace by, so I suspect 
> > that when our current orchestration systems get more sophisticated, 
> > they might eventually want to do something like this as well.
> > 
> > In theory, we could get nsfs to show this information as an option
> > (just add a show_options entry to the superblock ops), but the 
> > problem is that although each namespace has a parent user_ns, 
> > there's no way to get it without digging in the namespace specific 
> > structure.  Probably we should restructure to move it into 
> > ns_common, then we could display it (and enforce all namespaces 
> > having owning user_ns) but it would be a
> 
> I'm missing something here. Is it not already the case that all
> namespaces have an owning user_ns?

Um, yes, I don't believe I said they don't.  The problem I thought you
were having is that there's no way of seeing what it is.

nsfs is the Namespace filesystem where bound namespaces appear to a cat
of /proc/self/mounts.  It can display any information that's in
ns_common (the common core of namespaces) but the owning user_ns
pointer currently isn't in this structure.  Every user namespace has a
pointer to it, but they're all privately embedded in the individual
namespace specific structures.  What I was proposing was that since
every current namespace has a pointer somewhere to the owning user
namespace, we could abstract this out into ns_common so it's now
accessible to be displayed by nsfs, probably as a mount option.

James




Re: Introspecting userns relationships to other namespaces?

2016-07-07 Thread Serge E. Hallyn
Quoting Michael Kerrisk (man-pages) (mtk.manpa...@gmail.com):
> On 7 July 2016 at 17:01, James Bottomley
>  wrote:
> > On Thu, 2016-07-07 at 08:36 -0500, Serge E. Hallyn wrote:
> >> Quoting Michael Kerrisk (man-pages) (mtk.manpa...@gmail.com):
> >> > Hi Serge,
> >> >
> >> > On 6 July 2016 at 16:13, Serge E. Hallyn  wrote:
> >> > > On Wed, Jul 06, 2016 at 10:41:48AM +0200, Michael Kerrisk (man
> >> > > -pages) wrote:
> >> > > > [Rats! Doing now what I should have done to start with. Looping
> >> > > > some lists and CRIU and other possibly relevant people into
> >> > > > this conversation]
> >> > > >
> >> > > > Hi Eric,
> >> > > >
> >> > > > On 5 July 2016 at 23:47, Eric W. Biederman <
> >> > > > ebied...@xmission.com> wrote:
> >> > > > > "Michael Kerrisk (man-pages)" 
> >> > > > > writes:
> >> > > > >
> >> > > > > > Hi Eric,
> >> > > > > >
> >> > > > > > I have a question. Is there any way currently to discover
> >> > > > > > which user namespace a particular nonuser namespace is
> >> > > > > > governed by? Maybe I am missing something, but there does
> >> > > > > > not seem to be a way to do this. Also, can one discover
> >> > > > > > which userns is the parent of a given userns? Again, I
> >> > > > > > can't see a way to do this.
> >> > > > > >
> >> > > > > > The point here is introspecting so that a process might
> >> > > > > > determine what its capabilities are when operating on some
> >> > > > > > resource governed by a (nonuser) namespace.
> >> > > > >
> >> > > > > To the best of my knowledge there is not an interface to
> >> > > > > get that information.  It would be good to have such an
> >> > > > > interface for no other reason than the CRIU folks are going
> >> > > > > to need it at some point.  I am a bit surprised they have not
> >> > > > > complained yet.
> >> > >
> >> > > I don't think they need it.  They do in fact have what they need.
> >> > >   Assume you have tasks T1, T2, T1_1 and T2_1;  T1 and T2 are in
> >> > > init_user_ns;  T1 spawned T1_1 in a new userns;  T2 spawned T2_1
> >> > > which setns()d to T1_1's ns. There's some {handwave} uid mapping,
> >> > > does not matter.
> >> > >
> >> > > At restart, it doesn't matter which task originally created the
> >> > > new userns. criu knows T1_1 and T2_1 are in the same userns;  it
> >> > > creates the userns, sets up the mapping, and T1_1 and T2_1
> >> > > setns() to it.
> >> >
> >> > I'm missing something here. How do the parental relationships
> >> > between the user namespaces get reconstructed? Those relationships
> >> > will govern what capabilities a process will have in various user
> >> > namespaces.
> >
> > Actually, you get the parent namespace from the process tree by
> > tracking the user namespaces of the parent pids.   Currently non-root
> > users can't bind the namespace, so the only way to keep a new user_ns
> > around if you're not root is to keep the process around, so for
> > multiply nested user namespaces you can usually build the user_ns
> > hierarchy by looking at the process hierarchy.  Conversely, if the
> > process is reparented to init, chances are that the user_ns is also
> > parented to init_user_ns.
> 
> Yes, but "chances are" == this isn't robust.  PR_SET_CHILD_SUBREAPER
> further complicates things.
> 
> By the way, is that really what happens? Do child user namespaces get
> reparented to the grandparent ns if the parent ns disappears (i.e.,

The parent ns cannot disappear.  The child ns pins the creator's cred,
which pins the parent user_ns.
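
A toy Python model of that reference chain, purely as an analogy: the
UserNs and Cred classes are hypothetical stand-ins, not the kernel
structures, and Python object references stand in for kernel reference
counts.

```python
import gc
import weakref


class Cred:
    def __init__(self, user_ns):
        self.user_ns = user_ns  # strong reference: cred -> its user namespace


class UserNs:
    def __init__(self, name, creator_cred=None):
        self.name = name
        self.creator_cred = creator_cred  # strong reference: child -> creator's cred


def demo():
    """Return (parent alive while child exists, parent alive after child is gone)."""
    parent = UserNs("parent")
    child = UserNs("child", creator_cred=Cred(parent))
    parent_ref = weakref.ref(parent)

    del parent  # the last direct user of the parent goes away
    gc.collect()
    alive_with_child = parent_ref() is not None

    del child  # dropping the child releases the whole chain
    gc.collect()
    alive_without_child = parent_ref() is not None
    return alive_with_child, alive_without_child


if __name__ == "__main__":
    print(demo())
```

In this model the parent stays reachable for as long as the child
exists, so there is never a moment at which the child would need
reparenting.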



Re: Introspecting userns relationships to other namespaces?

2016-07-07 Thread Michael Kerrisk (man-pages)
On 7 July 2016 at 17:01, James Bottomley
 wrote:
> On Thu, 2016-07-07 at 08:36 -0500, Serge E. Hallyn wrote:
>> Quoting Michael Kerrisk (man-pages) (mtk.manpa...@gmail.com):
>> > Hi Serge,
>> >
>> > On 6 July 2016 at 16:13, Serge E. Hallyn  wrote:
>> > > On Wed, Jul 06, 2016 at 10:41:48AM +0200, Michael Kerrisk (man
>> > > -pages) wrote:
>> > > > [Rats! Doing now what I should have done to start with. Looping
>> > > > some lists and CRIU and other possibly relevant people into
>> > > > this conversation]
>> > > >
>> > > > Hi Eric,
>> > > >
>> > > > On 5 July 2016 at 23:47, Eric W. Biederman <
>> > > > ebied...@xmission.com> wrote:
>> > > > > "Michael Kerrisk (man-pages)" 
>> > > > > writes:
>> > > > >
>> > > > > > Hi Eric,
>> > > > > >
>> > > > > > I have a question. Is there any way currently to discover
>> > > > > > which user namespace a particular nonuser namespace is
>> > > > > > governed by? Maybe I am missing something, but there does
>> > > > > > not seem to be a way to do this. Also, can one discover
>> > > > > > which userns is the parent of a given userns? Again, I
>> > > > > > can't see a way to do this.
>> > > > > >
>> > > > > > The point here is introspecting so that a process might
>> > > > > > determine what its capabilities are when operating on some
>> > > > > > resource governed by a (nonuser) namespace.
>> > > > >
>> > > > > To the best of my knowledge there is not an interface to
>> > > > > get that information.  It would be good to have such an
>> > > > > interface for no other reason than the CRIU folks are going
>> > > > > to need it at some point.  I am a bit surprised they have not
>> > > > > complained yet.
>> > >
>> > > I don't think they need it.  They do in fact have what they need.
>> > >   Assume you have tasks T1, T2, T1_1 and T2_1;  T1 and T2 are in
>> > > init_user_ns;  T1 spawned T1_1 in a new userns;  T2 spawned T2_1
>> > > which setns()d to T1_1's ns. There's some {handwave} uid mapping,
>> > > does not matter.
>> > >
>> > > At restart, it doesn't matter which task originally created the
>> > > new userns. criu knows T1_1 and T2_1 are in the same userns;  it
>> > > creates the userns, sets up the mapping, and T1_1 and T2_1
>> > > setns() to it.
>> >
>> > I'm missing something here. How do the parental relationships
>> > between the user namespaces get reconstructed? Those relationships
>> > will govern what capabilities a process will have in various user
>> > namespaces.
>
> Actually, you get the parent namespace from the process tree by
> tracking the user namespaces of the parent pids.   Currently non-root
> users can't bind the namespace, so the only way to keep a new user_ns
> around if you're not root is to keep the process around, so for
> multiply nested user namespaces you can usually build the user_ns
> hierarchy by looking at the process hierarchy.  Conversely, if the
> process is reparented to init, chances are that the user_ns is also
> parented to init_user_ns.

Yes, but "chances are" == this isn't robust.  PR_SET_CHILD_SUBREAPER
further complicates things.

By the way, is that really what happens? Do child user namespaces get
reparented to the grandparent ns if the parent ns disappears (i.e.,
ceases to have any members and no bind mounts)? I hadn't thought about
that scenario before. It may be worth documenting in
user_namespaces(7).

>> Hm.  Probably best-effort based on the process hierarchy.  So yeah
>> you could probably get a tree into a state that would be wrongly
>> recreated. Create a new netns, bind mount it, exit;  Have another
>> task create a new user_ns, bind mount it, exit;  Third task setns()s
>> first to the new netns then to the new user_ns.  I suspect criu will
>> recreate that wrongly.
>
> This is a bit pathological, and you have to be root to do it: so root
> can set up a nesting hierarchy, bind it and destroy the pids but I know
> of no current orchestration system which does this.
>
> Actually, I have to back pedal a bit: the way I currently set up
> architecture emulation containers does precisely this: I set up the
> namespaces unprivileged with child mount namespaces, but then I ask
> root to bind the userns and kill the process that created it so I have
> a permanent handle to enter the namespace by, so I suspect that when
> our current orchestration systems get more sophisticated, they might
> eventually want to do something like this as well.
>
> In theory, we could get nsfs to show this information as an option
> (just add a show_options entry to the superblock ops), but the problem
> is that although each namespace has a parent user_ns, there's no way to
> get it without digging in the namespace specific structure.  Probably
> we should restructure to move it into ns_common, then we could display
> it (and enforce all namespaces having owning user_ns) but it would be a

I'm missing something here. Is it not already the case that all
namespaces have an owning user_ns?

Cheers,

Michael

> reasonably large (but 

Re: Introspecting userns relationships to other namespaces?

2016-07-07 Thread James Bottomley
On Thu, 2016-07-07 at 08:36 -0500, Serge E. Hallyn wrote:
> Quoting Michael Kerrisk (man-pages) (mtk.manpa...@gmail.com):
> > Hi Serge,
> > 
> > On 6 July 2016 at 16:13, Serge E. Hallyn  wrote:
> > > On Wed, Jul 06, 2016 at 10:41:48AM +0200, Michael Kerrisk (man
> > > -pages) wrote:
> > > > [Rats! Doing now what I should have done to start with. Looping
> > > > some lists and CRIU and other possibly relevant people into 
> > > > this conversation]
> > > > 
> > > > Hi Eric,
> > > > 
> > > > On 5 July 2016 at 23:47, Eric W. Biederman <
> > > > ebied...@xmission.com> wrote:
> > > > > "Michael Kerrisk (man-pages)" 
> > > > > writes:
> > > > > 
> > > > > > Hi Eric,
> > > > > > 
> > > > > > I have a question. Is there any way currently to discover 
> > > > > > which user namespace a particular nonuser namespace is 
> > > > > > governed by? Maybe I am missing something, but there does 
> > > > > > not seem to be a way to do this. Also, can one discover 
> > > > > > which userns is the parent of a given userns? Again, I 
> > > > > > can't see a way to do this.
> > > > > > 
> > > > > > The point here is introspecting so that a process might 
> > > > > > determine what its capabilities are when operating on some 
> > > > > > resource governed by a (nonuser) namespace.
> > > > > 
> > > > > To the best of my knowledge there is not an interface to
> > > > > get that information.  It would be good to have such an 
> > > > > interface for no other reason than the CRIU folks are going 
> > > > > to need it at some point.  I am a bit surprised they have not
> > > > > complained yet.
> > > 
> > > I don't think they need it.  They do in fact have what they need.
> > >   Assume you have tasks T1, T2, T1_1 and T2_1;  T1 and T2 are in
> > > init_user_ns;  T1 spawned T1_1 in a new userns;  T2 spawned T2_1 
> > > which setns()d to T1_1's ns. There's some {handwave} uid mapping,
> > > does not matter.
> > > 
> > > At restart, it doesn't matter which task originally created the 
> > > new userns. criu knows T1_1 and T2_1 are in the same userns;  it 
> > > creates the userns, sets up the mapping, and T1_1 and T2_1
> > > setns() to it.
> > 
> > I'm missing something here. How do the parental relationships
> > between the user namespaces get reconstructed? Those relationships
> > will govern what capabilities a process will have in various user
> > namespaces.

Actually, you get the parent namespace from the process tree by
tracking the user namespaces of the parent pids.  Currently non-root
users can't bind the namespace, so the only way to keep a new user_ns
around if you're not root is to keep the process around, so for
multiply nested user namespaces you can usually build the user_ns
hierarchy by looking at the process hierarchy.  Conversely, if the
process is reparented to init, chances are that the user_ns is also
parented to init_user_ns.
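
To make the heuristic concrete, here is a minimal Python sketch of it,
under stated assumptions: the task table is supplied by the caller as
pid -> (ppid, user_ns id), the ids 100/200/300 are made up, and
guess_userns_parents is a hypothetical name.  It is not what criu does;
it only walks up the process tree until it meets a task in a different
user_ns and takes that as the guessed parent.

```python
from typing import Dict, NamedTuple, Optional


class Task(NamedTuple):
    ppid: int
    userns: int  # identifier (e.g. inode number) of the task's user namespace


def guess_userns_parents(tasks: Dict[int, Task]) -> Dict[int, Optional[int]]:
    """Guess each user_ns's parent from the nearest ancestor task in another user_ns."""
    guesses: Dict[int, Optional[int]] = {}
    for pid, task in tasks.items():
        if guesses.get(task.userns) is not None:
            continue  # already have a guess for this namespace
        guess = None
        visited = {pid}
        ppid = task.ppid
        while ppid in tasks and ppid not in visited:
            visited.add(ppid)
            parent = tasks[ppid]
            if parent.userns != task.userns:
                guess = parent.userns
                break
            ppid = parent.ppid
        guesses[task.userns] = guess
    return guesses


# Made-up table: T1 (10) spawned T1_1 (11) in a new user_ns, which has a
# nested child (12); T2 (20) spawned T2_1 (21), which joined T1_1's user_ns.
TASKS = {
    1: Task(0, 100),
    10: Task(1, 100),
    11: Task(10, 200),
    12: Task(11, 300),
    20: Task(1, 100),
    21: Task(20, 200),
}

if __name__ == "__main__":
    print(guess_userns_parents(TASKS))
```

With the table above the guesses are 200 under 100 and 300 under 200.
If task 11 exits and task 12 is reparented to pid 1, the same function
guesses 100 as the parent of 300, which is the "chances are" case: the
answer is only as good as the process tree, and a namespace kept alive
by a bind mount alone never shows up at all.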

> Hm.  Probably best-effort based on the process hierarchy.  So yeah
> you could probably get a tree into a state that would be wrongly
> recreated. Create a new netns, bind mount it, exit;  Have another 
> task create a new user_ns, bind mount it, exit;  Third task setns()s 
> first to the new netns then to the new user_ns.  I suspect criu will 
> recreate that wrongly.

This is a bit pathological, and you have to be root to do it: so root
can set up a nesting hierarchy, bind it and destroy the pids but I know
of no current orchestration system which does this.

Actually, I have to back pedal a bit: the way I currently set up
architecture emulation containers does precisely this: I set up the
namespaces unprivileged with child mount namespaces, but then I ask
root to bind the userns and kill the process that created it so I have
a permanent handle to enter the namespace by, so I suspect that when
our current orchestration systems get more sophisticated, they might
eventually want to do something like this as well.

In theory, we could get nsfs to show this information as an option
(just add a show_options entry to the superblock ops), but the problem
is that although each namespace has a parent user_ns, there's no way to
get it without digging in the namespace specific structure.  Probably
we should restructure to move it into ns_common, then we could display
it (and enforce all namespaces having owning user_ns) but it would be a
reasonably large (but mechanical) change.

James



Re: Introspecting userns relationships to other namespaces?

2016-07-07 Thread James Bottomley
On Thu, 2016-07-07 at 08:36 -0500, Serge E. Hallyn wrote:
> Quoting Michael Kerrisk (man-pages) (mtk.manpa...@gmail.com):
> > Hi Serge,
> > 
> > On 6 July 2016 at 16:13, Serge E. Hallyn  wrote:
> > > On Wed, Jul 06, 2016 at 10:41:48AM +0200, Michael Kerrisk (man
> > > -pages) wrote:
> > > > [Rats! Doing now what I should have down to start with. Looping 
> > > > some lists and CRIU and other possibly relevant people into 
> > > > this conversation]
> > > > 
> > > > Hi Eric,
> > > > 
> > > > On 5 July 2016 at 23:47, Eric W. Biederman <
> > > > ebied...@xmission.com> wrote:
> > > > > "Michael Kerrisk (man-pages)" 
> > > > > writes:
> > > > > 
> > > > > > Hi Eric,
> > > > > > 
> > > > > > I have a question. Is there any way currently to discover 
> > > > > > which user namespace a particular nonuser namespace is 
> > > > > > governed by? Maybe I am missing something, but there does 
> > > > > > not seem to be a way to do this. Also, can one discover 
> > > > > > which userns is the parent of a given userns? Again, I 
> > > > > > can't see a way to do this.
> > > > > > 
> > > > > > The point here is introspecting so that a process might 
> > > > > > determine what its capabilities are when operating on some 
> > > > > > resource governed by a (nonuser) namespace.
> > > > > 
> > > > > To the best of my knowledge that there is not an interface to 
> > > > > get that information.  It would be good to have such an 
> > > > > interface for no other reason than the CRIU folks are going 
> > > > > to need it at some point.  I am a bit surprised they have not
> > > > > complained yet.
> > > 
> > > I don't think they need it.  They do in fact have what they need.
> > >   Assume you have tasks T1, T2, T1_1 and T2_1;  T1 and T2 are in
> > > init_user_ns;  T1 spawned T1_1 in a new userns;  T2 spawned T2_1 
> > > which setns()d to T1_1's ns. There's some {handwave} uid mapping,
> > > does not matter.
> > > 
> > > At restart, it doesn't matter which task originally created the 
> > > new userns. criu knows T1_1 and T2_1 are in the same userns;  it 
> > > creates the userns, sets up the mapping, and T1_1 and T2_1
> > > setns() to it.
> > 
> > I'm missing something here. How does the parental relationships
> > between the user namespaces get reconstructed? Those relationships
> > will govern what capabilities a process will have in various user
> > namespaces.

Actually, you get the parent namespace from the process tree by
tracking the user namespaces of the parent pids.  Currently non-root
users can't bind the namespace, so the only way to keep a new user_ns
around if you're not root is to keep the process around, so for
multiply nested user namespaces you can usually build the user_ns
hierarchy by looking at the process hierarchy.  Conversely, if the
process is reparented to init, chances are that the user_ns is also
parented to init_user_ns.
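
A rough sketch of that process-tree heuristic, assuming Python and a
readable /proc.  The function names and the table layout are made up
for illustration.  The result is only a guess: a task whose userns
differs from its parent task's userns names a candidate ancestor, not
a guaranteed direct parent, and a bind-mounted userns with no tasks in
it is invisible to this approach.

```python
import os


def read_proc_table(proc="/proc"):
    """Collect {pid: (ppid, userns_link)} for every task we may inspect."""
    table = {}
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            # The link text looks like "user:[4026531837]".
            userns = os.readlink(f"{proc}/{entry}/ns/user")
            with open(f"{proc}/{entry}/status") as f:
                ppid = next(int(line.split()[1])
                            for line in f if line.startswith("PPid:"))
        except (OSError, StopIteration):
            # Task exited, or we lack permission to look at it.
            continue
        table[int(entry)] = (ppid, userns)
    return table


def guess_userns_parents(table):
    """Map each userns to the set of userns seen on parent tasks.

    A userns only shows up as a key when some task in it has a parent
    task living in a different userns.
    """
    guesses = {}
    for pid, (ppid, userns) in table.items():
        parent = table.get(ppid)
        if parent is not None and parent[1] != userns:
            guesses.setdefault(userns, set()).add(parent[1])
    return guesses
```

On a live system, guess_userns_parents(read_proc_table()) returns a
dict keyed by the readlink text of /proc/PID/ns/user.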

> Hm.  Probably best-effort based on the process hierarchy.  So yeah
> you could probably get a tree into a state that would be wrongly
> recreated. Create a new netns, bind mount it, exit;  Have another 
> task create a new user_ns, bind mount it, exit;  Third task setns()s 
> first to the new netns then to the new user_ns.  I suspect criu will 
> recreate that wrongly.

This is a bit pathological, and you have to be root to do it: so root
can set up a nesting hierarchy, bind it and destroy the pids but I know
of no current orchestration system which does this.

Actually, I have to backpedal a bit: the way I currently set up
architecture emulation containers does precisely this: I set up the
namespaces unprivileged with child mount namespaces, but then I ask
root to bind the userns and kill the process that created it so I have
a permanent handle to enter the namespace by, so I suspect that when
our current orchestration systems get more sophisticated, they might
eventually want to do something like this as well.

In theory, we could get nsfs to show this information as an option
(just add a show_options entry to the superblock ops), but the problem
is that although each namespace has a parent user_ns, there's no way to
get it without digging in the namespace specific structure.  Probably
we should restructure to move it into ns_common, then we could display
it (and enforce all namespaces having owning user_ns) but it would be a
reasonably large (but mechanical) change.

James



Re: Introspecting userns relationships to other namespaces?

2016-07-07 Thread Serge E. Hallyn
Quoting Michael Kerrisk (man-pages) (mtk.manpa...@gmail.com):
> Hi Serge,
> 
> On 6 July 2016 at 16:13, Serge E. Hallyn  wrote:
> > On Wed, Jul 06, 2016 at 10:41:48AM +0200, Michael Kerrisk (man-pages) wrote:
> >> [Rats! Doing now what I should have done to start with. Looping some
> >> lists and CRIU and other possibly relevant people into this
> >> conversation]
> >>
> >> Hi Eric,
> >>
> >> On 5 July 2016 at 23:47, Eric W. Biederman  wrote:
> >> > "Michael Kerrisk (man-pages)"  writes:
> >> >
> >> >> Hi Eric,
> >> >>
> >> >> I have a question. Is there any way currently to discover which
> >> >> user namespace a particular nonuser namespace is governed by?
> >> >> Maybe I am missing something, but there does not seem to be a
> >> >> way to do this. Also, can one discover which userns is the
> >> >> parent of a given userns? Again, I can't see a way to do this.
> >> >>
> >> >> The point here is introspecting so that a process might determine
> >> >> what its capabilities are when operating on some resource governed
> >> >> by a (nonuser) namespace.
> >> >
> >> > To the best of my knowledge there is not an interface to get that
> >> > information.  It would be good to have such an interface for no other
> >> > reason than the CRIU folks are going to need it at some point.  I am a
> >> > bit surprised they have not complained yet.
> >
> > I don't think they need it.  They do in fact have what they need.  Assume
> > you have tasks T1, T2, T1_1 and T2_1;  T1 and T2 are in init_user_ns;  T1
> > spawned T1_1 in a new userns;  T2 spawned T2_1 which setns()d to T1_1's ns.
> > There's some {handwave} uid mapping, does not matter.
> >
> > At restart, it doesn't matter which task originally created the new userns.
> > criu knows T1_1 and T2_1 are in the same userns;  it creates the userns,
> > sets up the mapping, and T1_1 and T2_1 setns() to it.
> 
> I'm missing something here. How do the parental relationships
> between the user namespaces get reconstructed? Those relationships
> will govern what capabilities a process will have in various user
> namespaces.

Hm.  Probably best-effort based on the process hierarchy.  So yeah you
could probably get a tree into a state that would be wrongly recreated.
Create a new netns, bind mount it, exit;  Have another task create a
new user_ns, bind mount it, exit;  Third task setns()s first to the new
netns then to the new user_ns.  I suspect criu will recreate that
wrongly.


Re: Introspecting userns relationships to other namespaces?

2016-07-07 Thread Michael Kerrisk (man-pages)
Hi Serge,

On 6 July 2016 at 16:13, Serge E. Hallyn  wrote:
> On Wed, Jul 06, 2016 at 10:41:48AM +0200, Michael Kerrisk (man-pages) wrote:
>> [Rats! Doing now what I should have done to start with. Looping some
>> lists and CRIU and other possibly relevant people into this
>> conversation]
>>
>> Hi Eric,
>>
>> On 5 July 2016 at 23:47, Eric W. Biederman  wrote:
>> > "Michael Kerrisk (man-pages)"  writes:
>> >
>> >> Hi Eric,
>> >>
>> >> I have a question. Is there any way currently to discover which
>> >> user namespace a particular nonuser namespace is governed by?
>> >> Maybe I am missing something, but there does not seem to be a
>> >> way to do this. Also, can one discover which userns is the
>> >> parent of a given userns? Again, I can't see a way to do this.
>> >>
>> >> The point here is introspecting so that a process might determine
>> >> what its capabilities are when operating on some resource governed
>> >> by a (nonuser) namespace.
>> >
>> > To the best of my knowledge there is not an interface to get that
>> > information.  It would be good to have such an interface for no other
>> > reason than the CRIU folks are going to need it at some point.  I am a
>> > bit surprised they have not complained yet.
>
> I don't think they need it.  They do in fact have what they need.  Assume
> you have tasks T1, T2, T1_1 and T2_1;  T1 and T2 are in init_user_ns;  T1
> spawned T1_1 in a new userns;  T2 spawned T2_1 which setns()d to T1_1's ns.
> There's some {handwave} uid mapping, does not matter.
>
> At restart, it doesn't matter which task originally created the new userns.
> criu knows T1_1 and T2_1 are in the same userns;  it creates the userns, sets
> up the mapping, and T1_1 and T2_1 setns() to it.

I'm missing something here. How do the parental relationships
between the user namespaces get reconstructed? Those relationships
will govern what capabilities a process will have in various user
namespaces.

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: Introspecting userns relationships to other namespaces?

2016-07-06 Thread Eric W. Biederman
"Serge E. Hallyn"  writes:

> On Wed, Jul 06, 2016 at 10:41:48AM +0200, Michael Kerrisk (man-pages) wrote:
>> [Rats! Doing now what I should have done to start with. Looping some
>> lists and CRIU and other possibly relevant people into this
>> conversation]
>> 
>> Hi Eric,
>> 
>> On 5 July 2016 at 23:47, Eric W. Biederman  wrote:
>> > "Michael Kerrisk (man-pages)"  writes:
>> >
>> >> Hi Eric,
>> >>
>> >> I have a question. Is there any way currently to discover which
>> >> user namespace a particular nonuser namespace is governed by?
>> >> Maybe I am missing something, but there does not seem to be a
>> >> way to do this. Also, can one discover which userns is the
>> >> parent of a given userns? Again, I can't see a way to do this.
>> >>
>> >> The point here is introspecting so that a process might determine
>> >> what its capabilities are when operating on some resource governed
>> >> by a (nonuser) namespace.
>> >
>> > To the best of my knowledge there is not an interface to get that
>> > information.  It would be good to have such an interface for no other
>> > reason than the CRIU folks are going to need it at some point.  I am a
>> > bit surprised they have not complained yet.
>
> I don't think they need it.  They do in fact have what they need.  Assume
> you have tasks T1, T2, T1_1 and T2_1;  T1 and T2 are in init_user_ns;  T1
> spawned T1_1 in a new userns;  T2 spawned T2_1 which setns()d to T1_1's ns.
> There's some {handwave} uid mapping, does not matter.
>
> At restart, it doesn't matter which task originally created the new userns.
> criu knows T1_1 and T2_1 are in the same userns;  it creates the userns, sets
> up the mapping, and T1_1 and T2_1 setns() to it.

Given that the simple cases are so easy it probably doesn't matter in
that sense.

However we now have the case where user namespaces own pid namespaces,
and uts namespaces, and network namespaces, and ipc namespaces, and
filesystems.  Throw in some mount propagation and use of setns and
things could get confusing.   It is something that will need to be
figured out if CRIU is going to properly checkpoint containers
containing containers containing containers containing containers.
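
For reference, an interface of this kind did later land in mainline:
ioctl_ns(2) documents NS_GET_USERNS and NS_GET_PARENT for Linux 4.9
and later.  A minimal Python sketch of asking which userns owns a
given namespace follows.  The helper names are hypothetical, and the
numeric request values assume the common _IO() encoding used on x86
and arm (some architectures encode _IO differently).

```python
import fcntl
import os

# _IO(0xb7, 0x1) and _IO(0xb7, 0x2); assumed x86/arm encoding.
NS_GET_USERNS = 0xB701
NS_GET_PARENT = 0xB702


def _ns_identity(fd):
    """Identify a namespace by the device and inode of its nsfs file."""
    st = os.fstat(fd)
    return (st.st_dev, st.st_ino)


def owning_userns(ns_path):
    """Return the identity of the userns that owns the namespace at ns_path.

    ns_path is something like "/proc/self/ns/net".  Raises OSError when
    the kernel lacks the ioctl or the owner is outside the caller's
    namespace scope.
    """
    fd = os.open(ns_path, os.O_RDONLY)
    try:
        # The ioctl returns a new file descriptor for the owning userns.
        owner_fd = fcntl.ioctl(fd, NS_GET_USERNS)
        try:
            return _ns_identity(owner_fd)
        finally:
            os.close(owner_fd)
    finally:
        os.close(fd)
```

Swapping NS_GET_USERNS for NS_GET_PARENT on a user or pid namespace
file would walk the parent chain in the same way.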

Did I mention I like recursion?

>> > That said in a normal use scenario I don't think that information is
>> > needed.
>> >
>> > Do you have a particular use case besides checkpoint/restart where this
>> > is useful?  That might help in coming up with a good userspace interface
>> > for this information.
>> 
>> So, I spend a moderate amount of time working with people to introduce
>> them to the namespaces infrastructure, and one topic that comes up now
>> and then is introspection/visualization tools. For example,
>> nowadays--thanks to the (bizarrely misnamed) NStgid and NSpid fields
>> in /proc/PID--it's possible to (and someone I was working with did)
>> write tools that introspect the PID namespace hierarchy to show all of
>> the processes and their PIDs in the various namespace instances. It's a
>> natural enough thing to want to do, when confronted with the
>> complexity of the namespaces.
>> 
>> Someone else then asked me a question that led me to wonder about
>> generally introspecting on the parental relationships between user
>> namespaces and the association of other namespace types with user
>> namespaces. One use would be visualization, in order to understand the
>> running system. Another would be to answer the question I already
>> mentioned: what capability does process X have to perform operations
>> on a resource governed by namespace Y?
>
> I agree they'll probably want it, but if we wait for a real need and
> use case we can do a better job of providing what's needed.

That too, which is why I mentioned CRIU.  But yeah it will probably take
a little while to get there.

Eric


Re: Introspecting userns relationships to other namespaces?

2016-07-06 Thread Serge E. Hallyn
On Wed, Jul 06, 2016 at 10:41:48AM +0200, Michael Kerrisk (man-pages) wrote:
> [Rats! Doing now what I should have done to start with. Looping some
> lists and CRIU and other possibly relevant people into this
> conversation]
> 
> Hi Eric,
> 
> On 5 July 2016 at 23:47, Eric W. Biederman  wrote:
> > "Michael Kerrisk (man-pages)"  writes:
> >
> >> Hi Eric,
> >>
> >> I have a question. Is there any way currently to discover which
> >> user namespace a particular nonuser namespace is governed by?
> >> Maybe I am missing something, but there does not seem to be a
> >> way to do this. Also, can one discover which userns is the
> >> parent of a given userns? Again, I can't see a way to do this.
> >>
> >> The point here is introspecting so that a process might determine
> >> what its capabilities are when operating on some resource governed
> >> by a (nonuser) namespace.
> >
> > To the best of my knowledge there is not an interface to get that
> > information.  It would be good to have such an interface for no other
> > reason than the CRIU folks are going to need it at some point.  I am a
> > bit surprised they have not complained yet.

I don't think they need it.  They do in fact have what they need.  Assume
you have tasks T1, T2, T1_1 and T2_1;  T1 and T2 are in init_user_ns;  T1
spawned T1_1 in a new userns;  T2 spawned T2_1 which setns()d to T1_1's ns.
There's some {handwave} uid mapping, does not matter.

At restart, it doesn't matter which task originally created the new userns.
criu knows T1_1 and T2_1 are in the same userns;  it creates the userns, sets
up the mapping, and T1_1 and T2_1 setns() to it.
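
To make the "knows they are in the same userns" part concrete, here is
a small Python sketch.  It relies on the documented rule that two
tasks share a namespace when /proc/PID/ns/user resolves to the same
device and inode for both.  The function names are invented for this
example and this is not a description of how criu itself does it.

```python
import os
from collections import defaultdict


def userns_id(pid, proc="/proc"):
    """Return (st_dev, st_ino) identifying the user namespace of pid."""
    st = os.stat(f"{proc}/{pid}/ns/user")
    return (st.st_dev, st.st_ino)


def group_by_userns(ids):
    """Group pids by namespace identity; ids maps pid -> identity."""
    groups = defaultdict(list)
    for pid, ns in sorted(ids.items()):
        groups[ns].append(pid)
    return dict(groups)
```

With T1 and T2 in one userns and T1_1 and T2_1 in another, feeding
{pid: userns_id(pid)} for the four tasks into group_by_userns() gives
two groups, which is all a restore needs in this simple case.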

> > That said in a normal use scenario I don't think that information is
> > needed.
> >
> > Do you have a particular use case besides checkpoint/restart where this
> > is useful?  That might help in coming up with a good userspace interface
> > for this information.
> 
> So, I spend a moderate amount of time working with people to introduce
> them to the namespaces infrastructure, and one topic that comes up now
> and then is introspection/visualization tools. For example,
> nowadays--thanks to the (bizarrely misnamed) NStgid and NSpid fields
> in /proc/PID--it's possible to (and someone I was working with did)
> write tools that introspect the PID namespace hierarchy to show all of
> the processes and their PIDs in the various namespace instances. It's a
> natural enough thing to want to do, when confronted with the
> complexity of the namespaces.
> 
> Someone else then asked me a question that led me to wonder about
> generally introspecting on the parental relationships between user
> namespaces and the association of other namespace types with user
> namespaces. One use would be visualization, in order to understand the
> running system. Another would be to answer the question I already
> mentioned: what capability does process X have to perform operations
> on a resource governed by namespace Y?

I agree they'll probably want it, but if we wait for a real need and
use case we can do a better job of providing what's needed.

-serge


Re: Introspecting userns relationships to other namespaces?

2016-07-06 Thread Michael Kerrisk (man-pages)
[Rats! Doing now what I should have done to start with. Looping some
lists and CRIU and other possibly relevant people into this
conversation]

Hi Eric,

On 5 July 2016 at 23:47, Eric W. Biederman  wrote:
> "Michael Kerrisk (man-pages)"  writes:
>
>> Hi Eric,
>>
>> I have a question. Is there any way currently to discover which
>> user namespace a particular nonuser namespace is governed by?
>> Maybe I am missing something, but there does not seem to be a
>> way to do this. Also, can one discover which userns is the
>> parent of a given userns? Again, I can't see a way to do this.
>>
>> The point here is introspecting so that a process might determine
>> what its capabilities are when operating on some resource governed
>> by a (nonuser) namespace.
>
> To the best of my knowledge there is not an interface to get that
> information.  It would be good to have such an interface for no other
> reason than the CRIU folks are going to need it at some point.  I am a
> bit surprised they have not complained yet.
>
> That said in a normal use scenario I don't think that information is
> needed.
>
> Do you have a particular use case besides checkpoint/restart where this
> is useful?  That might help in coming up with a good userspace interface
> for this information.

So, I spend a moderate amount of time working with people to introduce
them to the namespaces infrastructure, and one topic that comes up now
and then is introspection/visualization tools. For example,
nowadays--thanks to the (bizarrely misnamed) NStgid and NSpid fields
in /proc/PID--it's possible to (and someone I was working with did)
write tools that introspect the PID namespace hierarchy to show all of
the processes and their PIDs in the various namespace instances. It's a
natural enough thing to want to do, when confronted with the
complexity of the namespaces.
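
As a small illustration of what such a tool reads, here is a Python
sketch that pulls the NSpid list out of the text of
/proc/PID/status.  The helper name is made up.  Per proc(5), the
leftmost value is the PID as seen from the PID namespace of the procfs
mount, followed by the values in successively nested namespaces.

```python
def ns_pids(status_text, field="NSpid"):
    """Return the list of PIDs on the NSpid (or NStgid) line.

    status_text is the full contents of /proc/PID/status.  An empty
    list means the field is absent (for example on an older kernel).
    """
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key == field:
            return [int(v) for v in value.split()]
    return []
```

A task nested two PID namespaces deep shows three numbers, and the
last one is its PID inside its own innermost namespace.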

Someone else then asked me a question that led me to wonder about
generally introspecting on the parental relationships between user
namespaces and the association of other namespace types with user
namespaces. One use would be visualization, in order to understand the
running system. Another would be to answer the question I already
mentioned: what capability does process X have to perform operations
on a resource governed by namespace Y?
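
If the parent and owner relationships were visible, the question
could be answered along the following lines.  This is a simplified
Python model of the rules described in user_namespaces(7), not kernel
code: the dict-based data layout and all names are assumptions for
illustration, and it ignores LSMs and other checks.

```python
def has_capability(proc_ns, proc_euid, proc_caps, cap, target_ns,
                   parent, owner):
    """Decide whether a process holds cap relative to target_ns.

    proc_ns   -- userns the process is a member of
    proc_euid -- its effective uid
    proc_caps -- set of capabilities effective in proc_ns
    parent    -- dict mapping each userns to its parent (None at root)
    owner     -- dict mapping each userns to the euid of its creator
    """
    ns = target_ns
    while ns is not None:
        if ns == proc_ns:
            # Member of the target or of one of its ancestors: the
            # effective set decides.
            return cap in proc_caps
        if parent.get(ns) == proc_ns and owner.get(ns) == proc_euid:
            # The process sits in the parent of this userns and its
            # euid owns it, so it holds all capabilities from here down.
            return True
        ns = parent.get(ns)
    # target_ns is not the process's userns or a descendant of it.
    return False
```

The target_ns argument is the userns that owns namespace Y, which is
exactly the piece of information that is hard to obtain today.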

Cheers,

Michael




-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

