Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-17 Thread Eric W. Biederman
James Bottomley  writes:

> On Sat, 2016-05-14 at 21:21 -0500, Eric W. Biederman wrote:
>> James Bottomley  writes:
>> 
>> > On Sat, 2016-05-14 at 10:53 +0100, Djalal Harouni wrote:
>> 
>> Just a couple of quick comments from a very high level design point.
>> 
>> - I think a shiftfs is valuable in the same way that overlayfs is
>>   valuable.
>> 
>>   Especially in the Docker case where a lot of containers want a shared
>>   base image (for efficiency), but it is desirable to run those
>>   containers in different user namespaces for safety.
>> 
>> - It is also the plan to make it possible to mount a filesystem where
>>   the uids and gids of that filesystem on disk do not have a one to one
>>   mapping to kernel uids and gids.  99% of the work has already been done,
>>   for all filesystems except XFS.
>
> Can you elaborate a bit more on why we want to do this?  I think only
> having a single shift of uid_t to kuid_t across the kernel to user
> boundary is a nice feature of user namespaces.  Architecturally, it's
> not such a big thing to do it as the data goes on to the disk as well,
> but what's the use case for it?

fuse/nfs or just plain sanity.  As the data comes off disk we convert it
into the kernel internal form kuid_t and kgid_t.   For shiftfs this
would be converting the uids when they come from your underlying
filesystem to the upper level vfs abstractions.

Converting to the kernel form for a filesystem such as fuse, which does
all that is necessary to keep evil users from breaking the kernel, means
that we can allow users in a user namespace to mount fuse themselves and
supply whatever uids and gids they want in the fuse messages.  If the
uids/gids don't map from the mounting user's user namespace into the
kernel then we set inode->i_uid to INVALID_UID.

That is all we ask of a filesystem, and we are sorting out the rest in
the VFS as nothing sets INVALID_UID in inode->i_uid today.
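
A rough userspace sketch of that rule, as a simplified model rather than
the kernel's code: an id coming off disk (or out of a fuse message) is
looked up in the mounter's mapping, and an id with no mapping becomes
the invalid id. The `*_model` names and the 0:100000:65536 mapping used
in the usage note are assumptions made up for illustration; only
make_kuid, i_uid_write and INVALID_UID come from the discussion above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace model only: in the kernel, kuid_t is a struct wrapper and
 * the mapping lives in the user namespace. */
typedef uint32_t kuid_model_t;
#define INVALID_UID_MODEL ((kuid_model_t)-1)

/* One mapping extent, in the order used by /proc/<pid>/uid_map. */
struct uid_extent {
    uint32_t first;        /* first id as seen inside the namespace */
    uint32_t lower_first;  /* first id in the kernel-global id space */
    uint32_t count;        /* number of consecutive ids covered */
};

/* Model of make_kuid(): translate an id from disk into the
 * kernel-internal id, or report that it has no mapping. */
static kuid_model_t make_kuid_model(const struct uid_extent *map, size_t n,
                                    uint32_t uid)
{
    assert(map != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (uid >= map[i].first && uid - map[i].first < map[i].count)
            return map[i].lower_first + (uid - map[i].first);
    }
    return INVALID_UID_MODEL;
}

/* Hypothetical stand-in for an inode: only the owner field matters here. */
struct inode_model {
    kuid_model_t i_uid;
};

/* Model of i_uid_write(): the filesystem stores whatever the mapping
 * yields, including the invalid id; handling it is left to the VFS. */
static void i_uid_write_model(struct inode_model *inode,
                              const struct uid_extent *map, size_t n,
                              uint32_t disk_uid)
{
    inode->i_uid = make_kuid_model(map, n, disk_uid);
}
```

With a single extent { 0, 100000, 65536 }, an on-disk uid of 1000 would
be stored as 101000, while an on-disk uid of 70000 falls outside the
extent and would be stored as the invalid id.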


>>   That said there are some significant issues to work through, before
>>   something like that can be enabled.
>> 
>>   * Handling of uids/gids on disk that don't map into a kuid/kgid.
>
> So I think this is nicely handled in the capability checks in
> generic_permission() (capable_wrt_inode_uidgid()); is there a need to
> make it more complex (and thus more error prone)?

No, just a need to handle INVALID_UID and INVALID_GID, which we don't
handle today.

>>   * Safety from poisoned filesystem images.
>
> By poisoned FS image, you mean an image over whose internal data the
> user has control?  The basic problem of how do we give users write
> access to data devices they can then cause to be mounted as
> filesystems?

Yes.  For fuse, except for uids and gids, this is already solved; for most
other filesystems it is a whole new world of horror.

The general case of evil usb devices (think android) that look like
block devices but can return whatever they want already exists in the
wild.

>>   I have slowly been working with Seth Forshee on these issues as
>>   the last thing I want is to introduce more security bugs right now.
>>   Seth being a braver man than I am has already merged his changes into
>>   the Ubuntu kernel.
>> 
>>   Right now we are targeting fuse, because fuse is already designed to
>>   handle poisoned filesystem images.  So to safely enable this kind of
>>   mapping for fuse is not a giant step.
>> 
>>   The big thing from my point of view is to get the VFS interfaces
>>   correct so that the VFS handles all of the weird cases that come up
>>   with uids and gids that don't map, and any other weird cases.  Keeping
>>   the weird bits out of the filesystems.
>
> If by VFS interfaces, you mean where we've already got the mapping 
> confined, absolutely.

Yes.  It is just making certain we handle the INVALID_UID and INVALID_GID
that result from a mapping failure, as we don't handle that in 4.6.0.

>> James I think you are missing the fact that all filesystems already 
>> have the make_kuid and make_kgid calls right where the data comes off
>> disk,
>
> I beg to differ: they certainly don't.  The underlying filesystem
> populates the inode in ->lookup with the data off the disk which goes
> into the inode as a kuid_t/kgid_t.  It remains forever in the inode as
> that.  We convert it as it goes out of the kernel in the stat calls
> (actually stat.c:cp_old/new_stat())

They do.  i_uid_write calls make_kuid to map the incoming uid from
disk into a kuid_t.  That is all I was referring to.

The only thing I am looking at infrastructure-wise is to make it so that
we cleanly handle the case when the first parameter to make_kuid is not
&init_user_ns.  That is the core point of Seth's work.

>>  and the from_kuid and from_kgid calls right where the on-disk data
>> is being created just before it goes on disk.  Which means that the
>> actual impact on filesystems of the translation is trivial.
>
> Are you looking at a different tree from me?  I'm actually just looking
> at Linus git head.

Take a look at i_uid_read and 

Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-17 Thread Djalal Harouni
On Sat, May 14, 2016 at 06:46:54AM -0700, James Bottomley wrote:
> On Sat, 2016-05-14 at 10:53 +0100, Djalal Harouni wrote:
> > On Thu, May 12, 2016 at 03:24:12PM -0700, James Bottomley wrote:
> > > On Thu, 2016-05-12 at 20:55 +0100, Djalal Harouni wrote:
> > > > On Wed, May 11, 2016 at 11:33:38AM -0700, James Bottomley wrote:

[...]
> > In this series we don't hijack setfsuid() in an indirect way, 
> > setfsuid maps UIDs into current userns according to rules set by 
> > parent. Changing current_fsuid() to some other mapping is a way to 
> > allow processes to bypass that and use it to access other inodes...
> > This should not change and fsuid should continue to follow these
> > rules...
> 
> Both solutions do this

James, I don't update current_fsuid() nor any other creds field in this
RFC. The reason: suppose I have a pinned mapping of 0:10:65536 that
containers or apps want to use for their own purposes. An app X is
started by a privileged process that sets its global uid to 10 and its
user namespace mapping to 0:10:65536. App X then forks another app Y
with global uid 10, sandboxes it, hides other processes, and sets its
user namespace mapping to 1000:10:1; same thing for app Z with
2000:10:1. It also restricts the set of syscalls for both Y and Z...
Even with all this, they will be able to keep their access to
inode->i_uid == 0, which we don't want since we don't give them a
mapping to 0... We just want them to access inode->i_uid == 1000 for
app Y and 2000 for app Z... They cross user namespaces... they use
another mapping... And if Y forks another app, even if it sets a new
userns mapping with a new restricted range, it will continue to use the
old 65536 range and inodes will show up with real uids instead of
nobody...


> > A cred->fsuid solution is safe or used to be safe only inside
> > init_user_ns where there is always a mapping or in context of current
> > user namespace. In another user namespace with 0:1000:1 mapping, 
> >  you can't set it to arbitrary mapping like 0:4000:1... It will give
> > confined processes access to inodes that satisfy the kuid_t 4000 
> > mapping and which the app/container wants to deny, they only want
> > 0:1000:1. ..
> 
> OK, so both solutions are safe here too.  Your safety comes from only
> remapping in the userns; mine comes from the normal filesystem acl
> rules: either the userns for different users all have disjoint ids
> regulated by /etc/subuidmap or they're all using the same one (like
> docker 1.10) in either case, you could regulate by having the mount
> under a directory which is accessible only to the userns owner.

Please see above comment. Nested unprivileged apps may want to restrict
syscall operations and access to inodes, maybe we don't want the forked
sandboxed app to have access to inodes, and it will be hard if not
impossible if you update global creds each time...


> > We don't cross user namespaces, we don't use different mappings for
> > cred->uid, cred->fsuid...  A clean solution is to shift inodes 
> > UID/GID and not change fsuid to cross namespaces. Not to mention how 
> > it may interact with capabilities...
> 
> This is a subjective question on what constitutes "clean".  I think we
> both think the other solution isn't clean, so that's for others to
> adjudicate.

If you see it that way :-) I just want access from a user namespace
in the safest way possible; if there is a better solution or if my
patches are buggy, I'll drop them... no problem!


> > We follow user namespace rules and we keep "the parent defines a 
> > range that the children can't escape" semantics.  There is a clear 
> > relation between user namespaces that should not be broken.
> 
> OK, so I separated the problem into a userns one, which remaps for the
> processes in user space, and a vfs one which remaps the on-disk id. 
>  However, they could be combined by allowing the userns to mount
> shiftfs but only on designated filesystems and setting the uidmappings
> to the same ones as the userns.
> 
> > We explicitly don't define a new user namespace mapping nor add a new
> > interface for the simple reason it's: *too complicated*. We can do 
> > that, but no thanks! Maybe in the future if there is a real need or 
> > things are clear... The current user namespace interface is getting 
> > standard and stable, so we just keep it that way and make it
> > consistent inside VFS.
> 
> I don't accept the too complicated point.  For fully unprivileged
> containers, the host admin already has to set up the subuid/subgid map
> files which is most of the complexity.  Once that's done, the same maps
> can be used to shift mount.  Once it's all set up, no further
> intervention is required.

Well, please check my first comment. In this RFC you don't always have
to be the real root or a privileged parent to do so... it allows nesting
since it seems that the maintainers want nesting support.


> > We give VFS control of that, and we make mount namespaces the central
> > 

Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-17 Thread Djalal Harouni
Hi Eric,

On Sat, May 14, 2016 at 09:21:55PM -0500, Eric W. Biederman wrote:
> James Bottomley  writes:
> 
> > On Sat, 2016-05-14 at 10:53 +0100, Djalal Harouni wrote:
> 
> Just a couple of quick comments from a very high level design point.
> 
> - I think a shiftfs is valuable in the same way that overlayfs is
>   valuable.
> 
>   Especially in the Docker case where a lot of containers want a shared
>   base image (for efficiency), but it is desirable to run those
>   containers in different user namespaces for safety.
> 
> - It is also the plan to make it possible to mount a filesystem where
>   the uids and gids of that filesystem on disk do not have a one to one
>   mapping to kernel uids and gids.  99% of the work has already been done,
>   for all filesystems except XFS.
> 
>   That said there are some significant issues to work through, before
>   something like that can be enabled.
> 
>   * Handling of uids/gids on disk that don't map into a kuid/kgid.
>   * Safety from poisoned filesystem images.
> 
>   I have slowly been working with Seth Forshee on these issues as
>   the last thing I want is to introduce more security bugs right now.
>   Seth being a braver man than I am has already merged his changes into
>   the Ubuntu kernel.
> 
>   Right now we are targeting fuse, because fuse is already designed to
>   handle poisoned filesystem images.  So to safely enable this kind of
>   mapping for fuse is not a giant step.

Alright!

>   The big thing from my point of view is to get the VFS interfaces
>   correct so that the VFS handles all of the weird cases that come up
>   with uids and gids that don't map, and any other weird cases.  Keeping
>   the weird bits out of the filesystems.

Indeed, I totally agree here.


> James, Djalal  I regret I have not been able to read through either of
> your patches closely yet.  From a high level view I believe there are
> use cases for both approaches, and the use cases do not necessarily
> overlap.
> 
> Djalal I think you are seeing the upsides and not the practical dangers
> of poisoned filesystem images.

Thanks for your reply Eric, I will let you sleep on the approach. Yes
it's a totally different thing; I think you can consider it as a first
step to use filesystems inside user namespaces safely. Real root is still
the only one who mounts and sets the mount namespace shift flag that can
be inherited by unprivileged userns users. So real root is *still* in
control of things. The solution is flexible. At the same time you have
the fuse patches for those who want to use them for unprivileged mounts;
what comes later depends on the future, the state of the art, and how
things improve...

The real problem seems to be poisoned filesystem images, OK, I agree.
However, this series currently assumes that only real root mounts the
filesystems that will be used for user namespaces.

So nothing really changes, just consider it like this:
1) root of init_user_ns mounts filesystems with mount shift flags and
creates a shift mount namespace.
2) then access is given to inodes whose inode->{uid/gid} match
the inside mapping of the calling process. This is like real root doing
a recursive chown of files to give rwx permission but without hitting the
real disk. Everything is virtual.
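
To make the "virtual chown" idea concrete, here is a small userspace
sketch of my description above; it is not the code from the patches. The
id stored on disk is read as an inside id of the caller's mapping and
shifted to the matching global id only for the ownership check, and
nothing is written back. The struct, the function names and the
0:1000000:65536 mapping in the usage note are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NO_ID ((uint32_t)-1)

/* A single mapping written as inside:outside:count. */
struct id_map {
    uint32_t inside;   /* first id as seen inside the user namespace */
    uint32_t outside;  /* first global id it corresponds to */
    uint32_t count;    /* length of the range */
};

/* Treat the on-disk id as an inside id and shift it to the global id.
 * The on-disk value is only read, so the image itself never changes. */
static uint32_t shifted_owner(const struct id_map *m, uint32_t disk_uid)
{
    assert(m != NULL);
    if (disk_uid < m->inside || disk_uid - m->inside >= m->count)
        return NO_ID;
    return m->outside + (disk_uid - m->inside);
}

/* Ownership check against the shifted view: an on-disk id outside the
 * inside range never matches any caller. */
static bool caller_is_owner(const struct id_map *m,
                            uint32_t caller_global_uid, uint32_t disk_uid)
{
    uint32_t owner = shifted_owner(m, disk_uid);
    return owner != NO_ID && owner == caller_global_uid;
}
```

With a mapping of 0:1000000:65536, a process running as global uid
1000000 would be treated as the owner of files stored with uid 0 on
disk, while files stored with uid 70000 would match nobody.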

So nothing really changes for poisoned filesystems since unprivileged
users can't mount them; only real root is able to do so, and he can verify
the image before doing so...

Now, the problem that I can see is if there are some special inodes
related to these filesystems and host resources that are marked 0400
only for real root; in this case we have to add the needed capability
check, capable in init_user_ns. For ioctls I guess they are already safe
since they should have the appropriate capable check, but I will check
it of course.

Now, as Seth has been working with fuse mounts, and I guess they will be
merged, I will of course check with him so everything is synced and that
this approach will continue to work after his patches are merged.


> James I think you are missing the fact that all filesystems already have
> the make_kuid and make_kgid calls right where the data comes off disk,
> and the from_kuid and from_kgid calls right where the on-disk data is
> being created just before it goes on disk.  Which means that the actual
> impact on filesystems of the translation is trivial.
> 
> Where the actual impact of filesystems is much higher is the
> infrastructure needed to ensure poisoned filesystem images do not cause
> a kernel compromise.  That extends to the filesystem testing and code
> review process beyond and is more than just a kernel problem.  Hardening
> that attack surface of the disk side of filesystems is difficult
> especially when not impacting filesystem performance.
> 
> 
> So I don't think it makes sense to frame this as an either/or situation.
> I think there is a need for both solutions.
> 
> Djalal if you could work with Seth I think that would be very useful.  I

Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-16 Thread James Bottomley
On Sat, 2016-05-14 at 21:21 -0500, Eric W. Biederman wrote:
> James Bottomley  writes:
> 
> > On Sat, 2016-05-14 at 10:53 +0100, Djalal Harouni wrote:
> 
> Just a couple of quick comments from a very high level design point.
> 
> - I think a shiftfs is valuable in the same way that overlayfs is
>   valuable.
> 
>   Especially in the Docker case where a lot of containers want a shared
>   base image (for efficiency), but it is desirable to run those
>   containers in different user namespaces for safety.
> 
> - It is also the plan to make it possible to mount a filesystem where
>   the uids and gids of that filesystem on disk do not have a one to one
>   mapping to kernel uids and gids.  99% of the work has already been done,
>   for all filesystems except XFS.

Can you elaborate a bit more on why we want to do this?  I think only
having a single shift of uid_t to kuid_t across the kernel to user
boundary is a nice feature of user namespaces.  Architecturally, it's
not such a big thing to do it as the data goes on to the disk as well,
but what's the use case for it?

>   That said there are some significant issues to work through, before
>   something like that can be enabled.
> 
>   * Handling of uids/gids on disk that don't map into a kuid/kgid.

So I think this is nicely handled in the capability checks in
generic_permission() (capable_wrt_inode_uidgid()); is there a need to
make it more complex (and thus more error prone)?

>   * Safety from poisoned filesystem images.

By poisoned FS image, you mean an image over whose internal data the
user has control?  The basic problem of how do we give users write
access to data devices they can then cause to be mounted as
filesystems?

>   I have slowly been working with Seth Forshee on these issues as
>   the last thing I want is to introduce more security bugs right now.
>   Seth being a braver man than I am has already merged his changes into
>   the Ubuntu kernel.
> 
>   Right now we are targeting fuse, because fuse is already designed to
>   handle poisoned filesystem images.  So to safely enable this kind of
>   mapping for fuse is not a giant step.
> 
>   The big thing from my point of view is to get the VFS interfaces
>   correct so that the VFS handles all of the weird cases that come up
>   with uids and gids that don't map, and any other weird cases.  Keeping
>   the weird bits out of the filesystems.

If by VFS interfaces, you mean where we've already got the mapping 
confined, absolutely.

> James I think you are missing the fact that all filesystems already 
> have the make_kuid and make_kgid calls right where the data comes off
> disk,

I beg to differ: they certainly don't.  The underlying filesystem
populates the inode in ->lookup with the data off the disk, which goes
into the inode as a kuid_t/kgid_t.  It remains forever in the inode as
that.  We convert it as it goes out of the kernel in the stat calls
(actually stat.c:cp_old/new_stat()).
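
The conversion on the way out of the kernel described above can be modelled in userspace.  What follows is a minimal sketch, not kernel code: the model_ names, the single-extent mapping and the 100000 offset are all assumptions made for illustration.  The stat path reports ids relative to the caller's user namespace and, via from_kuid_munged(), substitutes the overflow id (65534 by default) when the stored kuid_t has no mapping there.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_INVALID_ID  ((uint32_t)-1)  /* stands in for INVALID_UID */
#define MODEL_OVERFLOW_ID 65534u          /* default overflow uid */

/* One uid_map line (hypothetical model): ids [first, first+count) inside
 * the namespace correspond to kernel ids [lower_first, lower_first+count). */
struct model_extent {
    uint32_t first;
    uint32_t lower_first;
    uint32_t count;
};

/* Kernel id -> namespace id, or MODEL_INVALID_ID when nothing maps. */
static uint32_t model_from_kuid(const struct model_extent *ns, uint32_t kuid)
{
    assert(ns->count > 0);
    if (kuid >= ns->lower_first && kuid - ns->lower_first < ns->count)
        return ns->first + (kuid - ns->lower_first);
    return MODEL_INVALID_ID;
}

/* Stat-style reporting: never hand "invalid" to userspace, fall back to
 * the overflow id instead. */
static uint32_t model_stat_uid(const struct model_extent *ns, uint32_t i_uid)
{
    uint32_t uid = model_from_kuid(ns, i_uid);
    return uid == MODEL_INVALID_ID ? MODEL_OVERFLOW_ID : uid;
}
```

In this model the inode value never changes; only the namespace passed to model_stat_uid() decides what the caller sees.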

>  and the from_kuid and from_kgid calls right where the on-disk data
> is being created just before it goes on disk.  Which means that the
> actual impact on filesystems of the translation is trivial.

Are you looking at a different tree from me?  I'm actually just looking
at Linus' git head.

James




Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-16 Thread Seth Forshee
On Mon, May 16, 2016 at 11:42:46AM -0500, Eric W. Biederman wrote:
> Seth Forshee  writes:
> 
> > On Sat, May 14, 2016 at 09:21:55PM -0500, Eric W. Biederman wrote:
> >>   I have slowly been working with Seth Forshee on these issues as
> >>   the last thing I want is to introduce more security bugs right now.
> >>   Seth being a braver man than I am has already merged his changes into
> >>   the Ubuntu kernel.
> >
> > Maybe not quite so brave as you think. I also threw on a patch to
> > disable the feature unless explicitly enabled by a sys admin.
> >
> >> James I think you are missing the fact that all filesystems already have
> >> the make_kuid and make_kgid calls right where the data comes off disk,
> >> and the from_kuid and from_kgid calls right where the on-disk data is
> >> being created just before it goes on disk.  Which means that the actual
> >> impact on filesystems of the translation is trivial.
> >
> > It is fairly simple but there's a bit more than just id conversions to
> > change. With ext4 I found that there were mount options which needed to
> > be restricted, some capability checks to update, and access to external
> > journal devices must be checked. In all it wasn't a whole lot of changes
> > to the filesystem though. Fuse was a bit more involved, but the
> > complexities there won't apply to other filesystems.
> >
> >> Djalal if you could work with Seth I think that would be very useful.  I
> >> know I am dragging my heels there but I really hope I can dig in and get
> >> everything reviewed and merged soonish.
> >
> > That would make me very happy :-)
> 
> It has missed this merge window :( But I am aiming to
> review them and get your patches (or modified versions of your patches)
> into my tree as soon after rc1 as humanly possible.
> 
> Part of that will have to be the fix for mqueuefs, that Docker just hit.

Yeah, I've got a patch that's been tested to fix the bug, so I'll send
new patches which include that before long.

Seth



Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-16 Thread Eric W. Biederman
Seth Forshee  writes:

> On Sat, May 14, 2016 at 09:21:55PM -0500, Eric W. Biederman wrote:
>>   I have slowly been working with Seth Forshee on these issues as
>>   the last thing I want is to introduce more security bugs right now.
>>   Seth being a braver man than I am has already merged his changes into
>>   the Ubuntu kernel.
>
> Maybe not quite so brave as you think. I also threw on a patch to
> disable the feature unless explicitly enabled by a sys admin.
>
>> James I think you are missing the fact that all filesystems already have
>> the make_kuid and make_kgid calls right where the data comes off disk,
>> and the from_kuid and from_kgid calls right where the on-disk data is
>> being created just before it goes on disk.  Which means that the actual
>> impact on filesystems of the translation is trivial.
>
> It is fairly simple but there's a bit more than just id conversions to
> change. With ext4 I found that there were mount options which needed to
> be restricted, some capability checks to update, and access to external
> journal devices must be checked. In all it wasn't a whole lot of changes
> to the filesystem though. Fuse was a bit more involved, but the
> complexities there won't apply to other filesystems.
>
>> Djalal if you could work with Seth I think that would be very useful.  I
>> know I am dragging my heels there but I really hope I can dig in and get
>> everything reviewed and merged soonish.
>
> That would make me very happy :-)

It has missed this merge window :( But I am aiming to
review them and get your patches (or modified versions of your patches)
into my tree as soon after rc1 as humanly possible.

Part of that will have to be the fix for mqueuefs, that Docker just hit.

> I'm happy to look with Djalal for commonalities. I did skim his patches
> before, and based on that all I really expect to find are things related
> to permission checks when ids don't map. The rest seems fundamentally
> different.

Hmm.  Then I may have to look closer at what Djalal is doing.  It
sounded like what you were doing and if not, I will scratch my head.

That said yes.  The biggy is getting the VFS changes to handle all of
the weird translation corner cases etc (that are part of your patches).

Eric



Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-16 Thread Seth Forshee
On Sat, May 14, 2016 at 09:21:55PM -0500, Eric W. Biederman wrote:
>   I have slowly been working with Seth Forshee on these issues as
>   the last thing I want is to introduce more security bugs right now.
>   Seth being a braver man than I am has already merged his changes into
>   the Ubuntu kernel.

Maybe not quite so brave as you think. I also threw on a patch to
disable the feature unless explicitly enabled by a sys admin.

> James I think you are missing the fact that all filesystems already have
> the make_kuid and make_kgid calls right where the data comes off disk,
> and the from_kuid and from_kgid calls right where the on-disk data is
> being created just before it goes on disk.  Which means that the actual
> impact on filesystems of the translation is trivial.

It is fairly simple but there's a bit more than just id conversions to
change. With ext4 I found that there were mount options which needed to
be restricted, some capability checks to update, and access to external
journal devices must be checked. In all it wasn't a whole lot of changes
to the filesystem though. Fuse was a bit more involved, but the
complexities there won't apply to other filesystems.

> Djalal if you could work with Seth I think that would be very useful.  I
> know I am dragging my heels there but I really hope I can dig in and get
> everything reviewed and merged soonish.

That would make me very happy :-)

I'm happy to look with Djalal for commonalities. I did skim his patches
before, and based on that all I really expect to find are things related
to permission checks when ids don't map. The rest seems fundamentally
different.

Seth


Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-15 Thread James Bottomley
On Sat, 2016-05-14 at 21:21 -0500, Eric W. Biederman wrote:
> James if you could see shiftfs with a different set of merits than 
> what Djalal is doing I think that would be useful.  As it would
> allow everyone to concentrate on getting the bugs out of their
> solutions.

Just to reply to this specific point.  Djalal's patches can't actually
work for me because I use subtree based roots rather than whole fs
roots ... it's mostly because I work with image directories, not the
full mounted images themselves.  For stuff I unpack into /home, I could
see having /home on a separate directory and adding the vfs_shift_
flags.  However, I'm not doing (and it would be really unsafe to do)
that for / to get my images that unpack in /var/tmp (like the obs build
roots).

However, half the ugliness of the patch set is that it needs lower
layer FS support because vfs_shift_ are mount flags in the superblock.
If they were made subtree flags instead (so MNT_ flags), I think you
could eliminate the need to modify any underlying filesystems and they
would allow us to mark subtrees for shifting.  The mount command would
need modifying to add them (like it was for --shared and --private) so
we'd need an additional --vfs-shift --ufs-shift to mark the subtree but
then the series would work for bind mounting subtrees, which is what I
need.  And they would work for *any* filesystem without modification.

This would probably be the best of both worlds because it will work
for the docker case as well.

James



Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-14 Thread Eric W. Biederman
James Bottomley  writes:

> On Sat, 2016-05-14 at 10:53 +0100, Djalal Harouni wrote:

Just a couple of quick comments from a very high level design point.

- I think a shiftfs is valuable in the same way that overlayfs is
  valuable.

  Especially in the Docker case where a lot of containers want a shared
  base image (for efficiency), but it is desirable to run those
  containers in different user namespaces for safety.

- It is also the plan to make it possible to mount a filesystem where
  the uids and gids of that filesystem on disk do not have a one to one
  mapping to kernel uids and gids.  99% of the work has already been done,
  for all filesystems except XFS.

  That said there are some significant issues to work through, before
  something like that can be enabled.

  * Handling of uids/gids on disk that don't map into a kuid/kgid.
  * Safety from poisoned filesystem images.

  I have slowly been working with Seth Forshee on these issues as
  the last thing I want is to introduce more security bugs right now.
  Seth being a braver man than I am has already merged his changes into
  the Ubuntu kernel.

  Right now we are targeting fuse, because fuse is already designed to
  handle poisoned filesystem images.  So to safely enable this kind of
  mapping for fuse is not a giant step.

  The big thing from my point of view is to get the VFS interfaces
  correct so that the VFS handles all of the weird cases that come up
  with uids and gids that don't map, and any other weird cases.  Keeping
  the weird bits out of the filesystems.

James, Djalal, I regret I have not been able to read through either of
your patches closely yet.  From a high level view I believe there are
use cases for both approaches, and the use cases do not necessarily
overlap.

Djalal I think you are seeing the upsides and not the practical dangers
of poisoned filesystem images.

James I think you are missing the fact that all filesystems already have
the make_kuid and make_kgid calls right where the data comes off disk,
and the from_kuid and from_kgid calls right where the on-disk data is
being created just before it goes on disk.  Which means that the actual
impact on filesystems of the translation is trivial.
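
The two conversion points named above, make_kuid() where data comes off disk and from_kuid() where it goes back, can be modelled in userspace.  This is a minimal sketch under stated assumptions, not the kernel's code: the model_ names, the single-extent mapping and its values are hypothetical, and the real helpers take a struct user_namespace rather than a bare extent.  It shows both the trivial round trip and the case that is not trivial, an on-disk id with no mapping at all.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_INVALID_ID ((uint32_t)-1)  /* stands in for INVALID_UID */

/* One uid_map line (hypothetical model): ids [first, first+count) as seen
 * by the filesystem correspond to kernel ids starting at lower_first. */
struct model_extent {
    uint32_t first;
    uint32_t lower_first;
    uint32_t count;
};

/* Models make_kuid(): id as stored on disk -> kernel-internal id. */
static uint32_t model_make_kuid(const struct model_extent *ns, uint32_t uid)
{
    assert(ns->count > 0);
    if (uid >= ns->first && uid - ns->first < ns->count)
        return ns->lower_first + (uid - ns->first);
    return MODEL_INVALID_ID;
}

/* Models from_kuid(): kernel-internal id -> id to store on disk. */
static uint32_t model_from_kuid(const struct model_extent *ns, uint32_t kuid)
{
    assert(ns->count > 0);
    if (kuid >= ns->lower_first && kuid - ns->lower_first < ns->count)
        return ns->first + (kuid - ns->lower_first);
    return MODEL_INVALID_ID;
}

/* Returns 1 when an on-disk id survives disk -> kernel -> disk unchanged. */
static int model_roundtrip_ok(const struct model_extent *ns, uint32_t disk_uid)
{
    uint32_t kuid = model_make_kuid(ns, disk_uid);
    if (kuid == MODEL_INVALID_ID)
        return 0;  /* unmapped id: the weird case left for the VFS */
    return model_from_kuid(ns, kuid) == disk_uid;
}
```

With a mapping of 1000 ids, every on-disk id below 1000 round-trips; anything else comes back invalid, and what to do with such an inode is the question the VFS work has to answer.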

Where the actual impact on filesystems is much higher is the
infrastructure needed to ensure poisoned filesystem images do not cause
a kernel compromise.  That extends to the filesystem testing and code
review process and is more than just a kernel problem.  Hardening
the attack surface of the disk side of filesystems is difficult,
especially without impacting filesystem performance.


So I don't think it makes sense to frame this as an either/or situation.
I think there is a need for both solutions.

Djalal if you could work with Seth I think that would be very useful.  I
know I am dragging my heels there but I really hope I can dig in and get
everything reviewed and merged soonish.

James if you could see shiftfs with a different set of merits than what
Djalal is doing I think that would be useful.  As it would allow
everyone to concentrate on getting the bugs out of their solutions.



That said I am not certain shiftfs makes sense without Seth's patches to
handle the weird cases at the VFS level.  What do you do with uids and
gids that don't map?  You can reinvent how to handle the strange cases
in shiftfs or we can work on solving this problem at the VFS level so
people don't have to go through the error prone work of reinventing
solutions.


The big ugly nasty in all of this is that we are fundamentally dealing
with uids and gids which are security identifiers.  Practically any bug
is exploitable and CVE worthy.  So it makes sense to tread very
carefully.  Even with care it can take months if not years to get
the number of bugs down to a level where you are not the favorite target
of people looking for exploitable kernel bugs.
 
Eric


Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-14 Thread James Bottomley
On Sat, 2016-05-14 at 10:53 +0100, Djalal Harouni wrote:
> On Thu, May 12, 2016 at 03:24:12PM -0700, James Bottomley wrote:
> > On Thu, 2016-05-12 at 20:55 +0100, Djalal Harouni wrote:
> > > On Wed, May 11, 2016 at 11:33:38AM -0700, James Bottomley wrote:
> > > > On Wed, 2016-05-11 at 17:42 +0100, Djalal Harouni wrote:
> [...]
> 
> > > > 
> > > > The credentials are per thread, so it's a standard way of doing
> > > > credential shifting and no other threads of execution in the
> > > > same task get access. As long as you bound the
> > > > override_creds()/revert_creds() pairs within the kernel, you're
> > > > safe.
> > > 
> > > No, and here sorry I mean shifted.
> > > 
> > > current_fsuid() is global through all fs operations which means it
> > > crosses user namespaces... it was safe the days of only
> > > init_user_ns, not anymore... You give a mapping inside containers
> > > to fsuid where they don't want to have it... this allows to operate
> > > on inodes inside other containers... update current_fsuid() even if
> > > we want that user to be nobody inside the container... and later it
> > > can access the inodes of the shifted fs... and by same current of
> > > course...
> > 
> > OK, I still don't understand what you're getting at.  There are three
> > per-thread uids: uid, euid and fsuid (real, effective and
> > filesystem).  They're all either settable via syscall or inherited on
> > fork.  They're all kernel side, meaning they're kuid_t.  Their values
> > stay invariant as you move through namespaces.  They change (and get
> > mapped according to the current user namespace setting) when you call
> > set[fe]uid().  So when I enter a user namespace with mapping
> > 
> > 0 10 1000
> > 
> > and call setuid(0) (which sets all three), they all pick up the kuid_t
> > of 10.  This means that writing a file inside the user namespace after
> > calling setuid(0) appears as real uid 10 on the medium even though if I
> > call getuid() from the namespace, I get back 0.  What shiftfs does is
> > hijack temporarily the kernel fsuid/fsgid for permission checks, so you
> > can remap to any old uid on the medium (although usually you'd pass in
> > uidmap=0:10:1000") it maps back from kuid_t 10 to kuid_t 0, which is
> > why the container can now read and write the underlying medium at
> > on-media id 0 even though root inside the container has kuid_t 10.
> > There's no permanent change of fsuid and it stays at its invariant
> > value for the thread except as a temporary measure to do the
> > permission checks on the underlying of the shifted filesystem.
> > 
> > > > > The worst thing is that current_fsuid() does not follow now the
> > > > > /proc/self/uid_map interface! this is a serious vulnerability
> > > > > and a mix of the current semantics... it's updated but using
> > > > > other rules...?
> > > > 
> > > > current_fsuid() is already mapped via the userns; it's already a
> > > > kuid_t at its final value.  Shifting that is what you want to
> > > > remap underlying volume uid/gid's.  The uidmap/gidmap inputs to
> > > > this are shifts on the final underlying uid/gids.
> > > 
> > > => some points:
> > > Changing setfsuid() its interfaces and rules... or an indirect way
> > > to break another syscall...
> > 
> > There is no change to setfsuid().
> > 
> > > The userns used for *mapping* is totally different and not
> > > standard... losing "init_user_ns and its descendants userns
> > > *semantics*...", yet a totally unlinked mapping...
> > 
> > There is no user namespace mapping at all.  This is a simple shift,
> > kernel side, of uids and gids at their kuid_t values.
> > 
> > > Breaking current_uid(),current_euid(),current_fsuid() which are
> > > mapped but in *different* user namespaces... hence different
> > > values inside namespaces... you can change your userns mapping but
> > > that current_fsuid specific one will always be remapped to some
> > > other value inside even if you don't want it... It crosses user
> > > namespaces...  uid and euid are remapped according to
> > > /proc/self/uid_map, fsuid is remapped according to this new
> > > interface...
> > > 
> > > Hard coding the mapping, nested containers/apps may *share* fsuid
> > > and can't get rid of it even if they change the inside userns
> > > mapping to disable, split, reduce mapped users or offer better
> > > isolation they can't... no way to make private inodes inside
> > > containers if they share the final fsuid, inside container mapping
> > > is ignored...
> > > ...
> > 
> > OK, I think there's a misunderstanding about how credential
> > overrides work.  They're not permanent changes to the credentials,
> > they're temporary ones to get stuff done within the kernel at a
> > temporary privilege.  You can make credentials permanent if you 

Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-14 Thread Djalal Harouni
On Thu, May 12, 2016 at 03:24:12PM -0700, James Bottomley wrote:
> On Thu, 2016-05-12 at 20:55 +0100, Djalal Harouni wrote:
> > On Wed, May 11, 2016 at 11:33:38AM -0700, James Bottomley wrote:
> > > On Wed, 2016-05-11 at 17:42 +0100, Djalal Harouni wrote:
[...]

> > > 
> > > > The credentials are per thread, so it's a standard way of doing
> > > > credential shifting and no other threads of execution in the same
> > > > task get access. As long as you bound the
> > > > override_creds()/revert_creds() pairs within the kernel, you're
> > > > safe.
> > 
> > No, and here sorry I mean shifted.
> > 
> > current_fsuid() is global through all fs operations which means it
> > crosses user namespaces... it was safe the days of only init_user_ns,
> > not anymore... You give a mapping inside containers to fsuid where 
> > they don't want to have it... this allows to operate on inodes inside
> > other containers... update current_fsuid() even if we want that user 
> > to be nobody inside the container... and later it can access the 
> > inodes of the shifted fs... and by same current of course...
> 
> OK, I still don't understand what you're getting at.  There are three
> per-thread uids: uid, euid and fsuid (real, effective and filesystem). 
>  They're all either settable via syscall or inherited on fork.  They're
> all kernel side, meaning they're kuid_t.  Their values stay invariant
> as you move through namespaces.  They change (and get mapped according
> to the current user namespace setting) when you call set[fe]uid().  So
> when I enter a user namespace with mapping
> 
> 0 10 1000
> 
> and call setuid(0) (which sets all three), they all pick up the kuid_t
> of 10.  This means that writing a file inside the user namespace
> after calling setuid(0) appears as real uid 10 on the medium even
> though if I call getuid() from the namespace, I get back 0.  What
> shiftfs does is hijack temporarily the kernel fsuid/fsgid for
> permission checks, so you can remap to any old uid on the medium
> (although usually you'd pass in "uidmap=0:10:1000"); it maps back
> from kuid_t 10 to kuid_t 0, which is why the container can now read
> and write the underlying medium at on-media id 0 even though root
> inside the container has kuid_t 10.  There's no permanent change of
> fsuid and it stays at its invariant value for the thread except as a
> temporary measure to do the permission checks on the underlying of the
> shifted filesystem.
> 
> > > > The worst thing is that current_fsuid() does not follow now the
> > > > /proc/self/uid_map interface! this is a serious vulnerability and 
> > > > a mix of the current semantics... it's updated but using other
> > > > rules...?
> > > 
> > > current_fsuid() is already mapped via the userns; it's already a 
> > > kuid_t at its final value.  Shifting that is what you want to remap
> > > underlying volume uid/gid's.  The uidmap/gidmap inputs to this are 
> > > shifts on the final underlying uid/gids.
> > 
> > => some points:
> > Changing setfsuid() its interfaces and rules... or an indirect way to
> > break another syscall...
> 
> There is no change to setfsuid().
> 
> > The userns used for *mapping* is totally different and not
> > standard... losing "init_user_ns and its descendants userns
> > *semantics*...", yet a totally unlinked mapping...
> 
> There is no user namespace mapping at all.  This is a simple shift,
> kernel side, of uids and gids at their kuid_t values.
> 
> > Breaking current_uid(),current_euid(),current_fsuid() which are
> > mapped but in *different* user namespaces... hence different values
> > inside namespaces... you can change your userns mapping but that
> > current_fsuid specific one will always be remapped to some other 
> > value inside even if you don't want it... It crosses user 
> > namespaces...  uid and euid are remapped according to
> > /proc/self/uid_map, fsuid is remapped according to this new
> > interface...
> > 
> > Hard coding the mapping, nested containers/apps may *share* fsuid and
> > can't get rid of it even if they change the inside userns mapping to
> > disable, split, reduce mapped users or offer better isolation they
> > can't... no way to make private inodes inside containers if they 
> > share the final fsuid, inside container mapping is ignored...
> > ...
> 
> OK, I think there's a misunderstanding about how credential overrides
> work.  They're not permanent changes to the credentials, they're
> temporary ones to get stuff done within the kernel at a temporary
> privilege.  You can make credentials permanent if you go through
> prepare_creds()/commit_creds(), but for making them temporary you do
> prepare_creds()/override_creds() and then revert_creds() once you're
> done using them.
> 
> If you want to see a current use of this, try fs/open.c:faccessat. 
>  What it's doing is temporarily overriding fsuid with the real uid to
> check the permissions before reverting the credentials and returning to
> the user.

Thank you for explaining 

Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-12 Thread James Bottomley
On Thu, 2016-05-12 at 20:55 +0100, Djalal Harouni wrote:
> On Wed, May 11, 2016 at 11:33:38AM -0700, James Bottomley wrote:
> > On Wed, 2016-05-11 at 17:42 +0100, Djalal Harouni wrote:
> > > On Tue, May 10, 2016 at 04:36:56PM -0700, James Bottomley wrote:
> [...]
> > > Hmm anyway you are mounting this on behalf of filesystems, so if
> > > you add the recursive thing, you will just probably make
> > > everything worse, by making any /proc, /sys dentry that's under
> > > that path shiftable, and unprivileged users can just create user
> > > namespaces and read /proc/* and all the other stuff that doesn't
> > > have capable() related to the init_user_ns host...
> > 
> > That's up to the admin who does the shifting.  Recursive would be
> > an option if added.
> 
> Hmm, not sure if you get my point... you just made it an admin 
> problem where admins want to mount an image downloaded verify it and 
> use it for their container with /proc...! that's another problem!

You can't allow unprivileged containers to shift uids on arbitrary
filesystems, so the admin always has to do something for the initial
setup.

> > >   what if you have paths like /filesystem0/uidshiftedY/dir,
> > > /filesystem0/uidshiftedX/dir , /filesystem0/notshifted/dir 
> > > where some of them are also bind mounts that point to same
> > > dentry ?
> > 
> > Without recursive semantics, you see the underlying inode.  With 
> > them, you see the upper vfsmnts.  Shiftfs isn't idempotent, so you 
> > would need to be careful about nesting.  However, that's an admin
> > problem.
> > 
> > > Also, you create a totally new user namespace interface here! by 
> > > making your own new interface we just lose the notion of 
> > > init_user_ns and its children and mapping ?
> > 
> > I don't quite understand this; the only use of the init_user_ns is 
> > the capable(CAP_SYS_ADMIN) in fill_super which is how only the real
> > admin can mount at a shifted uid/gid.  Otherwise, there's no need 
> > to see into the userns because filesystems see the kuid_t/kgid_t 
> > which is what I'm shifting.
> > 
> > > I'm not sure of the implication of all this... your user 
> > > namespace mapping is not related at all to init_user_ns! it seems 
> > > that it has its own init_user_ns ?   does a capable() check now 
> > > on a shifted filesystem relates to that and hence to your mapping 
> > > or to the real init_user_ns ?
> > 
> > capable(CAP_SYS_ADMIN) == ns_capable(&init_user_ns, CAP_SYS_ADMIN)
> > 
> > Or is there a misunderstanding here about how user namespaces work
> > inside the kernel?  The design is that the ID shift is done as you
> > cross the kernel boundary, so a filesystem, being usually all
> > in-kernel operating via the VFS interfaces, ideally never needs to
> > make any from_kuid/make_kuid calls.  However, there are ways
> > filesystems can send data across the kernel/user boundary outside of
> > the usual vfs interfaces (ioctls being the most usual one) so in
> > that specific code, they have to do the kuid_t to uid_t changes
> > themselves.  Shiftfs never sends data to the user outside of the
> > VFS so it never needs to do this and can operate entirely on
> > kuid_ts.
> > 
> > > > There's a bit of an open question of whether it should have vfs
> > > > changes: the way the struct file f_inode and f_ops are hijacked 
> > > > is a bit nasty and perhaps d_select_inode() could be made a bit
> > > > cleverer to help us here instead.
> > > 
> > > I'm not sure if this PoC works... but you sure you didn't 
> > > introduce a serious vulnerability here ? you use a new mapping 
> > > and you update current_fsuid() creds up, which is global on any 
> > > fs operation, so may be: lets operate on any inode, update our 
> > > current_fsuid()... and access the rest of *unshifted filesystems*
> > > ... !?
> > 
> > The credentials are per thread, so it's a standard way of doing
> > credential shifting and no other threads of execution in the same
> > task get access. As long as you bound the
> > override_creds()/revert_creds() pairs within the kernel, you're safe.
> 
> No, and here sorry I mean shifted.
> 
> current_fsuid() is global through all fs operations which means it
> crosses user namespaces... it was safe the days of only init_user_ns,
> not anymore... You give a mapping inside containers to fsuid where 
> they don't want to have it... this allows to operate on inodes inside
> other containers... update current_fsuid() even if we want that user 
> to be nobody inside the container... and later it can access the 
> inodes of the shifted fs... and by same current of course...

OK, I still don't understand what you're getting at.  There are three
per-thread uids: uid, euid and fsuid (real, effective and filesystem). 
 They're all either settable via syscall or inherited on fork.  They're
all kernel side, meaning they're kuid_t.  Their values stay invariant
as you move through namespaces.  They change (and get mapped according
to the current 

Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-12 Thread Djalal Harouni
On Wed, May 11, 2016 at 11:33:38AM -0700, James Bottomley wrote:
> On Wed, 2016-05-11 at 17:42 +0100, Djalal Harouni wrote:
> > On Tue, May 10, 2016 at 04:36:56PM -0700, James Bottomley wrote:
[...]
> > Hmm anyway you are mounting this on behalf of filesystems, so if you 
> > add the recursive thing, you will just probably make everything 
> > worse, by making any /proc, /sys dentry that's under that path 
> > shiftable, and unprivileged users can just create user namespaces and 
> > read /proc/* and all the other stuff that doesn't have capable() 
> > related to the init_user_ns host...
> 
> That's up to the admin who does the shifting.  Recursive would be an
> option if added.

Hmm, not sure if you get my point... you just made it an admin problem
where admins want to mount an image downloaded verify it and use it for
their container with /proc...! that's another problem!


> >   what if you have paths like /filesystem0/uidshiftedY/dir,
> > /filesystem0/uidshiftedX/dir , /filesystem0/notshifted/dir 
> > where some of them are also bind mounts that point to same dentry ?
> 
> Without recursive semantics, you see the underlying inode.  With them,
> you see the upper vfsmnts.  Shiftfs isn't idempotent, so you would need
> to be careful about nesting.  However, that's an admin problem.
> 
> > Also, you create a totally new user namespace interface here! by 
> > making your own new interface we just lose the notion of init_user_ns 
> > and its children and mapping ?
> 
> I don't quite understand this; the only use of the init_user_ns is the
> capable(CAP_SYS_ADMIN) in fill_super which is how only the real admin
> can mount at a shifted uid/gid.  Otherwise, there's no need to see into
> the userns because filesystems see the kuid_t/kgid_t which is what I'm
> shifting.
> 
> > I'm not sure of the implication of all this... your user namespace
> > mapping is not related at all to init_user_ns! it seems that it has
> > its own init_user_ns ?   does a capable() check now on a shifted
> > filesystem relates to that and hence to your mapping or to the real
> > init_user_ns ?
> 
> capable(CAP_SYS_ADMIN) == ns_capable(&init_user_ns, CAP_SYS_ADMIN)
> 
> Or is there a misunderstanding here about how user namespaces work
> inside the kernel?  The design is that the ID shift is done as you
> cross the kernel boundary, so a filesystem, being usually all in-kernel
> operating via the VFS interfaces, ideally never needs to make any
> from_kuid/make_kuid calls.  However, there are ways filesystems can
> send data across the kernel/user boundary outside of the usual vfs
> interfaces (ioctls being the most usual one) so in that specific code,
> they have to do the kuid_t to uid_t changes themselves.  Shiftfs never
> sends data to the user outside of the VFS so it never needs to do this
> and can operate entirely on kuid_ts.
> 
> > > There's a bit of an open question of whether it should have vfs
> > > changes: the way the struct file f_inode and f_ops are hijacked is 
> > > a bit nasty and perhaps d_select_inode() could be made a bit 
> > > cleverer to help us here instead.
> > 
> > I'm not sure if this PoC works... but you sure you didn't introduce
> > a serious vulnerability here ? you use a new mapping and you update
> > current_fsuid() creds up, which is global on any fs operation, so may
> > be: lets operate on any inode, update our current_fsuid()... and
> > access the rest of *unshifted filesystems*... !?
> 
> The credentials are per thread, so it's a standard way of doing
> credential shifting and no other threads of execution in the same task
> get access. As long as you bound the override_creds()/revert_creds()
> pairs within the kernel, you're safe.

No, and here sorry I mean shifted.

current_fsuid() is global through all fs operations which means it
crosses user namespaces... it was safe the days of only init_user_ns,
not anymore... You give a mapping inside containers to fsuid where they
don't want to have it... this allows to operate on inodes inside other
containers... update current_fsuid() even if we want that user to be
nobody inside the container... and later it can access the inodes of
the shifted fs... and by same current of course...



> > The worst thing is that current_fsuid() does not follow now the
> > /proc/self/uid_map interface! this is a serious vulnerability and a 
> > mix of the current semantics... it's updated but using other
> > rules...?
> 
> current_fsuid() is already mapped via the userns; it's already a kuid_t
> at its final value.  Shifting that is what you want to remap underlying
> volume uid/gid's.  The uidmap/gidmap inputs to this are shifts on the
> final underlying uid/gids.

=> some points:
Changing setfsuid(), its interfaces and rules... or an indirect way to
break another syscall...

The userns used for *mapping* is totally different and not standard...
losing "init_user_ns and its descendants userns *semantics*...", and
yet a totally unlinked mapping...


Breaking 

Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-12 Thread Djalal Harouni
Hi Dave,

Tried to do my xfs homework first!

On Fri, May 06, 2016 at 12:50:36PM +1000, Dave Chinner wrote:
> On Thu, May 05, 2016 at 11:24:35PM +0100, Djalal Harouni wrote:
> > On Thu, May 05, 2016 at 10:23:14AM +1000, Dave Chinner wrote:
> > > On Wed, May 04, 2016 at 04:26:46PM +0200, Djalal Harouni wrote:
> > > > This is version 2 of the VFS:userns support portable root filesystems
> > > > RFC. Changes since version 1:
> > > > 
> > > > * Update documentation and remove some ambiguity about the feature.
> > > >   Based on Josh Triplett comments.
> > > > * Use a new email address to send the RFC :-)
> > > > 
> > > > 
> > > > This RFC tries to explore how to support filesystem operations inside
> > > > user namespace using only VFS and a per mount namespace solution. This
> > > > allows to take advantage of user namespace separations without
> > > > introducing any change at the filesystems level. All this is handled
> > > > with the virtual view of mount namespaces.
> > > 
> > > [...]
> > > 
> > > > As an example if the mapping 0:65535 inside mount namespace and outside
> > > > is 100:1065536, then 0:65535 will be the range that we use to
> > > > construct UIDs/GIDs mapping into init_user_ns and use it for on-disk
> > > > data. They represent the persistent values that we want to write to the
> > > > disk. Therefore, we don't keep track of any UID/GID shift that was
> > > > applied before, it gives portability and allows to use the previous
> > > > mapping which was freed for another root filesystem...
> > > 
> > > So let me get this straight. Two /isolated/ containers, different
> > > UID/GID mappings, sharing the same files and directories. Create a
> > > new file in a writeable directory in container 1, namespace
> > > information gets stripped from on-disk uid/gid representation.
> > > 
> > > Container 2 then reads that shared directory, finds the file written
> > > by container 1. As there is no namespace component to the uid:gid
> > > stored in the inode, we apply the current namespace shift to the VFS
> > > inode uid/gid and so it maps to root in container 2 and we are
> > > allowed to read it?
> > 
> > Only if container 2 has the flag CLONE_MNTNS_SHIFT_UIDGID set in its own
> > mount namespace which only root can set or if it was already set in
> > parent, and have access to the shared dir which the container manager
> > should also configure before... even with that if in container 2 the
> > shift flag is not set then there is no mapping and things work as they
> > are now, but yes this setup is flawed! they should not share rootfs,
> > maybe in rare cases, some user data that's it.
> 
> 
> 
> I can't follow any of the logic you're explaining - you just
> confused me even more.  I thought this was to allow namespaces with
> different uid/gid mappings all to use the same backing store? And
> now you're saying that "no, they'll all have separate backing
> stores"?

Dave, absolutely: for root file systems, and probably most if not all use
cases, they should have *separate* backing devices, for (1) obvious
security reasons and (2) quota, if they are writing to the filesystem;
otherwise the whole thing is useless.
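
To make the shared-backing-store concern concrete, here is a small
userspace model in Python, not kernel code. The Shift class, the
100000/200000 bases and the 65536 count are made up for illustration;
only the idea that the shift is stripped before the value is written to
disk and re-applied on read comes from the RFC description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shift:
    """Hypothetical per-mount-namespace shift.

    On-disk IDs 0..count-1 correspond to kernel IDs base..base+count-1.
    """
    base: int
    count: int = 65536

    def to_disk(self, kuid: int) -> int:
        # Strip the shift before the value is written to disk.
        if not self.base <= kuid < self.base + self.count:
            raise ValueError(f"kuid {kuid} is outside this mapping")
        return kuid - self.base

    def from_disk(self, disk_uid: int) -> int:
        # Re-apply this namespace's shift when the inode is read.
        if not 0 <= disk_uid < self.count:
            raise ValueError(f"on-disk uid {disk_uid} is outside this mapping")
        return self.base + disk_uid


container1 = Shift(base=100000)
container2 = Shift(base=200000)

# Root in container 1 (kuid 100000) creates a file; 0 is stored on disk.
on_disk = container1.to_disk(100000)

# Container 2 reads the same inode and applies its own shift, so the
# file appears to be owned by container 2's root.
seen_by_c2 = container2.from_disk(on_disk)
print(on_disk, seen_by_c2)
```

Nothing on disk records which container wrote the file, which is why two
containers with different mappings should not share the same writable
backing store.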


> I suspect you need to describe the layering in a way a stupid dummy
> can understand, because trying to be clever with wacky examples is
> not working.

OK, see below please.


> > > Unless I've misunderstood something in this crazy mapping scheme,
> > > isn't this just a vector for unintentional containment breaches?
> > > 
> > > [...]
> > > 
> > > > Simple demo overlayfs, and  btrfs mounted with vfs_shift_uids and
> > > > vfs_shift_gids. The overlayfs mounts will share the same upperdir. We
> > > > create two user namespaces, each one with its own mapping, and where
> > > > container-uid-200 will pull changes from container-uid-100
> > > > upperdir automatically.
> > > 
> > > Ok, forget I asked - it's clearly intentional. This is beyond
> > > crazy, IMO.
> > 
> > This setup is flawed! that example was to show that files show up with
> > the right mapping with two different user namespaces. As Andy noted
> > they should have a backing device...
> 
> Did you mean "should have different backing devices" here? If not,
> I'm even more confused now...

Yes, I mean a separate backing device.

Now some use cases may share *some* backing devices, but then it should
not be the same backing store as the host /.

The container manager should mount a new backing device, maybe make a
snapshot of host / on it and use it for containers.


> > Anyway by the previous paragraph what I mean is that when the container
> > terminates it releases the UID shift range which can be re-used later
> > on another filesystem or on the same previous fs... whatever. Now if
> > the range is already in use, userspace should grab a new range give it
> > a new filesystem or a previous one which doesn't need to be shared and
> > everything should continue to work...
> 
> This sounds like 

Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-11 Thread James Bottomley
On Wed, 2016-05-11 at 17:42 +0100, Djalal Harouni wrote:
> On Tue, May 10, 2016 at 04:36:56PM -0700, James Bottomley wrote:
> > On Thu, 2016-05-05 at 18:08 -0400, James Bottomley wrote:
> [...]
> > > 
> > > OK, so the way attributes are populated on an inode is via 
> > > getattr.  You intercept that, you change the inode owner and 
> > > group that are installed on the inode.  That means that when you 
> > > list the directory, you see the shift and the shifted uid/gid are 
> > > used to check permissions for vfs_open().
> > 
> > Just to illustrate how this could be done, here's a functional 
> > proof of concept for a uid/gid shifting bind mount equivalent. 
> >  It's not actually a proper bind mount because it has to 
> > manufacture its own inodes.  As you can see, it can only be used by 
> > root, it will shift all the uid/gid bits as well as the permission 
> > comparisons.  It operates on subtrees, so it can shift the 
> > uids/gids on any filesystem or part of one and because the shifts 
> > are per superblock, it could actually shift the same subtree for 
> > multiple users on different shifts.  Best of all, it requires no 
> > vfs changes at all, being entirely implemented inside its own
> > filesystem type.
> 
> First, I guess this should be in a separate thread.. this way this 
> RFC was just hijacked!
> 
> Obviously as you say later in your response it may require a VFS
> change... 

I thought it might, but viro didn't rip my head off for shifting the file
operations and inode, so perhaps it's OK as is.

> You have just consumed all inodes... what about containers or small
> apps that are spawned quickly... it can maybe even be used as a DoS...
> maybe you end up reporting different inode numbers... ?

Please explain?  Shiftfs deliberately doesn't populate its dentry
cache, so it basically has the same number of inodes and dentries in use
as the lower filesystem would ordinarily have.

> 
> > You use it just like bind mount:
> > 
> > mount -t shiftfs  
> > 
> > except that it takes uidshift=x:y:z and gidshift=x:y:z multiple
> > times as options.  It's currently not recursive and it definitely
> > needs polishing to show things like mount options and to use Kconfig
> > properly.
> 
> Why is it not recursive? And what if you have circular bind mounts?

Because, as I said, it's a proof of concept.  It can easily have MS_REC
semantics added.
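
As a sketch of how the repeated uidshift=x:y:z and gidshift=x:y:z options
quoted above could be collected, here is a userspace Python model. The
meaning of x, y and z is not spelled out in this thread; the field names
below assume the same "first, lower first, count" order as
/proc/<pid>/uid_map, and parse_shift_options() is a hypothetical helper,
not part of the proof of concept.

```python
from typing import List, NamedTuple


class Extent(NamedTuple):
    first: int        # assumed: first ID as seen through the shifted mount
    lower_first: int  # assumed: first ID on the underlying filesystem
    count: int        # assumed: number of consecutive IDs covered


def parse_shift_options(options: str, key: str = "uidshift") -> List[Extent]:
    """Collect every key=x:y:z entry from a comma-separated option string."""
    extents = []
    for opt in options.split(","):
        name, sep, value = opt.partition("=")
        if not sep or name != key:
            continue  # some other mount option, not ours
        parts = value.split(":")
        if len(parts) != 3:
            raise ValueError(f"expected x:y:z, got {value!r}")
        first, lower_first, count = (int(p) for p in parts)
        if count <= 0:
            raise ValueError("count must be positive")
        extents.append(Extent(first, lower_first, count))
    return extents


opts = "uidshift=0:1000:1,gidshift=0:1000:1,uidshift=1:100000:65535"
uid_extents = parse_shift_options(opts, "uidshift")
gid_extents = parse_shift_options(opts, "gidshift")
```

Allowing the option to repeat is what lets one mount carry several
disjoint ranges, in the same way a uid_map can have several lines.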

> Hmm anyway you are mounting this on behalf of filesystems, so if you 
> add the recursive thing, you will just probably make everything 
> worse, by making any /proc, /sys dentry that's under that path 
> shiftable, and unprivileged users can just create user namespaces and 
> read /proc/* and all the other stuff that doesn't have capable() 
> related to the init_user_ns host...

That's up to the admin who does the shifting.  Recursive would be an
option if added.

>   what if you have paths like /filesystem0/uidshiftedY/dir,
> /filesystem0/uidshiftedX/dir , /filesystem0/notshifted/dir 
> where some of them are also bind mounts that point to same dentry ?

Without recursive semantics, you see the underlying inode.  With them,
you see the upper vfsmnts.  Shiftfs isn't idempotent, so you would need
to be careful about nesting.  However, that's an admin problem.

> Also, you create a totally new user namespace interface here! by 
> making your own new interface we just lose the notion of init_user_ns 
> and its children and mapping ?

I don't quite understand this; the only use of the init_user_ns is the
capable(CAP_SYS_ADMIN) in fill_super which is how only the real admin
can mount at a shifted uid/gid.  Otherwise, there's no need to see into
the userns because filesystems see the kuid_t/kgid_t which is what I'm
shifting.

> I'm not sure of the implication of all this... your user namespace
> mapping is not related at all to init_user_ns! it seems that it has
> its own init_user_ns ?   does a capable() check now on a shifted
> filesystem relates to that and hence to your mapping or to the real
> init_user_ns ?

capable(CAP_SYS_ADMIN) == ns_capable(&init_user_ns, CAP_SYS_ADMIN)

Or is there a misunderstanding here about how user namespaces work
inside the kernel?  The design is that the ID shift is done as you
cross the kernel boundary, so a filesystem, being usually all in-kernel
operating via the VFS interfaces, ideally never needs to make any
from_kuid/make_kuid calls.  However, there are ways filesystems can
send data across the kernel/user boundary outside of the usual vfs
interfaces (ioctls being the most usual one) so in that specific code,
they have to do the kuid_t to uid_t changes themselves.  Shiftfs never
sends data to the user outside of the VFS so it never needs to do this
and can operate entirely on kuid_ts.
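
A toy userspace model of that boundary conversion, in Python rather than
kernel C. The function names are borrowed from the kernel helpers, but
the signatures here are simplified and hypothetical (the real ones take a
struct user_namespace and work on kuid_t/uid_t), and the 0:100000:65536
map is just an example.

```python
from typing import List, Optional, Tuple

# One extent: (first ID inside the namespace, first kernel-global ID, count).
UidMap = List[Tuple[int, int, int]]


def make_kuid(uid_map: UidMap, uid: int) -> Optional[int]:
    """User -> kernel direction: translate a namespace uid to a kuid."""
    for first, lower_first, count in uid_map:
        if first <= uid < first + count:
            return lower_first + (uid - first)
    return None  # unmapped; the kernel uses INVALID_UID for this case


def from_kuid(uid_map: UidMap, kuid: int) -> Optional[int]:
    """Kernel -> user direction: translate a kuid back to a namespace uid."""
    for first, lower_first, count in uid_map:
        if lower_first <= kuid < lower_first + count:
            return first + (kuid - lower_first)
    return None  # unmapped; the kernel returns (uid_t)-1 for this case


container_map: UidMap = [(0, 100000, 65536)]
kuid = make_kuid(container_map, 1000)   # crossing into the kernel
uid = from_kuid(container_map, kuid)    # crossing back out to userspace
```

Code that stays between the two crossings only ever handles the kernel
value, which is why a filesystem that talks to userspace solely through
the VFS does not need either call.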

> > There's a bit of an open question of whether it should have vfs
> > changes: the way the struct file f_inode and f_ops are hijacked is 
> > a bit nasty and perhaps d_select_inode() could be made a bit 
> > cleverer to help us here instead.


Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-11 Thread Djalal Harouni
On Tue, May 10, 2016 at 04:36:56PM -0700, James Bottomley wrote:
> On Thu, 2016-05-05 at 18:08 -0400, James Bottomley wrote:
[...]
> > 
> > OK, so the way attributes are populated on an inode is via getattr. 
> >  You intercept that, you change the inode owner and group that are
> > installed on the inode.  That means that when you list the directory,
> > you see the shift and the shifted uid/gid are used to check 
> > permissions for vfs_open().
> 
> Just to illustrate how this could be done, here's a functional proof of
> concept for a uid/gid shifting bind mount equivalent.  It's not
> actually a proper bind mount because it has to manufacture its own
> inodes.  As you can see, it can only be used by root, it will shift all
> the uid/gid bits as well as the permission comparisons.  It operates on
> subtrees, so it can shift the uids/gids on any filesystem or part of
> one and because the shifts are per superblock, it could actually shift
> the same subtree for multiple users on different shifts.  Best of all,
> it requires no vfs changes at all, being entirely implemented inside
> its own filesystem type.

First, I guess this should be in a separate thread.. this way this RFC
was just hijacked!

Obviously as you say later in your response it may require a VFS
change... 

You have just consumed all inodes... what about containers or small apps
that are spawned quickly... it can maybe even be used as a DoS...  maybe you
end up reporting different inode numbers... ?


> You use it just like bind mount:
> 
> mount -t shiftfs  
> 
> except that it takes uidshift=x:y:z and gidshift=x:y:z multiple times
> as options.  It's currently not recursive and it definitely needs
> polishing to show things like mount options and to use Kconfig
> properly.

Why is it not recursive? And what if you have circular bind mounts?

Hmm anyway you are mounting this on behalf of filesystems, so if you add
the recursive thing, you will just probably make everything worse, by
making any /proc, /sys dentry that's under that path shiftable, and
unprivileged users can just create user namespaces and read /proc/*
and all the other stuff that doesn't have capable() related to the
init_user_ns host...

  what if you have paths like /filesystem0/uidshiftedY/dir,
/filesystem0/uidshiftedX/dir , /filesystem0/notshifted/dir 
where some of them are also bind mounts that point to same dentry ?


Also, you create a totally new user namespace interface here! by making
your own new interface we just lose the notion of init_user_ns and its
children and mapping ?

I'm not sure of the implication of all this... your user namespace
mapping is not related at all to init_user_ns! it seems that it has
its own init_user_ns ?   does a capable() check now on a shifted
filesystem relates to that and hence to your mapping or to the real
init_user_ns ?


> There's a bit of an open question of whether it should have vfs
> changes: the way the struct file f_inode and f_ops are hijacked is a
> bit nasty and perhaps d_select_inode() could be made a bit cleverer to
> help us here instead.

I'm not sure if this PoC works... but are you sure you didn't introduce
a serious vulnerability here? You use a new mapping and you update
current_fsuid() creds up, which is global on any fs operation, so
maybe: let's operate on any inode, update our current_fsuid()... and
access the rest of *unshifted filesystems*... !?

The worst thing is that current_fsuid() no longer follows the
/proc/self/uid_map interface! This is a serious vulnerability and a mix
of the current semantics... it's updated but using other rules...?

For overlayfs I did write an experiment but for me it's not an overlayfs
or another new filesystem problem... we are manipulating UID/GID
identities...

It would have been better if you had sent this as a separate thread.
This was a vfs:userns RFC fix; if we continue like this we turn it into a
complicated thing: implementing yet another new light filesystem with
userns... (overlayfs...)

I will follow up if an appropriate thread is created, not here. I guess
that's OK?

> James
> 

Thank you for your feedback!


-- 
Djalal Harouni
http://opendz.org


Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-11 Thread Djalal Harouni
On Tue, May 10, 2016 at 04:36:56PM -0700, James Bottomley wrote:
> On Thu, 2016-05-05 at 18:08 -0400, James Bottomley wrote:
[...]
> > 
> > OK, so the way attributes are populated on an inode is via getattr. 
> >  You intercept that, you change the inode owner and group that are
> > installed on the inode.  That means that when you list the directory,
> > you see the shift and the shifted uid/gid are used to check 
> > permissions for vfs_open().
> 
> Just to illustrate how this could be done, here's a functional proof of
> concept for a uid/gid shifting bind mount equivalent.  It's not
> actually a proper bind mount because it has to manufacture its own
> inodes.  As you can see, it can only be used by root, it will shift all
> the uid/gid bits as well as the permission comparisons.  It operates on
> subtrees, so it can shift the uids/gids on any filesystem or part of
> one and because the shifts are per superblock, it could actually shift
> the same subtree for multiple users on different shifts.  Best of all,
> it requires no vfs changes at all, being entirely implemented inside
> its own filesystem type.

First, I guess this should be in a separate thread.. this way this RFC
was just hijacked!

Obviously as you say later in your response it may require a VFS
change... 

You have just consumed all inodes... what about containers or small apps
that are spawned quickly... it can even used maybe as a DoS...  maybe you
endup reporting different inode numbers... ?


> You use it just like bind mount:
> 
> mount -t shiftfs  
> 
> except that it takes uidshift=x:y:z and gidshift=x:y:z multiple times
> as options.  It's currently not recursive and it definitely needs
> polishing to show things like mount options and be properly Kconfig
> using.

Why is it not recursive? And what if you have circular bind mounts?

Hmm, anyway you are mounting this on behalf of filesystems, so if you add
the recursive thing, you will probably just make everything worse, by
making any /proc, /sys dentry that's under that path shiftable, and
unprivileged users can just create user namespaces and read /proc/*
and all the other stuff that doesn't have capable() related to the
init_user_ns host...

What if you have paths like /filesystem0/uidshiftedY/dir,
/filesystem0/uidshiftedX/dir, /filesystem0/notshifted/dir
where some of them are also bind mounts that point to the same dentry?


Also, you create a totally new user namespace interface here! By making
your own new interface, do we just lose the notion of init_user_ns and its
children and mappings?

I'm not sure of the implications of all this... your user namespace
mapping is not related at all to init_user_ns! It seems that it has
its own init_user_ns?  Does a capable() check on a shifted
filesystem now relate to that, and hence to your mapping, or to the real
init_user_ns?


> There's a bit of an open question of whether it should have vfs
> changes: the way the struct file f_inode and f_ops are hijacked is a
> bit nasty and perhaps d_select_inode() could be made a bit cleverer to
> help us here instead.

I'm not sure if this PoC works... but are you sure you didn't introduce
a serious vulnerability here? You use a new mapping and you update
the current_fsuid() creds, which are global for any fs operation, so
maybe: let's operate on any inode, update our current_fsuid()... and
access the rest of the *unshifted filesystems*...!?

The worst thing is that current_fsuid() now does not follow the
/proc/self/uid_map interface! This is a serious vulnerability and a mix
of the current semantics... it's updated but using other rules...?

For overlayfs I did write an experiment, but for me it's not an overlayfs
or another new filesystem problem... we are manipulating UID/GID
identities...

It would have been better if you had sent this as a separate thread.
This was a vfs:userns RFC fix which, if we continue, we turn into a
complicated thing: implementing another new light filesystem with
userns... (overlayfs...)

I will follow up if the appropriate thread is created, not here. I guess
that's ok?

> James
> 

Thank you for your feedback!


-- 
Djalal Harouni
http://opendz.org


Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-10 Thread James Bottomley
On Wed, 2016-05-11 at 01:53 +0100, Al Viro wrote:
> On Tue, May 10, 2016 at 04:36:56PM -0700, James Bottomley wrote:
> > +static int shiftfs_rename2(struct inode *olddir, struct dentry
> > *old,
> > +  struct inode *newdir, struct dentry
> > *new,
> > +  unsigned int flags)
> > +{
> > +   struct dentry *rodd = olddir->i_private, *rndd = newdir
> > ->i_private,
> > +   *realold = old->d_inode->i_private,
> > +   *realnew = new->d_inode->i_private;
> > +   struct inode *realolddir = rodd->d_inode, *realnewdir =
> > rndd->d_inode;
> > +   const struct inode_operations *iop = realolddir->i_op;
> > +   int err;
> > +   const struct cred *oldcred, *newcred;
> > +
> > +   oldcred = shiftfs_new_creds(, old->d_sb);
> > +   err = iop->rename2(realolddir, realold, realnewdir,
> > realnew, flags);
> > +   shiftfs_old_creds(oldcred, );
> 
> ... and you've just violated all locking rules for ->rename2().

Yes, sorry, somehow I missed that when I converted everything else to
the vfs_ functions.

> > +static struct dentry *shiftfs_lookup(struct inode *dir, struct
> > dentry *dentry,
> > +unsigned int flags)
> > +{
> > +   struct dentry *real = dir->i_private, *new;
> > +   struct inode *reali = real->d_inode, *newi;
> > +   const struct cred *oldcred, *newcred;
> > +
> > +   /* note: violation of usual fs rules here: dentries are
> > never
> > +* added with d_add.  This is because we want no dentry
> > cache
> > +* for shiftfs.  All lookups proceed through the dentry
> > cache
> > +* of the underlying filesystem, meaning we always see any
> > +* changes in the underlying */
> 
> Bloody wonderful.  So
>   * we lose caching the negative lookups

We do?  They should be cached in the underlying layer's dcache. If
that's not enough, I can hash them, but I was trying to avoid doubling
the dcache size.

>   * we've got buggered hardlinks (different inodes for those)

Yes, had a note to do the lookup, but forgot.

>   * it has never, ever been tried on -next (would do rather nasty
> things on that d_instantiate())

So this is just a proof of concept; I figured it was best to do it
against current rather than have people who wanted to try it pull in
your tree.  I can respin it after the merge window closes.

> 
> > +
> > +   kfree(sfc);
> > +
> > +   return err;
> > +}
> 
> > +   file->f_op = >fop;
> 
> Lovely - now try that with underlying fs something built modular.
> 
> Or try to use it on top of something with non-trivial
> dentry_operations
> (hell, on top of itself, for starters).

So if I add the missing fops_get/put, you're happy with the way this
hijacks f_op and f_inode?

James




Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-10 Thread Al Viro
On Tue, May 10, 2016 at 04:36:56PM -0700, James Bottomley wrote:
> +static int shiftfs_rename2(struct inode *olddir, struct dentry *old,
> +struct inode *newdir, struct dentry *new,
> +unsigned int flags)
> +{
> + struct dentry *rodd = olddir->i_private, *rndd = newdir->i_private,
> + *realold = old->d_inode->i_private,
> + *realnew = new->d_inode->i_private;
> + struct inode *realolddir = rodd->d_inode, *realnewdir = rndd->d_inode;
> + const struct inode_operations *iop = realolddir->i_op;
> + int err;
> + const struct cred *oldcred, *newcred;
> +
> + oldcred = shiftfs_new_creds(, old->d_sb);
> + err = iop->rename2(realolddir, realold, realnewdir, realnew, flags);
> + shiftfs_old_creds(oldcred, );

... and you've just violated all locking rules for ->rename2().

> +static struct dentry *shiftfs_lookup(struct inode *dir, struct dentry 
> *dentry,
> +  unsigned int flags)
> +{
> + struct dentry *real = dir->i_private, *new;
> + struct inode *reali = real->d_inode, *newi;
> + const struct cred *oldcred, *newcred;
> +
> + /* note: violation of usual fs rules here: dentries are never
> +  * added with d_add.  This is because we want no dentry cache
> +  * for shiftfs.  All lookups proceed through the dentry cache
> +  * of the underlying filesystem, meaning we always see any
> +  * changes in the underlying */

Bloody wonderful.  So
* we lose caching the negative lookups
* we've got buggered hardlinks (different inodes for those)
* it has never, ever been tried on -next (would do rather nasty
things on that d_instantiate())

> +
> + kfree(sfc);
> +
> + return err;
> +}

> + file->f_op = >fop;

Lovely - now try that with underlying fs something built modular.

Or try to use it on top of something with non-trivial dentry_operations
(hell, on top of itself, for starters).


Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-10 Thread Al Viro
On Tue, May 10, 2016 at 04:36:56PM -0700, James Bottomley wrote:

> mount -t shiftfs  

Note to self: do not eat while reading l-k...


Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-10 Thread James Bottomley
On Thu, 2016-05-05 at 18:08 -0400, James Bottomley wrote:
> On Thu, 2016-05-05 at 22:49 +0100, Djalal Harouni wrote:
> > On Thu, May 05, 2016 at 07:56:28AM -0400, James Bottomley wrote:
> > > On Thu, 2016-05-05 at 08:36 +0100, Djalal Harouni wrote:
> > > > On Wed, May 04, 2016 at 05:06:19PM -0400, James Bottomley
> > > > wrote:
[...]
> > > > > > So this option was discussed at the recent LSF/MM summit.
> > > > > > The most supported suggestion was that you'd use a new
> > > > > > internal fs type that had a struct mount with a new
> > > > > > superblock and would copy the underlying inodes but
> > > > > > substitute its own with modified ->getattr/->setattr calls
> > > > > > that did the uid shift.  In many ways it would be a remapping
> > > > > > bind which would look similar to overlayfs but be a lot
> > > > > > simpler.
> > > > 
> > > > Hmm, it's not only about ->getattr and ->setattr, you have all 
> > > > the other file system operations that need access too...
> > > 
> > > Why?  Or perhaps we should more cogently define the actual 
> > > problem.   My problem is simply mounting image volumes that were 
> > > created with real uids at user namespace shifted uids because I'm
> > >  downshifting the privileged ids in the container.  I actually 
> > > *only* need the uid/gids on the attributes shifted because that's
> > > what I need to manipulate the
> > >   
> > We need them obviously for read/write/creation... ?!
> 
> OK, so the way attributes are populated on an inode is via getattr. 
>  You intercept that, you change the inode owner and group that are
> installed on the inode.  That means that when you list the directory,
> you see the shift and the shifted uid/gid are used to check 
> permissions for vfs_open().

Just to illustrate how this could be done, here's a functional proof of
concept for a uid/gid shifting bind mount equivalent.  It's not
actually a proper bind mount because it has to manufacture its own
inodes.  As you can see, it can only be used by root, it will shift all
the uid/gid bits as well as the permission comparisons.  It operates on
subtrees, so it can shift the uids/gids on any filesystem or part of
one and because the shifts are per superblock, it could actually shift
the same subtree for multiple users on different shifts.  Best of all,
it requires no vfs changes at all, being entirely implemented inside
its own filesystem type.

You use it just like bind mount:

mount -t shiftfs  

except that it takes uidshift=x:y:z and gidshift=x:y:z multiple times
as options.  It's currently not recursive and it definitely needs
polishing to show things like mount options and to use Kconfig
properly.
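
[Editorial sketch, not part of the original patch.]  The x:y:z option
format can be illustrated in userspace.  Note that the option table in
the patch spells the options uidmap= and gidmap= with a "%u:%u:%u"
pattern.  The meaning of the three fields is not spelled out in the
mail; the sketch below assumes, by analogy with /proc/<pid>/uid_map,
that they are first:lower_first:count.  The struct and function names
are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical holder for one x:y:z triple; the field meaning is an
 * assumption borrowed from the uid_map format. */
struct shift_triple {
	unsigned int first;       /* assumed: first id as seen through the mount */
	unsigned int lower_first; /* assumed: first id on the underlying fs */
	unsigned int count;       /* assumed: number of consecutive ids */
};

/* Parse "x:y:z" into *out.  Returns 0 on success, -1 on malformed input.
 * Note that sscanf's %u is lenient: it accepts leading whitespace and a
 * sign, so a stricter parser would be needed for untrusted input. */
int parse_shift_triple(const char *s, struct shift_triple *out)
{
	unsigned int a, b, c;
	char trailing;

	assert(s != NULL && out != NULL);

	/* Exactly three conversions must succeed; a fourth means there
	 * was trailing garbage after the third number. */
	if (sscanf(s, "%u:%u:%u%c", &a, &b, &c, &trailing) != 3)
		return -1;
	if (c == 0)
		return -1;	/* an empty range maps nothing */

	out->first = a;
	out->lower_first = b;
	out->count = c;
	return 0;
}
```

Under these assumptions, "0:100000:1000" would describe a range of 1000
ids starting at 0 on one side and at 100000 on the other.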

There's a bit of an open question of whether it should have vfs
changes: the way the struct file f_inode and f_ops are hijacked is a
bit nasty and perhaps d_select_inode() could be made a bit cleverer to
help us here instead.

James

---

 fs/Makefile                |   1 +
 fs/shiftfs.c               | 790 +
 include/uapi/linux/magic.h |   2 +
 3 files changed, 793 insertions(+)

diff --git a/fs/Makefile b/fs/Makefile
index 85b6e13..bad03b2 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -128,3 +128,4 @@ obj-y				+= exofs/ # Multiple modules
 obj-$(CONFIG_CEPH_FS)		+= ceph/
 obj-$(CONFIG_PSTORE)		+= pstore/
 obj-$(CONFIG_EFIVAR_FS)		+= efivarfs/
+obj-m				+= shiftfs.o
diff --git a/fs/shiftfs.c b/fs/shiftfs.c
new file mode 100644
index 0000000..b40cdfe
--- /dev/null
+++ b/fs/shiftfs.c
@@ -0,0 +1,790 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct shiftfs_super_info {
+	struct vfsmount *mnt;
+	struct uid_gid_map uid_map, gid_map;
+};
+
+static struct inode *shiftfs_new_inode(struct super_block *sb, umode_t mode,
+				       struct dentry *dentry);
+
+enum {
+	OPT_UIDMAP,
+	OPT_GIDMAP,
+	OPT_LAST,
+};
+
+/* global filesystem options */
+static const match_table_t tokens = {
+	{ OPT_UIDMAP, "uidmap=%u:%u:%u" },
+	{ OPT_GIDMAP, "gidmap=%u:%u:%u" },
+	{ OPT_LAST, NULL }
+};
+
+/*
+ * code stolen from user_namespace.c ... except that these functions
+ * return the same id back if unmapped ... should probably have a
+ * library?
+ */
+static u32 map_id_down(struct uid_gid_map *map, u32 id)
+{
+	unsigned idx, extents;
+	u32 first, last;
+
+	/* Find the matching extent */
+	extents = map->nr_extents;
+	smp_rmb();
+	for (idx = 0; idx < extents; idx++) {
+		first = map->extent[idx].first;
+		last = first + map->extent[idx].count - 1;
+		if (id >= first && id <= last)
+			break;
+	}
+	/* Map the id or note failure */
+	if (idx < extents)
+		id = (id - first) + 
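
[Editorial sketch, not part of the original patch; the listing above is
cut off in the archive.]  The extent lookup that the comment describes
can be illustrated in plain userspace C.  This is a minimal sketch under
stated assumptions: the extent layout (first, lower_first, count) follows
the kernel's struct uid_gid_extent, the sketch_* names are hypothetical,
and the "return the same id back if unmapped" behaviour is taken from
the comment in the patch, not from the missing part of the function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_EXTENTS 5	/* assumed small fixed limit */

/* Hypothetical userspace stand-in for one mapping extent. */
struct sketch_extent {
	uint32_t first;		/* first id of the range being mapped from */
	uint32_t lower_first;	/* first id of the range being mapped to */
	uint32_t count;		/* number of consecutive ids in the range */
};

struct sketch_map {
	uint32_t nr_extents;
	struct sketch_extent extent[SKETCH_MAX_EXTENTS];
};

/* Map id through the first extent that contains it.  An id that falls
 * in no extent is returned unchanged, as the patch comment describes. */
uint32_t sketch_map_id_down(const struct sketch_map *map, uint32_t id)
{
	uint32_t idx;

	assert(map != NULL && map->nr_extents <= SKETCH_MAX_EXTENTS);

	for (idx = 0; idx < map->nr_extents; idx++) {
		const struct sketch_extent *e = &map->extent[idx];

		/* Comparing the offset against count avoids overflow in
		 * first + count - 1 and rejects empty extents. */
		if (id >= e->first && id - e->first < e->count)
			return e->lower_first + (id - e->first);
	}
	return id;	/* unmapped: hand the id back as-is */
}
```

With a single extent { first = 0, lower_first = 100000, count = 1000 },
id 999 is inside the range and is moved by the offset, while id 1000 is
outside it and comes back unchanged.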

Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-10 Thread Djalal Harouni
On Mon, May 09, 2016 at 04:26:30PM +, Serge Hallyn wrote:
> Quoting Djalal Harouni (tix...@gmail.com):
> > Hi,
[...]
> > 
> > After clone(CLONE_NEWUSER|CLONE_NEWNS|CLONE_MNTNS_SHIFT_UIDGID), setup
> > the user namespace mapping, I guess you drop capabilities, do setuid()
> > or whatever and start the PID 1 or the app of the container.
> > 
> > Now and to not confuse more Dave, since he doesn't like the idea of
> > a shared backing device, and me neither for obvious reasons! the shared
> > device should not be used for a rootfs, maybe for read-only user shared
> > data, or shared config, that's it... but for real rootfs they should have
> > their own *different* backing device! unless you know what you are doing
> > hehe I don't want to confuse people, and I just lack time, will also
> > respond to Dave email.
> 
> Yes.  We're saying slightly different things.  You're saying that the admin
> should assign different backing stores for containers.  I'm saying perhaps
> the kernel should enforce that, because $leaks.  Let's say the host admin
> did a perfect setup of a container with shifted uids.  Now he wants to
> run a quick ps in the container...  he does it in a way that leaks a
> /proc/pid reference into the container so that (evil) container root can
> use /proc/pid/root/ to get a toehold into the host /.  Does he now have
> shifted access to that?

No, assuming host / and its other mount points are not mounted with the
vfs_shift_uids and vfs_shift_gids options. In that case no shift is
performed at all.

1) If you mount host / with vfs_shift_uids and vfs_shift_gids, it's
like real root in init_user_ns doing "chmod -R o+rwx /"... It does not make
sense, and since no one can edit/remount mounts to change their options in
the mount namespace of init_user_ns, it's safe, and not available by
default.

2) That's also why filesystems must support this explicitly, and not
have it done on their behalf.

IMO the kernel is already enforcing this, so even if you assign different
backing stores to containers, you can't have shifted access there unless
you explicitly tell the kernel that the mount is meant to be shifted by
adding the vfs_shift_uids and vfs_shift_gids mount options.


> I think if we say "this blockdev will have shifted uids in 
> /proc/$pid/ns/user",
> then immediately that blockdev becomes not-readable (or not-executable)
> in any namespace which does not have /proc/$pid/ns/user as an ancestor.

Hmm,

(1) This won't work, since to do that you have to know
/proc/$pid/ns/user in advance, and since filesystems can't be mounted
inside a user namespace this brings us to the same blocker...! In our
use case we only want to shift UIDs/GIDs to access inodes; there is no
need to expose the whole filesystem, root is responsible and filesystems
stay safe.

(2) Why complicate things? The kernel already supports this, and it's a
generic solution.

As said, you can just create new mount namespaces, mount things there as
private, slave... mount your blockdev that will be shifted by processes
that inherit that mount. You can even have intermediate mount namespaces
that you forget/unref at any moment, that are only used to
perform setup, and that no other process/code can enter... You don't have
any leaks, nothing! You control that piece of code.

If you want that blockdev to become not-readable or noexec in any
namespace which does not have /proc/$pid/ns/user as an ancestor,
the kernel allows a better interface, it allows that blockdev to not
even show up in any ancestor, by making use of mount namespaces and
MS_PRIVATE, MS_SLAVE... no one will even notice if the mount exists.

However if you want to access that blockdev for whatever reason, then
create a new mount namespace and use MS_PRIVATE, MS_SLAVE and all the
noexec flags and mount it.
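
[Editorial sketch, not part of the original mail.]  The steps described
above (new mount namespace, private or slave propagation, then a noexec
mount) could look roughly like this from a privileged setup process.
The function names and the exact flag combination are assumptions; only
unshare(2), mount(2) and the MS_* flags are real interfaces.  It needs
CAP_SYS_ADMIN and does nothing about the uid/gid shifting itself.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <sys/mount.h>

/* Propagation flags for the root of the new mount namespace:
 * recursive, and either slave or private. */
unsigned long propagation_flags(int use_slave)
{
	return MS_REC | (use_slave ? MS_SLAVE : MS_PRIVATE);
}

/* Hypothetical helper: detach into a new mount namespace, stop mount
 * events from propagating back to the parent namespace, then mount dev
 * on target without exec/suid/dev permissions.
 * Returns 0 on success or a negative errno value. */
int mount_in_private_ns(const char *dev, const char *target,
			const char *fstype, int use_slave)
{
	assert(dev != NULL && target != NULL && fstype != NULL);

	if (unshare(CLONE_NEWNS) < 0)
		return -errno;

	/* Change propagation of everything under / in this namespace. */
	if (mount(NULL, "/", NULL, propagation_flags(use_slave), NULL) < 0)
		return -errno;

	/* The mount below is now invisible to the parent namespace. */
	if (mount(dev, target, fstype,
		  MS_NOEXEC | MS_NOSUID | MS_NODEV, NULL) < 0)
		return -errno;

	return 0;
}
```

MS_SLAVE still receives mount events from the parent namespace, while
MS_PRIVATE cuts propagation in both directions; which one fits depends
on whether the setup process should see later host mounts.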

Yes slightly different things, but I don't want to add complexity where
the interface already exists in the kernel...


> With obvious check as in write-versus-execute exclusion that you cannot
> mark the blockdev shifted if ancestor user_ns already has a file open for
> execute.

Please note here that it's the same ancestor who will mark the blockdev
to be shifted, but why would the ancestor at the same time keep a file
open in that filesystem that is meant to be shifted, and later execute
through that fd a program that was just crafted by an untrusted container?!


For me the kernel already offers the interfaces; there is no need to
complicate things or enforce it... As said in other responses, the design
of these patches is to just use what the kernel already provides.



> BTW, perhaps I should do this in a separate email, but here is how I would
> expect to use this:
> 
> 1. Using zfs: I create a bare (unshifted) rootfs fs1.   When I want to
> create a new container, I zfs clone fs1 to fs2, and let the container
> use fs2 shifted.  No danger to fs1 since fs2 is cow.  Same with btrfs.

Yes that would work, since fs1 is unshifted, the only requirement is
that fs2 should not reside on the same backing store of 

Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-10 Thread Djalal Harouni
On Mon, May 09, 2016 at 04:26:30PM +, Serge Hallyn wrote:
> Quoting Djalal Harouni (tix...@gmail.com):
> > Hi,
[...]
> > 
> > After clone(CLONE_NEWUSER|CLONE_NEWNS|CLONE_MNTNS_SHIFT_UIDGID), setup
> > the user namespace mapping, I guess you drop capabilities, do setuid()
> > or whatever and start the PID 1 or the app of the container.
> > 
> > Now and to not confuse more Dave, since he doesn't like the idea of
> > a shared backing device, and me neither for obvious reasons! the shared
> > device should not be used for a rootfs, maybe for read-only user shared
> > data, or shared config, that's it... but for real rootfs they should have
> > their own *different* backing device! unless you know what you are doing
> > hehe I don't want to confuse people, and I just lack time, will also
> > respond to Dave email.
> 
> Yes.  We're saying slightly different things.  You're saying that the admin
> should assign different backing stores for containers.  I'm saying perhaps
> the kernel should enforce that, because $leaks.  Let's say the host admin
> did a perfect setup of a container with shifted uids.  Now he wants to
> run a quick ps in the container...  he does it in a way that leaks a
> /proc/pid reference into the container so that (evil) container root can
> use /proc/pid/root/ to get a toehold into the host /.  Does he now have
> shifted access to that?

No. Assuming host / or its other mount points are not mounted with
vfs_shift_uids and vfs_shift_gids options. In this case no shift is
performed at all.

1) If you mount host / with vfs_shift_uids and vfs_shift_gids it's
like real root in init_user_ns does "chmod -R o+rwx /"... It does not make
sense and since no one can edit/remount mounts to change their options in
the mount namespace of init_user_ns, it's safe, and not available by
default.

2) That's why also filsystems must support this explicitly and not on
their behalf.

IMO the kernel is already enforcing this, so even if you assign different
backing stores to containers, you can't have shifted access there, unless
you explicitly tell the kernel that the mount is mean to be shifted by
adding vfs_shift_uids and vfs_shift_gids mount options.


> I think if we say "this blockdev will have shifted uids in 
> /proc/$pid/ns/user",
> then immediately that blockdev becomes not-readable (or not-executable)
> in any namespace which does not have /proc/$pid/ns/user as an ancestor.

Hmm,

(1) This won't work since to do that you have to know in advance
/proc/$pid/ns/user and since file systems can't be mounted inside user
namespace this brings us to the same blocker ... ! and in our use case
we do want to shift UIDs/GIDs to just access inodes, no need to expose
the whole filesystem, root is responsible and filesystems stay safe.

(2)  Why complicate ? the kernel already supports this! and it's a
generic solution.

As said you can just create new mount namespaces, mount things there
private, slave... mount your blockdev that will be shifted by processes
that inherits that mount, you can even have intermediate mount namespaces
that you will forget/unref at any moment and where they are only used to
perform setup, and no other process/code can enter... You don't have
any leaks nothing! you control that piece of code.

If you want that blockdev to become not-readable or noexec in any
namespace which does not have /proc/$pid/ns/user as an ancestor,
the kernel allows a better interface, it allows that blockdev to not
even show up in any ancestor, by making use of mount namespaces and
MS_PRIVATE, MS_SLAVE... no one will even notice if the mount exists.

However if you want to access that blockdev for whatever reason, then
create a new mount namespace and use MS_PRIVATE, MS_SLAVE and all the
noexec flags and mount it.

Yes slightly different things, but I don't want to add complexity where
the interface already exists in the kernel...


> With obvious check as in write-versus-execute exclusion that you cannot
> mark the blockdev shifted if ancestor user_ns already has a file open for
> execute.

Please note here, that it's the same ancestor who will mark the blockdev
to be shifted, but  why the ancestor will keep at the same time a file
open in that filesystem that is mean to be shifted and later execute
through that fd a program that was just crafted by untrusted container ?!


For me the kernel already offers the interfaces no need to complicate
things or enforce it... As said in other responses, the design of these
patches is to just use what the kernel already provides.



> BTW, perhaps I should do this in a separate email, but here is how I would
> expect to use this:
> 
> 1. Using zfs: I create a bare (unshifted) rootfs fs1.   When I want to
> create a new container, I zfs clone fs1 to fs2, and let the container
> use fs2 shifted.  No danger to fs1 since fs2 is cow.  Same with btrfs.

Yes that would work, since fs1 is unshifted, the only requirement is
that fs2 should not reside on the same backing store of 

Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-09 Thread Serge Hallyn
Quoting Djalal Harouni (tix...@gmail.com):
> Hi,
> 
> On Wed, May 04, 2016 at 11:30:09PM +, Serge Hallyn wrote:
> > Quoting Djalal Harouni (tix...@gmail.com):
> > > This is version 2 of the VFS:userns support portable root filesystems
> > > RFC. Changes since version 1:
> > > 
> > > * Update documentation and remove some ambiguity about the feature.
> > >   Based on Josh Triplett comments.
> > > * Use a new email address to send the RFC :-)
> > > 
> > > 
> > > This RFC tries to explore how to support filesystem operations inside
> > > user namespace using only VFS and a per mount namespace solution. This
> > > allows to take advantage of user namespace separations without
> > > introducing any change at the filesystems level. All this is handled
> > > with the virtual view of mount namespaces.
> > 
> > Given your use case, is there any way we could work in some tradeoffs
> > to protect the host?  What I'm thinking is that containers can all
> > share devices uid-mapped at will, however any device mounted with
> > uid shifting cannot be used by the initial user namespace.  Or maybe
> > just non-executable in that case, as you'll need enough access to
> > the fs to set up the containers you want to run.
> > 
> > So if /dev/sda1 is your host /, you have to use /dev/sda2 as the
> > container rootfs source.  Mount it under /containers with uid
> > shifting.  Now all containers regardless of uid mappings see
> > the shifted fs contents.  But the host root cannot be tricked by
> > files on it, as /dev/sda2 is non-executable as far as it is
> > concerned.
> Of course the whole setup is based on the container manager to setup
> the right mount namespace, clean mounts, etc then pivot root, boot or
> whatever...
> 
> Now I guess we can achieve what you want with MS_SLAVE|MS_REC on / ?
> 
> You create a new mount/pid... namespaces with shift flags, but you are still
> in init_user_ns, you remount your / with MS_SLAVE|MS_REC, then you
> create new mount/pid namespaces with shift flag (two mount namespaces
> here if you don't want to race setting MS_SLAVE flag and creating mount
> namespace and you don't trust other processes... or you want the same nested
> setup...)
> 
> This second new secure mount namespace will be the one that you will use
> to setup the container, device nodes, loops...  fs that you want into the
> container (probably with shift options) and also filesystems that you can't
> mount inside user namespaces nor want them to show up or propagate into
> host, you may also want to umount stuff too or remount to change mount
> options too.., etc anyway here call it the cleaning of the mount namespace.
> 
> Now during this phase, when you mount and prepare these file systems,
> mount them with noexec flag first, then remount later with exec, or delay
> the mounting just before you do a new clone(CLONE_NEWUSER...). During this
> phase the container manager should get the device that you want to be
> shared from input or argument, and it will only mount it and prepare
> it inside new mount namespaces or containers and make sure that it will
> never be propagated back...
> 
> After clone(CLONE_NEWUSER|CLONE_NEWNS|CLONE_MNTNS_SHIFT_UIDGID), setup
> the user namespace mapping, I guess you drop capabilities, do setuid()
> or whatever and start the PID 1 or the app of the container.
> 
> Now and to not confuse more Dave, since he doesn't like the idea of
> a shared backing device, and me neither for obvious reasons! the shared
> device should not be used for a rootfs, maybe for read-only user shared
> data, or shared config, that's it... but for real rootfs they should have
> their own *different* backing device! unless you know what you are doing
> hehe I don't want to confuse people, and I just lack time, will also
> respond to Dave email.

Yes.  We're saying slightly different things.  You're saying that the admin
should assign different backing stores for containers.  I'm saying perhaps
the kernel should enforce that, because $leaks.  Let's say the host admin
did a perfect setup of a container with shifted uids.  Now he wants to
run a quick ps in the container...  he does it in a way that leaks a
/proc/pid reference into the container so that (evil) container root can
use /proc/pid/root/ to get a toehold into the host /.  Does he now have
shifted access to that?

I think if we say "this blockdev will have shifted uids in /proc/$pid/ns/user",
then immediately that blockdev becomes not-readable (or not-executable)
in any namespace which does not have /proc/$pid/ns/user as an ancestor.
With the obvious check, as in write-versus-execute exclusion, that you
cannot mark the blockdev shifted if an ancestor user_ns already has a
file open for execute.
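
To make the proposed rule concrete, here is a minimal toy model in Python,
not kernel code. The names UserNS, in_or_below and may_use are hypothetical,
and it assumes that the namespace the blockdev is shifted for counts as
allowed, together with all of its descendants.

```python
class UserNS:
    """Toy stand-in for a user namespace; only the parent link matters."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def in_or_below(ns, target):
    """Return True if ns is target itself or one of its descendants."""
    while ns is not None:
        if ns is target:
            return True
        ns = ns.parent
    return False


def may_use(ns, shifted_for):
    """Hypothetical policy: a blockdev marked as shifted for one user
    namespace is usable only from that namespace or below it."""
    return in_or_below(ns, shifted_for)


# Example tree: init -> container -> nested, plus an unrelated sibling.
init_ns = UserNS("init_user_ns")
container = UserNS("container", parent=init_ns)
nested = UserNS("nested", parent=container)
sibling = UserNS("sibling", parent=init_ns)

for ns in (init_ns, container, nested, sibling):
    print(ns.name, may_use(ns, container))
```

With this rule the initial user namespace and the sibling are refused,
which is the host protection being asked for, while the container and
anything nested inside it keep access.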

BTW, perhaps I should do this in a separate email, but here is how I would
expect to use this:

1. Using zfs: I create a bare (unshifted) rootfs fs1.   When I want to
create a new container, I zfs clone fs1 to fs2, and let the container
use fs2 shifted.  No danger to fs1 since fs2 is cow.  Same with btrfs.

Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-06 Thread Djalal Harouni
Hi,

On Wed, May 04, 2016 at 11:30:09PM +, Serge Hallyn wrote:
> Quoting Djalal Harouni (tix...@gmail.com):
> > This is version 2 of the VFS:userns support portable root filesystems
> > RFC. Changes since version 1:
> > 
> > * Update documentation and remove some ambiguity about the feature.
> >   Based on Josh Triplett comments.
> > * Use a new email address to send the RFC :-)
> > 
> > 
> > This RFC tries to explore how to support filesystem operations inside
> > user namespace using only VFS and a per mount namespace solution. This
> > allows to take advantage of user namespace separations without
> > introducing any change at the filesystems level. All this is handled
> > with the virtual view of mount namespaces.
> 
> Given your use case, is there any way we could work in some tradeoffs
> to protect the host?  What I'm thinking is that containers can all
> share devices uid-mapped at will, however any device mounted with
> uid shifting cannot be used by the initial user namespace.  Or maybe
> just non-executable in that case, as you'll need enough access to
> the fs to set up the containers you want to run.
> 
> So if /dev/sda1 is your host /, you have to use /dev/sda2 as the
> container rootfs source.  Mount it under /containers with uid
> shifting.  Now all containers regardless of uid mappings see
> the shifted fs contents.  But the host root cannot be tricked by
> files on it, as /dev/sda2 is non-executable as far as it is
> concerned.
Of course the whole setup relies on the container manager to set up
the right mount namespace, clean mounts, etc., then pivot root, boot or
whatever...

Now I guess we can achieve what you want with MS_SLAVE|MS_REC on / ?

You create new mount/pid... namespaces with shift flags while you are
still in init_user_ns, and you remount your / with MS_SLAVE|MS_REC. Then
you create new mount/pid namespaces with the shift flag again (two mount
namespaces here, if you don't want to race between setting the MS_SLAVE
flag and creating the mount namespace and you don't trust other
processes... or if you want the same nested setup...)

This second, secure mount namespace is the one that you will use to set
up the container: device nodes, loops... the filesystems that you want
inside the container (probably with shift options), and also filesystems
that you can't mount inside user namespaces or don't want to show up or
propagate into the host. You may also want to umount things or remount
them to change mount options, etc. Call it the cleaning of the mount
namespace.

Now during this phase, when you mount and prepare these filesystems,
mount them with the noexec flag first, then remount later with exec, or
delay the mounting until just before you do a new
clone(CLONE_NEWUSER...). During this phase the container manager should
get the device that you want to be shared from input or an argument. It
will only mount and prepare it inside new mount namespaces or containers,
and make sure that it is never propagated back...

After clone(CLONE_NEWUSER|CLONE_NEWNS|CLONE_MNTNS_SHIFT_UIDGID), set up
the user namespace mapping; I guess you then drop capabilities, do
setuid() or whatever, and start the PID 1 or the app of the container.
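
The mount part of these steps can be sketched roughly as follows. This is
only an illustration under stated assumptions: plan() and run() are
hypothetical helper names, the flag values are copied from the Linux
headers, and the sketch stops before the final clone() because
CLONE_MNTNS_SHIFT_UIDGID and the vfs_shift_* mount options exist only in
this RFC, not in mainline kernels.

```python
import ctypes
import os

# Flag values as defined in <sys/mount.h> and <sched.h> on Linux.
MS_NOEXEC = 8
MS_REMOUNT = 32
MS_REC = 16384
MS_SLAVE = 1 << 19
CLONE_NEWNS = 0x00020000


def plan(device, target, fstype="ext4"):
    """Return the mount(2) calls as (source, target, fstype, flags) tuples."""
    return [
        # 1. Make / a recursive slave so nothing propagates back to the host.
        (None, "/", None, MS_SLAVE | MS_REC),
        # 2. Mount the container filesystem noexec while it is prepared.
        (device, target, fstype, MS_NOEXEC),
        # 3. Remount without noexec just before the container is started.
        (None, target, None, MS_REMOUNT),
    ]


def run(steps):
    """Execute the plan in a new mount namespace (needs CAP_SYS_ADMIN)."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mount.argtypes = (ctypes.c_char_p, ctypes.c_char_p,
                           ctypes.c_char_p, ctypes.c_ulong, ctypes.c_void_p)
    if libc.unshare(CLONE_NEWNS) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    for source, target, fstype, flags in steps:
        encoded = [s.encode() if s is not None else None
                   for s in (source, target, fstype)]
        if libc.mount(*encoded, flags, None) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err), target)
```

A real container manager would do its preparation work between steps 2
and 3; keeping the plan separate from run() just makes it possible to
inspect the sequence without root, for example
plan("/dev/sda2", "/containers").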

Now, so as not to confuse Dave any further: he doesn't like the idea of
a shared backing device, and neither do I, for obvious reasons! The
shared device should not be used for a rootfs. Maybe for read-only shared
user data or shared config, that's it... Real rootfs should each have
their own *different* backing device, unless you know what you are
doing. I don't want to confuse people, and I just lack time; I will also
respond to Dave's email.


> Just a thought.

Do you think it will solve the case?


Thanks for your comments!

-- 
Djalal Harouni
http://opendz.org


Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-05 Thread Dave Chinner
On Thu, May 05, 2016 at 11:24:35PM +0100, Djalal Harouni wrote:
> On Thu, May 05, 2016 at 10:23:14AM +1000, Dave Chinner wrote:
> > On Wed, May 04, 2016 at 04:26:46PM +0200, Djalal Harouni wrote:
> > > This is version 2 of the VFS:userns support portable root filesystems
> > > RFC. Changes since version 1:
> > > 
> > > * Update documentation and remove some ambiguity about the feature.
> > >   Based on Josh Triplett comments.
> > > * Use a new email address to send the RFC :-)
> > > 
> > > 
> > > This RFC tries to explore how to support filesystem operations inside
> > > user namespace using only VFS and a per mount namespace solution. This
> > > allows to take advantage of user namespace separations without
> > > introducing any change at the filesystems level. All this is handled
> > > with the virtual view of mount namespaces.
> > 
> > [...]
> > 
> > > As an example if the mapping 0:65535 inside mount namespace and outside
> > > is 100:1065536, then 0:65535 will be the range that we use to
> > > construct UIDs/GIDs mapping into init_user_ns and use it for on-disk
> > > data. They represent the persistent values that we want to write to the
> > > disk. Therefore, we don't keep track of any UID/GID shift that was applied
> > > before, it gives portability and allows to use the previous mapping
> > > which was freed for another root filesystem...
> > 
> > So let me get this straight. Two /isolated/ containers, different
> > UID/GID mappings, sharing the same files and directories. Create a
> > new file in a writeable directory in container 1, namespace
> > information gets stripped from on-disk uid/gid representation.
> > 
> > Container 2 then reads that shared directory, finds the file written
> > by container 1. As there is no namespace component to the uid:gid
> > stored in the inode, we apply the current namespace shift to the VFS
> > inode uid/gid and so it maps to root in container 2 and we are
> > allowed to read it?
> 
> Only if container 2 has the flag CLONE_MNTNS_SHIFT_UIDGID set in its own
> mount namespace which only root can set or if it was already set in
> parent, and have access to the shared dir which the container manager
> should also configure before... even with that if in container 2 the
> shift flag is not set then there is no mapping and things work as they
> are now, but yes this setup is flawed! they should not share rootfs,
> maybe in rare cases, some user data that's it.
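
The scenario in question can be written down as a toy model. This is a
sketch of the scheme as described in this thread, not of the patches
themselves: ShiftedNamespace is a hypothetical name, and the bases 100000
and 200000 are made-up example mappings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShiftedNamespace:
    """Maps uids [0, count) inside to [base, base + count) in the kernel."""
    base: int
    count: int = 65536

    def to_disk(self, kuid):
        """Kernel uid -> on-disk uid: the namespace offset is dropped."""
        inside = kuid - self.base
        if not 0 <= inside < self.count:
            raise ValueError(f"kuid {kuid} is not mapped in this namespace")
        return inside

    def from_disk(self, disk_uid):
        """On-disk uid -> kernel uid: the reader's own shift is applied."""
        if not 0 <= disk_uid < self.count:
            raise ValueError(f"disk uid {disk_uid} is outside the range")
        return self.base + disk_uid


container1 = ShiftedNamespace(base=100000)
container2 = ShiftedNamespace(base=200000)

# Root in container 1 (kernel uid 100000) creates a file.
on_disk = container1.to_disk(100000)

# Container 2 reads the same inode and applies its own shift.
seen_by_2 = container2.from_disk(on_disk)

print(on_disk, seen_by_2)
```

Since nothing about container 1 is stored on disk, the file ends up owned
by the first uid of container 2's range, i.e. by root in container 2.
That is only safe when the two containers do not share a writable
backing store.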



I can't follow any of the logic you're explaining - you just
confused me even more.  I thought this was to allow namespaces with
different uid/gid mappings all to use the same backing store? And
now you're saying that "no, they'll all have separate backing
stores"?

I suspect you need to describe the layering in a way a stupid dummy
can understand, because trying to be clever with wacky examples is
not working.

> > Unless I've misunderstood something in this crazy mapping scheme,
> > isn't this just a vector for unintentional containment breaches?
> > 
> > [...]
> > 
> > > Simple demo overlayfs, and  btrfs mounted with vfs_shift_uids and
> > > vfs_shift_gids. The overlayfs mounts will share the same upperdir. We
> > > create two user namespaces every one with its own mapping and where
> > > container-uid-200 will pull changes from container-uid-100
> > > upperdir automatically.
> > 
> > Ok, forget I asked - it's clearly intentional. This is beyond
> > crazy, IMO.
> 
> This setup is flawed! that example was to show that files show up with
> the right mapping with two different user namespaces. As Andy noted
> they should have a backing device...

Did you mean "should have different backing devices" here? If not,
I'm even more confused now...

> Anyway by the previous paragraph what I mean is that when the container
> terminates it releases the UID shift range which can be re-used later
> on another filesystem or on the same previous fs... whatever. Now if
> the range is already in use, userspace should grab a new range give it
> a new filesystem or a previous one which doesn't need to be shared and
> everything should continue to work...

This sounds like you're talking about a set of single, sequential
uses of a single filesystem image across multiple different
container lifecycles? Maybe that's where I'm getting confused,
because I'm assuming multiple concurrent uses of a single filesystem
by all the running containers that are running the same distro
image.

> simple example with loop devices..., however the image should be a GPT
> (GUID partition table) or an MBR one...
> 
> $ dd if=/dev/zero of=/tmp/fedora-newtree.raw bs=10M count=100
> $ mkfs.ext4 /tmp/fedora-newtree.raw
> ...
> $ sudo mount -t ext4 -oloop,rw,sync /var/lib/machines/fedora-newtree.raw /mnt/fedora-tree
> $ sudo yum -y --releasever=23 --installroot=/mnt/fedora-tree --disablerepo='*' --enablerepo=fedora install systemd passwd yum fedora-release vim
> $ sudo mount -t ext4 -oloop,vfs_shift_uids,vfs_shift_gids, 
> 

Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-05 Thread Djalal Harouni
On Wed, May 04, 2016 at 06:44:14PM -0700, Andy Lutomirski wrote:
> On Wed, May 4, 2016 at 5:23 PM, Dave Chinner  wrote:
> > On Wed, May 04, 2016 at 04:26:46PM +0200, Djalal Harouni wrote:
> >> This is version 2 of the VFS:userns support portable root filesystems
> >> RFC. Changes since version 1:
> >>
> >> * Update documentation and remove some ambiguity about the feature.
> >>   Based on Josh Triplett comments.
> >> * Use a new email address to send the RFC :-)
> >>
> >>
> >> This RFC tries to explore how to support filesystem operations inside
> >> user namespace using only VFS and a per mount namespace solution. This
> >> allows to take advantage of user namespace separations without
> >> introducing any change at the filesystems level. All this is handled
> >> with the virtual view of mount namespaces.
> >
> > [...]
> >
> >> As an example if the mapping 0:65535 inside mount namespace and outside
> >> is 100:1065536, then 0:65535 will be the range that we use to
> >> construct UIDs/GIDs mapping into init_user_ns and use it for on-disk
> >> data. They represent the persistent values that we want to write to the
> >> disk. Therefore, we don't keep track of any UID/GID shift that was applied
> >> before, it gives portability and allows to use the previous mapping
> >> which was freed for another root filesystem...
> >
> > So let me get this straight. Two /isolated/ containers, different
> > UID/GID mappings, sharing the same files and directories. Create a
> > new file in a writeable directory in container 1, namespace
> > information gets stripped from on-disk uid/gid representation.
> 
> I think the intent is a totally separate superblock for each
> container.  Djalal, am I right?

Absolutely, that would be ideal: each container mounts its image device
into the new mount namespace, setting up the right private/slave flags,
with no propagation into the host... Using GPT, lvm, loop or any other
backing device, the mount will show up only inside the container...

Now as you know we can't prevent all flawed solutions. The one thing I
made sure of is that the flag CLONE_MNTNS_SHIFT_UIDGID can only be set
by real root.


> The feature that seems to me to be missing is the ability to squash
> uids.  I can imagine desktop distros wanting to mount removable
> storage such that everything shows up (to permission checks and
> stat()) as the logged-in user's uid but that the filesystem sees 0:0.
> That can be done by shifting, but the distro would want everything
> else on the filesystem to show up as the logged-in user as well.
> 
> That use case could also be handled by adding a way to tell a given
> filesystem to completely opt out of normal access control rules and
> just let a given user act as root wrt that filesystem (and be nosuid,
> of course).  This would be a much greater departure from current
> behavior, but would let normal users chown things on a removable
> device, which is potentially nice.

Ok Andy, this one is hard... I gave it some thought, and what do you
think of the following?
It will work only if you are referring to some high-level software in
distros, which of course seems perfect for normal users.

So the software should do:

1) Mount the removable storage with vfs_shift_uids and vfs_shift_gids.
2) Now the software should act as a container and make a
clone4(CLONE_NEWUSER|CLONE_NEWNS|CLONE_MNTNS_SHIFT_UIDGID, ...)

=> Set up the right mapping so we are able to access files...

   The mount will show up in the new mount namespace.

3) Now inside the new namespaces we are able to access all files.

4) Use the values returned by stat(), and shift them back to the
   logged-in user's values...

The software set up the mapping, so it already knows who maps to whom!

This allows showing the results of stat() as if they belonged to the
normal logged-in user, where everything works as you have described. So
maybe this has its place in a small userspace helper library that all
software can use?! Thoughts?
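
A rough idea of what the "shift back" part of such a helper could look
like, as a sketch only: parse_uid_map() and to_outside() are hypothetical
names, the mapping text uses the same three-column layout as
/proc/<pid>/uid_map (inside start, outside start, length), and uid 1000
stands in for the logged-in user.

```python
def parse_uid_map(text):
    """Parse uid_map-style lines into (inside, outside, count) tuples."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, count = (int(field) for field in line.split())
            entries.append((inside, outside, count))
    return entries


def to_outside(uid, entries):
    """Translate a uid seen inside the namespace back to the outside uid.

    Returns None when the uid is not covered by any extent."""
    for inside, outside, count in entries:
        if inside <= uid < inside + count:
            return outside + (uid - inside)
    return None


# The helper wrote this mapping itself: root inside is uid 1000 outside.
mapping = parse_uid_map("0 1000 1\n")

# stat() inside the namespace reports st_uid == 0 for files on the device.
print(to_outside(0, mapping))
```

Because the helper created the mapping, it can apply this translation to
every stat() result before showing it to the desktop user.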

> --Andy

Thanks!

-- 
Djalal Harouni
http://opendz.org


Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-05 Thread Djalal Harouni
On Thu, May 05, 2016 at 10:23:14AM +1000, Dave Chinner wrote:
> On Wed, May 04, 2016 at 04:26:46PM +0200, Djalal Harouni wrote:
> > This is version 2 of the VFS:userns support portable root filesystems
> > RFC. Changes since version 1:
> > 
> > * Update documentation and remove some ambiguity about the feature.
> >   Based on Josh Triplett comments.
> > * Use a new email address to send the RFC :-)
> > 
> > 
> > This RFC tries to explore how to support filesystem operations inside
> > user namespace using only VFS and a per mount namespace solution. This
> > allows to take advantage of user namespace separations without
> > introducing any change at the filesystems level. All this is handled
> > with the virtual view of mount namespaces.
> 
> [...]
> 
> > As an example if the mapping 0:65535 inside mount namespace and outside
> > is 100:1065536, then 0:65535 will be the range that we use to
> > construct UIDs/GIDs mapping into init_user_ns and use it for on-disk
> > data. They represent the persistent values that we want to write to the
> > disk. Therefore, we don't keep track of any UID/GID shift that was applied
> > before, it gives portability and allows to use the previous mapping
> > which was freed for another root filesystem...
> 
> So let me get this straight. Two /isolated/ containers, different
> UID/GID mappings, sharing the same files and directories. Create a
> new file in a writeable directory in container 1, namespace
> information gets stripped from on-disk uid/gid representation.
> 
> Container 2 then reads that shared directory, finds the file written
> by container 1. As there is no namespace component to the uid:gid
> stored in the inode, we apply the current namespace shift to the VFS
> inode uid/gid and so it maps to root in container 2 and we are
> allowed to read it?

Only if container 2 has the CLONE_MNTNS_SHIFT_UIDGID flag set in its own
mount namespace, which only root can set unless it was already set in
the parent, and only if it has access to the shared dir, which the
container manager must also have configured beforehand... Even then, if
the shift flag is not set in container 2, there is no mapping and things
work as they do now. But yes, this setup is flawed! Containers should not
share a rootfs; in rare cases maybe some user data, and that's it.
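
To make the scenario concrete, here is a simplified model of the behaviour
described above. It is only a reading of the thread, not code from the
patches: view_uid() is a hypothetical function, and the host range bases
1000000 and 2000000 are assumed example values.

```python
def view_uid(disk_uid, host_base, count=65536, shift=False):
    """Return the uid a container sees for a file, or None if unmapped.

    host_base: first host uid of the container's user namespace range.
    shift:     whether the proposed mount namespace shift flag is set.
    """
    if shift:
        # Shifted: the on-disk value is taken as the in-container value.
        return disk_uid if 0 <= disk_uid < count else None
    # Not shifted: the on-disk value is a host uid and must fall in range.
    inside = disk_uid - host_base
    return inside if 0 <= inside < count else None


# A file stored on disk as uid 0, seen from two differently mapped containers.
seen_by_c1 = view_uid(0, host_base=1000000, shift=True)    # root in container 1
seen_by_c2 = view_uid(0, host_base=2000000, shift=True)    # root in container 2
seen_unshifted = view_uid(0, host_base=2000000, shift=False)  # no mapping
```

With the flag set in both containers, the same on-disk owner appears as
root in each of them, which is exactly why sharing a rootfs between them
is a flawed setup.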


> Unless I've misunderstood something in this crazy mapping scheme,
> isn't this just a vector for unintentional containment breaches?
> 
> [...]
> 
> > Simple demo overlayfs, and btrfs mounted with vfs_shift_uids and
> > vfs_shift_gids. The overlayfs mounts will share the same upperdir. We
> > create two user namespaces every one with its own mapping and where
> > container-uid-200 will pull changes from container-uid-100
> > upperdir automatically.
> 
> Ok, forget I asked - it's clearly intentional. This is beyond
> crazy, IMO.

This setup is flawed! That example was only meant to show that files
show up with the right mapping in two different user namespaces. As Andy
noted, they should have a backing device...

Anyway, what I meant by the previous paragraph is that when the container
terminates, it releases the UID shift range, which can be re-used later
on another filesystem or on the same previous fs... whatever. Now if
the range is already in use, userspace should grab a new range and give
it a new filesystem, or a previous one which doesn't need to be shared,
and everything should continue to work...
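
The range handling described above could be sketched in userspace roughly
as follows. ShiftRangePool and its parameters are hypothetical; the bases
are assumed example values, and a real container manager would track
ranges in its own way.

```python
class ShiftRangePool:
    """Hands out non-overlapping host uid ranges to containers."""

    def __init__(self, first_base=1000000, size=65536, slots=3):
        self.size = size
        # Candidate range bases, spaced far enough apart not to overlap.
        self.free = [first_base + i * 1000000 for i in range(slots)]
        self.in_use = {}  # container name -> base of its range

    def acquire(self, container):
        """Give the container an unused range, or fail if none is left."""
        if not self.free:
            raise RuntimeError("no free uid range")
        base = self.free.pop(0)
        self.in_use[container] = base
        return base

    def release(self, container):
        """Return the range when the container terminates.

        Nothing about the shift was written to disk, so the range can be
        handed to another container and filesystem right away.
        """
        self.free.insert(0, self.in_use.pop(container))
```

The point of the design is that the filesystem image never records which
range was used, so the pool has no per-image state to clean up.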


A simple example with loop devices follows; however, the image should be
a GPT (GUID partition table) or an MBR one...

$ dd if=/dev/zero of=/tmp/fedora-newtree.raw bs=10M count=100
$ mkfs.ext4 /tmp/fedora-newtree.raw
...
$ sudo mount -t ext4 -oloop,rw,sync /var/lib/machines/fedora-newtree.raw \
      /mnt/fedora-tree
$ sudo yum -y --releasever=23 --installroot=/mnt/fedora-tree --disablerepo='*' \
      --enablerepo=fedora install systemd passwd yum fedora-release vim
$ sudo mount -t ext4 -oloop,vfs_shift_uids,vfs_shift_gids \
      /var/lib/machines/fedora-newtree.raw /mnt/fedora-tree
$ sudo ~/container --uidmap [100:1065536 or
                             200:2065536 or
                             300:3065536]
  (That's the mapping outside of the container)



> > 3) ROADMAP:
> > ===
> > * Confirm current design, and make sure that the mapping is done
> >   correctly.
> 
> How are you going to ensure that all filesystems behave the same,
> and it doesn't get broken by people who really don't care about this
> sort of crazy?

By trying to make this a VFS mount namespace parameter. So if the
shift is not set on the mount namespace, we just fall back to the
current behaviour: no shift is performed.

Later, of course, I'll run xfstests and several other tests...

Does this answer your question?


> FWIW, having the VFS convert things to "on-disk format" is an
> oxymoron - the "V" in VFS means "virtual" and has nothing to do with
> disks or persistent storage formats. Indeed, let's convert the UID
> to "on-disk" format for a network filesystem client
hehe! 

Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-05 Thread James Bottomley
On Thu, 2016-05-05 at 22:49 +0100, Djalal Harouni wrote:
> On Thu, May 05, 2016 at 07:56:28AM -0400, James Bottomley wrote:
> > On Thu, 2016-05-05 at 08:36 +0100, Djalal Harouni wrote:
> > > On Wed, May 04, 2016 at 05:06:19PM -0400, James Bottomley wrote:
> > > > On Wed, 2016-05-04 at 16:26 +0200, Djalal Harouni wrote:
> > > > > This is version 2 of the VFS:userns support portable root
> > > > > filesystems
> > > > > RFC. Changes since version 1:
> > > > > 
> > > > > * Update documentation and remove some ambiguity about the
> > > > > feature.   Based on Josh Triplett comments.
> > > > > * Use a new email address to send the RFC :-)
> > > > > 
> > > > > 
> > > > > This RFC tries to explore how to support filesystem 
> > > > > operations inside user namespace using only VFS and a per 
> > > > > mount namespace solution. This allows to take advantage of 
> > > > > user namespace separations without introducing any change at 
> > > > > the filesystems level. All this is handled with the virtual 
> > > > > view of mount namespaces.
> > > > > 
> > > > > 
> > > > > 1) Presentation:
> > > > > 
> > > > > 
> > > > > The main aim is to support portable root filesystems and 
> > > > > allow containers, virtual machines and other cases to use the 
> > > > > same root filesystem. Due to security reasons, filesystems 
> > > > > can't be mounted inside user namespaces, and mounting them 
> > > > > outside will not solve the problem since they will show up 
> > > > > with the wrong UIDs/GIDs. Read and write operations will also
> > > > > fail and so on.
> > > > > 
> > > > > The current userspace solution is to automatically chown the 
> > > > > whole root filesystem before starting a container, example:
> > > > > (host) init_user_ns  100:1065536  => (container)
> > > > > user_ns_X1
> > > > > 0:65535
> > > > > (host) init_user_ns  200:2065536  => (container)
> > > > > user_ns_Y1
> > > > > 0:65535
> > > > > (host) init_user_ns  300:3065536  => (container)
> > > > > user_ns_Z1
> > > > > 0:65535
> > > > > ...
> > > > > 
> > > > > Every time a chown is called, files are changed and so on... 
> > > > > This prevents to have portable filesystems where you can 
> > > > > throw anywhere and boot. Having an extra step to adapt the
> > > > > filesystem to the current mapping and persist it will not 
> > > > > allow to verify its integrity, it makes snapshots and 
> > > > > migration a bit harder, and probably other limitations...
> > > > > 
> > > > > It seems that there are multiple ways to allow user 
> > > > > namespaces combine nicely with filesystems, but none of them 
> > > > > is that easy. The bind mount and pin the user namespace 
> > > > > during mount time will not work, bind mounts share the same 
> > > > > super block, hence you may end up working on the wrong 
> > > > > vfsmount context and there is no easy way to get out of
> > > > > that...
> > > > 
> > > > So this option was discussed at the recent LSF/MM summit.  The 
> > > > most supported suggestion was that you'd use a new internal fs 
> > > > type that had a struct mount with a new superblock and would 
> > > > copy the underlying inodes but substitute its own with
> > > > modified ->getattr/->setattr calls that did the uid shift.
> > > > In many ways it would be a remapping bind which would look
> > > > similar to overlayfs but be a lot simpler.
> > > 
> > > Hmm, it's not only about ->getattr and ->setattr, you have all 
> > > the other file system operations that need access too...
> > 
> > Why?  Or perhaps we should more cogently define the actual problem.
> >   My problem is simply mounting image volumes that were created 
> > with real uids at user namespace shifted uids because I'm
> >  downshifting the privileged ids in the container.  I actually 
> > *only* need the uid/gids on the attributes shifted because that's 
> > what I need to manipulate the
> >   
> We need them obviously for read/write/creation... ?!

OK, so the way attributes are populated on an inode is via getattr.
You intercept that, you change the inode owner and group that are
installed on the inode.  That means that when you list the directory,
you see the shift and the shifted uid/gid are used to check permissions
for vfs_open().
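
As a rough userspace model of that idea (not kernel code): the owner and
group are shifted once, when attributes are populated, and every later
check simply uses the shifted values. Attr, getattr_shifted() and
owner_may_read() are hypothetical names, and the shift of 1000000 is an
assumed example value.

```python
import stat
from collections import namedtuple

# The attributes a permission check cares about.
Attr = namedtuple("Attr", "uid gid mode")


def getattr_shifted(lower, shift):
    """Populate the upper attributes from the lower ones, shifting IDs."""
    return Attr(lower.uid + shift, lower.gid + shift, lower.mode)


def owner_may_read(attr, caller_uid):
    """Owner-only read check against whatever IDs the attributes carry."""
    return caller_uid == attr.uid and bool(attr.mode & stat.S_IRUSR)


lower = Attr(uid=0, gid=0, mode=0o600)    # as stored in the image
upper = getattr_shifted(lower, 1000000)   # as seen through the shift
```

Here a caller with uid 1000000 passes the owner check on the upper
attributes, while uid 0 does not, without the check itself knowing that a
shift happened.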

>  We want to handle also stock filesystems that were never edited
> without depending on any module or third party solution, mounting
> them outside user namespaces, and access inside.

OK, but those are basically my requirements ... you didn't mention any of
the esoteric filesystem ioctls, so I assume from the below you're not
interested in shifting the uids there either?

> > volumes.  I actually think that other operations, like the file 
> > ioctl ones should, for security reasons, not be uid shifted.  For
> > instance with xfs you could set the panic mask and error tags and 
> > bring down the whole host.  What extra things do you need access to
> > and why?
> 
> That's why precisely I said that mounting options not 

Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-05 Thread Djalal Harouni
On Thu, May 05, 2016 at 07:56:28AM -0400, James Bottomley wrote:
> On Thu, 2016-05-05 at 08:36 +0100, Djalal Harouni wrote:
> > On Wed, May 04, 2016 at 05:06:19PM -0400, James Bottomley wrote:
> > > On Wed, 2016-05-04 at 16:26 +0200, Djalal Harouni wrote:
> > > > This is version 2 of the VFS:userns support portable root
> > > > filesystems
> > > > RFC. Changes since version 1:
> > > > 
> > > > * Update documentation and remove some ambiguity about the
> > > > feature.
> > > >   Based on Josh Triplett comments.
> > > > * Use a new email address to send the RFC :-)
> > > > 
> > > > 
> > > > This RFC tries to explore how to support filesystem operations 
> > > > inside user namespace using only VFS and a per mount namespace 
> > > > solution. This allows to take advantage of user namespace 
> > > > separations without introducing any change at the filesystems 
> > > > level. All this is handled with the virtual view of mount
> > > > namespaces.
> > > > 
> > > > 
> > > > 1) Presentation:
> > > > 
> > > > 
> > > > The main aim is to support portable root filesystems and allow 
> > > > containers, virtual machines and other cases to use the same root
> > > > filesystem. Due to security reasons, filesystems can't be mounted
> > > > inside user namespaces, and mounting them outside will not solve 
> > > > the problem since they will show up with the wrong UIDs/GIDs. 
> > > > Read and write operations will also fail and so on.
> > > > 
> > > > The current userspace solution is to automatically chown the 
> > > > whole root filesystem before starting a container, example:
> > > > (host) init_user_ns  100:1065536  => (container) user_ns_X1
> > > > 0:65535
> > > > (host) init_user_ns  200:2065536  => (container) user_ns_Y1
> > > > 0:65535
> > > > (host) init_user_ns  300:3065536  => (container) user_ns_Z1
> > > > 0:65535
> > > > ...
> > > > 
> > > > Every time a chown is called, files are changed and so on... This
> > > > prevents to have portable filesystems where you can throw 
> > > > anywhere and boot. Having an extra step to adapt the filesystem 
> > > > to the current mapping and persist it will not allow to verify 
> > > > its integrity, it makes snapshots and migration a bit harder, and 
> > > > probably other limitations...
> > > > 
> > > > It seems that there are multiple ways to allow user namespaces 
> > > > combine nicely with filesystems, but none of them is that easy. 
> > > > The bind mount and pin the user namespace during mount time will 
> > > > not work, bind mounts share the same super block, hence you may 
> > > > end up working on the wrong vfsmount context and there is no easy 
> > > > way to get out of that...
> > > 
> > > So this option was discussed at the recent LSF/MM summit.  The most
> > > supported suggestion was that you'd use a new internal fs type that 
> > > had a struct mount with a new superblock and would copy the 
> > > underlying inodes but substitute its own with modified ->getattr/
> > > ->setattr calls that did the uid shift.  In many ways it would be a 
> > > remapping bind which would look similar to overlayfs but be a lot
> > > simpler.
> > 
> > Hmm, it's not only about ->getattr and ->setattr, you have all the 
> > other file system operations that need access too...
> 
> Why?  Or perhaps we should more cogently define the actual problem.  My
> problem is simply mounting image volumes that were created with real
> uids at user namespace shifted uids because I'm downshifting the
> privileged ids in the container.  I actually *only* need the uid/gids
> on the attributes shifted because that's what I need to manipulate the

Obviously we need them for read/write/creation...?! We also want to
handle stock filesystems that were never edited, without depending on
any module or third-party solution, by mounting them outside user
namespaces and accessing them inside.

> volumes.  I actually think that other operations, like the file ioctl
> ones should, for security reasons, not be uid shifted.  For instance
> with xfs you could set the panic mask and error tags and bring down the
> whole host.  What extra things do you need access to and why?

That's precisely why I said mount options and not something *inside*
filesystems: if it is done behind their back, on behalf of container
managers etc., then you are exposed to such scenarios... Some virtual
filesystems can also be mounted by unprivileged users; how will you deal
with something like a bind mount on them?


> >  which brings two points:
> > 
> > 1) This new internal fs may end up doing what this RFC does...
> 
> Well that was why I brought it up, yes.

Yes, but *with* extra code! That was my point. I'm not sure we need to
bother with any *new* internal fs type, nor hack around dir and file
operations... and that still has to be shown, defined, coded...?


> > 2) or by quoting "new internal fs + its own super block + copy
> > underlying inodes..." it seems like another overlayfs where you also
> > need some decisions to copy 

Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-05 Thread Djalal Harouni
On Thu, May 05, 2016 at 07:56:28AM -0400, James Bottomley wrote:
> On Thu, 2016-05-05 at 08:36 +0100, Djalal Harouni wrote:
> > On Wed, May 04, 2016 at 05:06:19PM -0400, James Bottomley wrote:
> > > On Wed, 2016-05-04 at 16:26 +0200, Djalal Harouni wrote:
> > > > This is version 2 of the VFS:userns support portable root
> > > > filesystems
> > > > RFC. Changes since version 1:
> > > > 
> > > > * Update documentation and remove some ambiguity about the
> > > > feature.
> > > >   Based on Josh Triplett comments.
> > > > * Use a new email address to send the RFC :-)
> > > > 
> > > > 
> > > > This RFC tries to explore how to support filesystem operations 
> > > > inside user namespace using only VFS and a per mount namespace 
> > > > solution. This allows to take advantage of user namespace 
> > > > separations without introducing any change at the filesystems 
> > > > level. All this is handled with the virtual view of mount
> > > > namespaces.
> > > > 
> > > > 
> > > > 1) Presentation:
> > > > 
> > > > 
> > > > The main aim is to support portable root filesystems and allow 
> > > > containers, virtual machines and other cases to use the same root
> > > > filesystem. Due to security reasons, filesystems can't be mounted
> > > > inside user namespaces, and mounting them outside will not solve 
> > > > the problem since they will show up with the wrong UIDs/GIDs. 
> > > > Read and write operations will also fail and so on.
> > > > 
> > > > The current userspace solution is to automatically chown the 
> > > > whole root filesystem before starting a container, example:
> > > > (host) init_user_ns  100:1065536  => (container) user_ns_X1
> > > > 0:65535
> > > > (host) init_user_ns  200:2065536  => (container) user_ns_Y1
> > > > 0:65535
> > > > (host) init_user_ns  300:3065536  => (container) user_ns_Z1
> > > > 0:65535
> > > > ...
> > > > 
> > > > Every time a chown is called, files are changed and so on... This
> > > > prevents to have portable filesystems where you can throw 
> > > > anywhere and boot. Having an extra step to adapt the filesystem 
> > > > to the current mapping and persist it will not allow to verify 
> > > > its integrity, it makes snapshots and migration a bit harder, and 
> > > > probably other limitations...
> > > > 
> > > > It seems that there are multiple ways to allow user namespaces 
> > > > combine nicely with filesystems, but none of them is that easy. 
> > > > The bind mount and pin the user namespace during mount time will 
> > > > not work, bind mounts share the same super block, hence you may 
> > > > endup working on the wrong vfsmount context and there is no easy 
> > > > way to get out of that...
> > > 
> > > So this option was discussed at the recent LSF/MM summit.  The most
> > > supported suggestion was that you'd use a new internal fs type that 
> > > had a struct mount with a new superblock and would copy the 
> > > underlying inodes but substitute its own with modified ->getattr/
> > > ->setattr calls that did the uid shift.  In many ways it would be a 
> > > remapping bind which would look similar to overlayfs but be a lot
> > > simpler.
> > 
> > Hmm, it's not only about ->getattr and ->setattr, you have all the 
> > other file system operations that need access too...
> 
> Why?  Or perhaps we should more cogently define the actual problem.  My
> problem is simply mounting image volumes that were created with real
> uids at user namespace shifted uids because I'm downshifting the
> privileged ids in the container.  I actually *only* need the uid/gids
> on the attributes shifted because that's what I need to manipulate the

We obviously need them for read/write/creation...?! We also want to
handle stock filesystems that were never edited, without depending on
any module or third-party solution, by mounting them outside user
namespaces and accessing them inside.

> volumes.  I actually think that other operations, like the file ioctl
> ones should, for security reasons, not be uid shifted.  For instance
> with xfs you could set the panic mask and error tags and bring down the
> whole host.  What extra things do you need access to and why?

That's precisely why I said that if mount options are handled not
*inside* filesystems, which means behind their back and on behalf of
container managers etc., then you are exposed to such scenarios... Some
virtual file systems can also be mounted by unprivileged users; how will
you deal with something like a bind mount on them?


> >  which brings two points:
> > 
> > 1) This new internal fs may end up doing what this RFC does...
> 
> Well that was why I brought it up, yes.

Yes, but *with* extra code! That was my point. I'm not sure we need to
bother with any *new* internal fs type or hack around dir and file
operations... yet that has to be shown, defined, coded...?


> > 2) or by quoting "new internal fs + its own super block + copy
> > underlying inodes..." it seems like another overlayfs where you also
> > need some decisions to copy 

Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-05 Thread James Bottomley
On Thu, 2016-05-05 at 08:36 +0100, Djalal Harouni wrote:
> On Wed, May 04, 2016 at 05:06:19PM -0400, James Bottomley wrote:
> > On Wed, 2016-05-04 at 16:26 +0200, Djalal Harouni wrote:
> > > This is version 2 of the VFS:userns support portable root
> > > filesystems
> > > RFC. Changes since version 1:
> > > 
> > > * Update documentation and remove some ambiguity about the
> > > feature.
> > >   Based on Josh Triplett comments.
> > > * Use a new email address to send the RFC :-)
> > > 
> > > 
> > > This RFC tries to explore how to support filesystem operations 
> > > inside user namespace using only VFS and a per mount namespace 
> > > solution. This allows to take advantage of user namespace 
> > > separations without introducing any change at the filesystems 
> > > level. All this is handled with the virtual view of mount
> > > namespaces.
> > > 
> > > 
> > > 1) Presentation:
> > > 
> > > 
> > > The main aim is to support portable root filesystems and allow 
> > > containers, virtual machines and other cases to use the same root
> > > filesystem. Due to security reasons, filesystems can't be mounted
> > > inside user namespaces, and mounting them outside will not solve 
> > > the problem since they will show up with the wrong UIDs/GIDs. 
> > > Read and write operations will also fail and so on.
> > > 
> > > The current userspace solution is to automatically chown the 
> > > whole root filesystem before starting a container, example:
> > > (host) init_user_ns  100:1065536  => (container) user_ns_X1 0:65535
> > > (host) init_user_ns  200:2065536  => (container) user_ns_Y1 0:65535
> > > (host) init_user_ns  300:3065536  => (container) user_ns_Z1 0:65535
> > > ...
> > > 
> > > Every time a chown is called, files are changed and so on... This
> > > prevents to have portable filesystems where you can throw 
> > > anywhere and boot. Having an extra step to adapt the filesystem 
> > > to the current mapping and persist it will not allow to verify 
> > > its integrity, it makes snapshots and migration a bit harder, and 
> > > probably other limitations...
> > > 
> > > It seems that there are multiple ways to allow user namespaces 
> > > combine nicely with filesystems, but none of them is that easy. 
> > > The bind mount and pin the user namespace during mount time will 
> > > not work, bind mounts share the same super block, hence you may 
> > > end up working on the wrong vfsmount context and there is no easy
> > > way to get out of that...
> > 
> > So this option was discussed at the recent LSF/MM summit.  The most
> > supported suggestion was that you'd use a new internal fs type that 
> > had a struct mount with a new superblock and would copy the 
> > underlying inodes but substitute its own with modified ->getattr/
> > ->setattr calls that did the uid shift.  In many ways it would be a 
> > remapping bind which would look similar to overlayfs but be a lot
> > simpler.
> 
> Hmm, it's not only about ->getattr and ->setattr, you have all the 
> other file system operations that need access too...

Why?  Or perhaps we should more cogently define the actual problem.  My
problem is simply mounting image volumes that were created with real
uids at user namespace shifted uids because I'm downshifting the
privileged ids in the container.  I actually *only* need the uid/gids
on the attributes shifted because that's what I need to manipulate the
volumes.  I actually think that other operations, like the file ioctl
ones should, for security reasons, not be uid shifted.  For instance
with xfs you could set the panic mask and error tags and bring down the
whole host.  What extra things do you need access to and why?

>  which brings two points:
> 
> 1) This new internal fs may end up doing what this RFC does...

Well that was why I brought it up, yes.

> 2) or by quoting "new internal fs + its own super block + copy
> underlying inodes..." it seems like another overlayfs where you also
> need some decisions to copy what, etc. So, will this be really
> that light compared to current overlayfs ? not to mention that you 
> need to hook up basically the same logic or something else inside
> overlayfs..

OK, so forget overlayfs, perhaps that was a bad example.  It's like a
uid shifting bind.  The way it works is to use shadow inodes (unlike
bind, but because you have to intercept the operations, so it's not a
simple subtree operation) but there's no file copying.  The shadow
points to the real inode.
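
A minimal sketch of the uid shifting bind described above, written in
Python rather than kernel C and meant purely as an illustration. The
names Inode, ShadowInode and the shift value are hypothetical and not
taken from any patch. The shadow object keeps a reference to the real
inode (no copy) and only translates ownership in getattr/setattr:

```python
from dataclasses import dataclass


@dataclass
class Inode:
    """Stand-in for a real (lower) inode; only ownership is modelled."""
    uid: int
    gid: int


class ShadowInode:
    """Hypothetical shadow inode: references the real inode, copies nothing."""

    def __init__(self, real: Inode, shift: int):
        self.real = real      # pointer to the underlying inode
        self.shift = shift    # assumed constant uid/gid offset

    def getattr(self) -> dict:
        # Report on-disk ids moved up by the shift.
        return {"uid": self.real.uid + self.shift,
                "gid": self.real.gid + self.shift}

    def setattr(self, uid=None, gid=None) -> None:
        # Undo the shift before touching the real inode; ids below the
        # shift have no on-disk equivalent and are rejected.
        for name, value in (("uid", uid), ("gid", gid)):
            if value is None:
                continue
            if value < self.shift:
                raise PermissionError(f"{name} {value} is outside the shifted range")
            setattr(self.real, name, value - self.shift)


real = Inode(uid=0, gid=0)
shadow = ShadowInode(real, shift=1000000)
print(shadow.getattr())      # ownership as seen through the shifted mount
shadow.setattr(uid=1000005)  # stored on the real inode as uid 5
print(real)
```

Only the attribute path is intercepted in this sketch; other operations
such as ioctls would be passed through or refused without any shifting.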

> > > Using the user namespace in the super block seems the way to go, 
> > > and there is the "Support fuse mounts in user namespaces" [1] 
> > > patches which seem nice but perhaps too complex!?
> > 
> > So I don't think that does what you want.  The fuse project I've 
> > used before to do uid/gid shifts for build containers is bindfs
> > 
> > https://github.com/mpartel/bindfs/
> > 
> > It allows a --map argument where you specify pairs of uids/gids to 
> > map (tedious for 

Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-05 Thread Djalal Harouni
On Wed, May 04, 2016 at 05:06:19PM -0400, James Bottomley wrote:
> On Wed, 2016-05-04 at 16:26 +0200, Djalal Harouni wrote:
> > This is version 2 of the VFS:userns support portable root filesystems
> > RFC. Changes since version 1:
> > 
> > * Update documentation and remove some ambiguity about the feature.
> >   Based on Josh Triplett comments.
> > * Use a new email address to send the RFC :-)
> > 
> > 
> > This RFC tries to explore how to support filesystem operations inside
> > user namespace using only VFS and a per mount namespace solution.
> > This allows to take advantage of user namespace separations without
> > introducing any change at the filesystems level. All this is handled
> > with the virtual view of mount namespaces.
> > 
> > 
> > 1) Presentation:
> > 
> > 
> > The main aim is to support portable root filesystems and allow 
> > containers, virtual machines and other cases to use the same root 
> > filesystem. Due to security reasons, filesystems can't be mounted 
> > inside user namespaces, and mounting them outside will not solve the 
> > problem since they will show up with the wrong UIDs/GIDs. Read and 
> > write operations will also fail and so on.
> > 
> > The current userspace solution is to automatically chown the whole 
> > root filesystem before starting a container, example:
> > (host) init_user_ns  100:1065536  => (container) user_ns_X1 0:65535
> > (host) init_user_ns  200:2065536  => (container) user_ns_Y1 0:65535
> > (host) init_user_ns  300:3065536  => (container) user_ns_Z1 0:65535
> > ...
> > 
> > Every time a chown is called, files are changed and so on... This
> > prevents to have portable filesystems where you can throw anywhere
> > and boot. Having an extra step to adapt the filesystem to the current
> > mapping and persist it will not allow to verify its integrity, it 
> > makes snapshots and migration a bit harder, and probably other
> > limitations...
> > 
> > It seems that there are multiple ways to allow user namespaces 
> > combine nicely with filesystems, but none of them is that easy. The 
> > bind mount and pin the user namespace during mount time will not 
> > work, bind mounts share the same super block, hence you may end up
> > working on the wrong vfsmount context and there is no easy way to get
> > out of that...
> 
> So this option was discussed at the recent LSF/MM summit.  The most
> supported suggestion was that you'd use a new internal fs type that had
> a struct mount with a new superblock and would copy the underlying
> inodes but substitute its own with modified ->getattr/->setattr calls
> that did the uid shift.  In many ways it would be a remapping bind
> which would look similar to overlayfs but be a lot simpler.

Hmm, it's not only about ->getattr and ->setattr, you have all the other
file system operations that need access too... which brings two points:

1) This new internal fs may end up doing what this RFC does...

2) or, quoting "new internal fs + its own super block + copy underlying
inodes...", it looks like another overlayfs where you also need to decide
what to copy, etc. So will this really be that light compared to current
overlayfs? Not to mention that you need to hook up basically the same
logic, or something else, inside overlayfs...

> > Using the user namespace in the super block seems the way to go, and
> > there is the "Support fuse mounts in user namespaces" [1] patches 
> > which seem nice but perhaps too complex!?
> 
> So I don't think that does what you want.  The fuse project I've used
> before to do uid/gid shifts for build containers is bindfs
> 
> https://github.com/mpartel/bindfs/
> 
> It allows a --map argument where you specify pairs of uids/gids to map
> (tedious for large ranges, but the map can be fixed to use uid:range
> instead of individual).

OK, thanks for the link. I will try to take a deeper look, but bindfs
seems really big!
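
For illustration only, a range-based map (as opposed to individual
uid pairs) could look roughly like the sketch below. It uses the same
(inside, outside, count) triple as /proc/<pid>/uid_map; the function
name map_id is made up, and this is not the actual bindfs option syntax:

```python
def map_id(ranges, ident):
    """Translate ident through a list of (src_first, dst_first, count) ranges.

    Returns the mapped id, or None when ident falls outside every range.
    """
    for src_first, dst_first, count in ranges:
        if src_first <= ident < src_first + count:
            return dst_first + (ident - src_first)
    return None


# One range covers 65536 ids instead of 65536 individual pairs.
ranges = [(0, 1000000, 65536)]
print(map_id(ranges, 0))
print(map_id(ranges, 65535))
print(map_id(ranges, 65536))
```

An unmapped id yields None here; what to do with such ids (refuse,
or present them as some overflow id) is a separate policy decision.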

> >  there is also the overlayfs solution, and finally the VFS layer
> > solution.
> > 
> > We present here a simple VFS solution, everything is packed inside 
> > VFS, filesystems don't need to know anything (except probably XFS, 
> > and special operations inside union filesystems). Currently it 
> > supports ext4, btrfs and overlayfs. Changes into filesystems are 
> > small, just parse the vfs_shift_uids and vfs_shift_gids options 
> > during mount and set the appropriate flags into the super_block
> > structure.
> 
> So this looks a little daunting.  It sprays the VFS with knowledge
> about the shifts and requires support from every underlying filesystem.

Well, from my angle, shifts are just user namespace mappings which
follow certain rules, and currently VFS and all filesystems are *already*
doing some kind of shifting... This RFC uses mount namespaces which are
the standard way to deal with mounts, now the mapping inside mount
namespace can just be "inside: 0:1000" => "outside: 0:1000"
and current implementation will just use it, at 

Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-04 Thread Andy Lutomirski
On May 4, 2016 7:25 PM, "Dave Chinner"  wrote:
>
> On Wed, May 04, 2016 at 06:44:14PM -0700, Andy Lutomirski wrote:
> > On Wed, May 4, 2016 at 5:23 PM, Dave Chinner  wrote:
> > > On Wed, May 04, 2016 at 04:26:46PM +0200, Djalal Harouni wrote:
> > >> This is version 2 of the VFS:userns support portable root filesystems
> > >> RFC. Changes since version 1:
> > >>
> > >> * Update documentation and remove some ambiguity about the feature.
> > >>   Based on Josh Triplett comments.
> > >> * Use a new email address to send the RFC :-)
> > >>
> > >>
> > >> This RFC tries to explore how to support filesystem operations inside
> > >> user namespace using only VFS and a per mount namespace solution. This
> > >> allows to take advantage of user namespace separations without
> > >> introducing any change at the filesystems level. All this is handled
> > >> with the virtual view of mount namespaces.
> > >
> > > [...]
> > >
> > >> As an example if the mapping 0:65535 inside mount namespace and outside
> > >> is 100:1065536, then 0:65535 will be the range that we use to
> > >> construct UIDs/GIDs mapping into init_user_ns and use it for on-disk
> > >> data. They represent the persistent values that we want to write to the
> > > >> disk. Therefore, we don't keep track of any UID/GID shift that was applied
> > >> before, it gives portability and allows to use the previous mapping
> > >> which was freed for another root filesystem...
> > >
> > > So let me get this straight. Two /isolated/ containers, different
> > > UID/GID mappings, sharing the same files and directories. Create a
> > > new file in a writeable directory in container 1, namespace
> > > information gets stripped from on-disk uid/gid representation.
> >
> > I think the intent is a totally separate superblock for each
> > container.  Djalal, am I right?
>
> I'm pretty sure you can't have multiple superblocks point to the
> same backing device. Each superblock would then think it's the sole
> owner of the filesystem and all we get out of that is incoherent
> caching and a corrupt on-disk filesystem.

I meant separate backing stores, too.

--Andy

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> da...@fromorbit.com


Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-04 Thread Dave Chinner
On Wed, May 04, 2016 at 06:44:14PM -0700, Andy Lutomirski wrote:
> On Wed, May 4, 2016 at 5:23 PM, Dave Chinner  wrote:
> > On Wed, May 04, 2016 at 04:26:46PM +0200, Djalal Harouni wrote:
> >> This is version 2 of the VFS:userns support portable root filesystems
> >> RFC. Changes since version 1:
> >>
> >> * Update documentation and remove some ambiguity about the feature.
> >>   Based on Josh Triplett comments.
> >> * Use a new email address to send the RFC :-)
> >>
> >>
> >> This RFC tries to explore how to support filesystem operations inside
> >> user namespace using only VFS and a per mount namespace solution. This
> >> allows to take advantage of user namespace separations without
> >> introducing any change at the filesystems level. All this is handled
> >> with the virtual view of mount namespaces.
> >
> > [...]
> >
> >> As an example if the mapping 0:65535 inside mount namespace and outside
> >> is 100:1065536, then 0:65535 will be the range that we use to
> >> construct UIDs/GIDs mapping into init_user_ns and use it for on-disk
> >> data. They represent the persistent values that we want to write to the
> >> disk. Therefore, we don't keep track of any UID/GID shift that was applied
> >> before, it gives portability and allows to use the previous mapping
> >> which was freed for another root filesystem...
> >
> > So let me get this straight. Two /isolated/ containers, different
> > UID/GID mappings, sharing the same files and directories. Create a
> > new file in a writeable directory in container 1, namespace
> > information gets stripped from on-disk uid/gid representation.
> 
> I think the intent is a totally separate superblock for each
> container.  Djalal, am I right?

I'm pretty sure you can't have multiple superblocks point to the
same backing device. Each superblock would then think it's the sole
owner of the filesystem and all we get out of that is incoherent
caching and a corrupt on-disk filesystem.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-04 Thread Andy Lutomirski
On Wed, May 4, 2016 at 5:23 PM, Dave Chinner  wrote:
> On Wed, May 04, 2016 at 04:26:46PM +0200, Djalal Harouni wrote:
>> This is version 2 of the VFS:userns support portable root filesystems
>> RFC. Changes since version 1:
>>
>> * Update documentation and remove some ambiguity about the feature.
>>   Based on Josh Triplett comments.
>> * Use a new email address to send the RFC :-)
>>
>>
>> This RFC tries to explore how to support filesystem operations inside
>> user namespace using only VFS and a per mount namespace solution. This
>> allows to take advantage of user namespace separations without
>> introducing any change at the filesystems level. All this is handled
>> with the virtual view of mount namespaces.
>
> [...]
>
>> As an example if the mapping 0:65535 inside mount namespace and outside
>> is 100:1065536, then 0:65535 will be the range that we use to
>> construct UIDs/GIDs mapping into init_user_ns and use it for on-disk
>> data. They represent the persistent values that we want to write to the
>> disk. Therefore, we don't keep track of any UID/GID shift that was applied
>> before, it gives portability and allows to use the previous mapping
>> which was freed for another root filesystem...
>
> So let me get this straight. Two /isolated/ containers, different
> UID/GID mappings, sharing the same files and directories. Create a
> new file in a writeable directory in container 1, namespace
> information gets stripped from on-disk uid/gid representation.

I think the intent is a totally separate superblock for each
container.  Djalal, am I right?

The feature that seems to me to be missing is the ability to squash
uids.  I can imagine desktop distros wanting to mount removable
storage such that everything shows up (to permission checks and
stat()) as the logged-in user's uid but that the filesystem sees 0:0.
That can be done by shifting, but the distro would want everything
else on the filesystem to show up as the logged-in user as well.
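
To make the difference between shifting and squashing concrete, here is a minimal userspace sketch in Python (not kernel code). The function names, the range size of 65536 and the example uid 1000 are assumptions made for illustration; none of them come from the patch series.

```python
from typing import Optional


def shift_to_view(disk_id: int, base: int, count: int = 65536) -> Optional[int]:
    """Shift: on-disk id n in [0, count) appears as base + n; other ids are unmapped."""
    if 0 <= disk_id < count:
        return base + disk_id
    return None


def squash_to_view(disk_id: int, user_id: int) -> int:
    """Squash: every on-disk id appears as the logged-in user's id."""
    return user_id


def squash_to_disk(view_id: int, user_id: int) -> Optional[int]:
    """Writes by the logged-in user reach the filesystem as id 0; others are refused."""
    return 0 if view_id == user_id else None
```

With a shift based at 1000, on-disk uid 0 shows up as 1000 but on-disk uid 500 shows up as 1500, which is not the logged-in user. With a squash, both show up as 1000.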

That use case could also be handled by adding a way to tell a given
filesystem to completely opt out of normal access control rules and
just let a given user act as root wrt that filesystem (and be nosuid,
of course).  This would be a much greater departure from current
behavior, but would let normal users chown things on a removable
device, which is potentially nice.

--Andy


Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-04 Thread Dave Chinner
On Wed, May 04, 2016 at 04:26:46PM +0200, Djalal Harouni wrote:
> This is version 2 of the VFS:userns support portable root filesystems
> RFC. Changes since version 1:
> 
> * Update documentation and remove some ambiguity about the feature.
>   Based on Josh Triplett comments.
> * Use a new email address to send the RFC :-)
> 
> 
> This RFC tries to explore how to support filesystem operations inside
> user namespace using only VFS and a per mount namespace solution. This
> allows to take advantage of user namespace separations without
> introducing any change at the filesystems level. All this is handled
> with the virtual view of mount namespaces.

[...]

> As an example if the mapping 0:65535 inside mount namespace and outside
> is 100:1065536, then 0:65535 will be the range that we use to
> construct UIDs/GIDs mapping into init_user_ns and use it for on-disk
> data. They represent the persistent values that we want to write to the
> disk. Therefore, we don't keep track of any UID/GID shift that was applied
> before, it gives portability and allows to use the previous mapping
> which was freed for another root filesystem...

So let me get this straight. Two /isolated/ containers, different
UID/GID mappings, sharing the same files and directories. Create a
new file in a writeable directory in container 1, namespace
information gets stripped from on-disk uid/gid representation.

Container 2 then reads that shared directory, finds the file written
by container 1. As there is no namespace component to the uid:gid
stored in the inode, we apply the current namespace shift to the VFS
inode uid/gid and so it maps to root in container 2 and we are
allowed to read it?

Unless I've misunderstood something in this crazy mapping scheme,
isn't this just a vector for unintentional containment breaches?
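
The scenario above can be modelled in a few lines of userspace Python (not kernel code). The two host-side bases and the range size are hypothetical values picked for illustration, and `to_disk`/`to_view` are made-up names standing in for whatever helpers the series uses.

```python
RANGE = 65536                 # assumed size of each container's id range
CONTAINER_1_BASE = 1_000_000  # hypothetical host-side base for container 1
CONTAINER_2_BASE = 2_000_000  # hypothetical host-side base for container 2


def to_disk(kernel_id: int, base: int) -> int:
    """Strip the writer's namespace shift before the id is stored."""
    if not base <= kernel_id < base + RANGE:
        raise ValueError("id is not mapped in this namespace")
    return kernel_id - base


def to_view(disk_id: int, base: int) -> int:
    """Apply the reader's namespace shift to an id read from disk."""
    if not 0 <= disk_id < RANGE:
        raise ValueError("id is outside the shifted range")
    return base + disk_id


# Root in container 1 creates a file; the stored id carries no namespace.
stored = to_disk(CONTAINER_1_BASE, CONTAINER_1_BASE)

# Container 2 reads the same inode and applies its own shift.
seen_by_container_2 = to_view(stored, CONTAINER_2_BASE)
```

Here `stored` is 0 and `seen_by_container_2` equals `CONTAINER_2_BASE`, which is root inside container 2: the file written by one container's root is owned by the other container's root.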

[...]

> Simple demo overlayfs, and  btrfs mounted with vfs_shift_uids and
> vfs_shift_gids. The overlayfs mounts will share the same upperdir. We
> create two user namespaces, each with its own mapping, and where
> container-uid-200 will pull changes from container-uid-100
> upperdir automatically.

Ok, forget I asked - it's clearly intentional. This is beyond
crazy, IMO.

> 3) ROADMAP:
> ===
> * Confirm current design, and make sure that the mapping is done
>   correctly.

How are you going to ensure that all filesystems behave the same,
and it doesn't get broken by people who really don't care about this
sort of crazy?

FWIW, having the VFS convert things to "on-disk format" is an
oxymoron - the "V" in VFS means "virtual" and has nothing to do with
disks or persistent storage formats. Indeed, let's convert the UID
to "on-disk" format for a network filesystem client.

> * Add XFS support.

What is the problem here?

Next question: how does this work with uid/gid based quotas?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-04 Thread Serge Hallyn
Quoting Djalal Harouni (tix...@gmail.com):
> This is version 2 of the VFS:userns support portable root filesystems
> RFC. Changes since version 1:
> 
> * Update documentation and remove some ambiguity about the feature.
>   Based on Josh Triplett comments.
> * Use a new email address to send the RFC :-)
> 
> 
> This RFC tries to explore how to support filesystem operations inside
> user namespace using only VFS and a per mount namespace solution. This
> allows to take advantage of user namespace separations without
> introducing any change at the filesystems level. All this is handled
> with the virtual view of mount namespaces.

Given your use case, is there any way we could work in some tradeoffs
to protect the host?  What I'm thinking is that containers can all
share devices uid-mapped at will, however any device mounted with
uid shifting cannot be used by the initial user namespace.  Or maybe
just non-executable in that case, as you'll need enough access to
the fs to set up the containers you want to run.

So if /dev/sda1 is your host /, you have to use /dev/sda2 as the
container rootfs source.  Mount it under /containers with uid
shifting.  Now all containers regardless of uid mappings see
the shifted fs contents.  But the host root cannot be tricked by
files on it, as /dev/sda2 is non-executable as far as it is
concerned.

Just a thought.


Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-04 Thread James Bottomley
On Wed, 2016-05-04 at 16:26 +0200, Djalal Harouni wrote:
> This is version 2 of the VFS:userns support portable root filesystems
> RFC. Changes since version 1:
> 
> * Update documentation and remove some ambiguity about the feature.
>   Based on Josh Triplett comments.
> * Use a new email address to send the RFC :-)
> 
> 
> This RFC tries to explore how to support filesystem operations inside
> user namespace using only VFS and a per mount namespace solution. This
> allows to take advantage of user namespace separations without
> introducing any change at the filesystems level. All this is handled
> with the virtual view of mount namespaces.
> 
> 
> 1) Presentation:
> 
> 
> The main aim is to support portable root filesystems and allow 
> containers, virtual machines and other cases to use the same root 
> filesystem. Due to security reasons, filesystems can't be mounted 
> inside user namespaces, and mounting them outside will not solve the 
> problem since they will show up with the wrong UIDs/GIDs. Read and 
> write operations will also fail and so on.
> 
> The current userspace solution is to automatically chown the whole 
> root filesystem before starting a container, example:
> (host) init_user_ns  100:1065536  => (container) user_ns_X1  0:65535
> (host) init_user_ns  200:2065536  => (container) user_ns_Y1  0:65535
> (host) init_user_ns  300:3065536  => (container) user_ns_Z1  0:65535
> ...
> 
> Every time a chown is called, files are changed and so on... This
> prevents to have portable filesystems where you can throw anywhere
> and boot. Having an extra step to adapt the filesystem to the current
> mapping and persist it will not allow to verify its integrity, it 
> makes snapshots and migration a bit harder, and probably other
> limitations...
> 
> It seems that there are multiple ways to allow user namespaces 
> combine nicely with filesystems, but none of them is that easy. The 
> bind mount and pin the user namespace during mount time will not 
> work, bind mounts share the same super block, hence you may end up
> working on the wrong vfsmount context and there is no easy way to get
> out of that...

So this option was discussed at the recent LSF/MM summit.  The most
supported suggestion was that you'd use a new internal fs type that had
a struct mount with a new superblock and would copy the underlying
inodes but substitute its own with modified ->getattr/->setattr calls
that did the uid shift.  In many ways it would be a remapping bind
which would look similar to overlayfs but be a lot simpler.
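
As a rough illustration of that idea, here is a userspace model in Python (not kernel code). The class names, the fixed offset and the use of 65534 for unmapped ids are assumptions made for this sketch; the actual proposal would live in a new in-kernel filesystem type.

```python
from dataclasses import dataclass
from typing import Tuple

OVERFLOW_ID = 65534  # Linux's default overflow uid/gid, used here for unmapped ids


@dataclass
class LowerInode:
    """Ownership as stored by the underlying filesystem."""
    uid: int
    gid: int


class ShiftedInode:
    """Upper inode of the hypothetical remapping bind: only attribute calls differ."""

    def __init__(self, lower: LowerInode, offset: int, count: int = 65536) -> None:
        self.lower = lower
        self.offset = offset
        self.count = count

    def _up(self, lower_id: int) -> int:
        # Lower ids outside [0, count) have no image in the shifted view.
        return lower_id + self.offset if 0 <= lower_id < self.count else OVERFLOW_ID

    def getattr(self) -> Tuple[int, int]:
        """Report the lower ownership shifted into the container's range."""
        return self._up(self.lower.uid), self._up(self.lower.gid)

    def setattr(self, uid: int, gid: int) -> None:
        """Undo the shift before storing; refuse ids outside the mapped range."""
        for value in (uid, gid):
            if not self.offset <= value < self.offset + self.count:
                raise PermissionError(f"id {value} is not mapped")
        self.lower.uid = uid - self.offset
        self.lower.gid = gid - self.offset
```

The underlying filesystem only ever sees unshifted ids, so it needs no knowledge of the mapping; everything specific to the shift sits in the two overridden calls.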

> Using the user namespace in the super block seems the way to go, and
> there is the "Support fuse mounts in user namespaces" [1] patches 
> which seem nice but perhaps too complex!?

So I don't think that does what you want.  The fuse project I've used
before to do uid/gid shifts for build containers is bindfs

https://github.com/mpartel/bindfs/

It allows a --map argument where you specify pairs of uids/gids to map
(tedious for large ranges, but the map can be fixed to use uid:range
instead of individual).
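
To show why per-id pairs get tedious, here is a small Python helper that builds such a --map argument for a contiguous range. It assumes the colon-separated `user1/user2` and `@group1/@group2` pair syntax as described in the bindfs manual; check that against your installed version. The function name and the example numbers are made up for illustration.

```python
def bindfs_map(lower_start: int, upper_start: int, count: int) -> str:
    """Build a --map argument pairing count consecutive uids and gids one by one."""
    users = [f"{lower_start + i}/{upper_start + i}" for i in range(count)]
    groups = [f"@{lower_start + i}/@{upper_start + i}" for i in range(count)]
    return "--map=" + ":".join(users + groups)


# Two ids already need four pairs; a 65536-id container range needs 131072.
example = bindfs_map(0, 1000, 2)
```

Here `example` is `--map=0/1000:1/1001:@0/@1000:@1/@1001`.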

> there is also the overlayfs solution, and finally the VFS layer
> solution.
> 
> We present here a simple VFS solution, everything is packed inside 
> VFS, filesystems don't need to know anything (except probably XFS, 
> and special operations inside union filesystems). Currently it 
> supports ext4, btrfs and overlayfs. Changes into filesystems are 
> small, just parse the vfs_shift_uids and vfs_shift_gids options 
> during mount and set the appropriate flags into the super_block
> structure.

So this looks a little daunting.  It sprays the VFS with knowledge
about the shifts and requires support from every underlying filesystem.
A simple remapping bind filesystem would be a lot simpler and require
no underlying filesystem support.

James



Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-04 Thread Josh Triplett
On Wed, May 04, 2016 at 04:26:46PM +0200, Djalal Harouni wrote:
> This is version 2 of the VFS:userns support portable root filesystems
> RFC. Changes since version 1:
> 
> * Update documentation and remove some ambiguity about the feature.
>   Based on Josh Triplett comments.

Thanks for the clarifications.

> 3) The existing user namespace interface is the one used to do the
> translation from virtual to on-disk mapping.

This makes sense.  Even if in the future we had a way to supply an
arbitrary VFS UID/GID mapping for a mount, independent of the userns,
what you've proposed would still make sense as a shorthand for the
common case of using the same mapping for both userns and VFS.

- Josh Triplett


[RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-04 Thread Djalal Harouni
if lowerdir, upperdir and workdir are all
on a mount that supports vfs_shift_uids and vfs_shift_gids flags and we
are in a mount namespace that also supports that.

$ mount | grep btrfs
/dev/mapper/fedora-btrfs_root on /mnt/btrfs_root type btrfs (rw,relatime,seclabel,space_cache,vfs_shift_uids,vfs_shift_gids,subvolid=5,subvol=/)
$ cd /mnt/btrfs_root/
$ sudo mkdir -p container-uid-200/{upperdir,workdir,merged}
$ sudo chown -R 200.200 container-uid-200/
$ cd container-uid-200/
$ sudo mount -t overlay overlay -o,lowerdir=/mnt/btrfs_root/rootfs/fedora-tree,upperdir=upperdir,workdir=workdir merged
$ sudo chown -R 200.200 workdir/work/
$ sudo ~/bin/mountns-uidshift -u 200
...
bash-4.3# stat -c '%u:%g' merged/etc/passwd
0:0
bash-4.3# touch merged/overlayfs-file
bash-4.3# stat -c '%u:%g' merged/overlayfs-file
0:0
[outside of namespaces]# stat -c '%u:%g' /mnt/btrfs_root/container-uid-200/merged/overlayfs-file
0:0
[outside of namespaces]# stat -c '%u:%g' /mnt/btrfs_root/container-uid-200/upperdir/overlayfs-file
0:0


2.3.2) Complex support or union filesystems:

If overlayfs lowerdir and upperdir are not on a filesystem that natively
supports vfs_shift_uids and vfs_shift_gids, then to support VFS UID/GID
shifts we must adapt the helper functions that were introduced in this
series to also take a super_block struct and test whether the appropriate
flags were set on overlayfs instead of on the other filesystem to which the
inode belongs. The on-disk <=> virtual translation should then happen inside
overlayfs.
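
A minimal sketch of what such an adapted helper could look like, written as a userspace Python model rather than kernel code. The names `SuperBlock`, `Inode` and `shift_enabled`, and the single `shift_ids` boolean standing in for the vfs_shift_uids/vfs_shift_gids flags, are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SuperBlock:
    fstype: str
    shift_ids: bool  # stands in for the vfs_shift_uids/vfs_shift_gids flags


@dataclass
class Inode:
    sb: SuperBlock  # the filesystem the inode really belongs to


def shift_enabled(inode: Inode, sb: Optional[SuperBlock] = None) -> bool:
    """Test the flags on an explicitly passed superblock, else on the inode's own."""
    return (sb if sb is not None else inode.sb).shift_ids
```

A union filesystem would pass its own superblock, so the decision follows the overlayfs mount even when the inode was fetched from a lower filesystem that knows nothing about the shift.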

I think this will always be the case for union mounts, which fetch inodes
from another mount. Solution (2.3.2) can also be implemented: I had some
ugly patches doing this on top of overlayfs, but I would rather see what
others think about VFS UID/GID shifts first.

IMO solution (2.3.1) if done correctly is the way to go, in the end all
this relates to the virtual view of UID/GID inside the kernel, and how
resources are translated to them, it's not related to overlayfs.


3) ROADMAP:
===
* Confirm current design, and make sure that the mapping is done
  correctly.

* Add clone4() syscall [2]

* Investigate if current setns() checks to enter new mount namespaces
  are sufficient ?

* Add POSIX ACL support ?

* Check if all filesystem operations are correctly supported and recheck
  permissions access.

* Do filesystems provide some operations to control disk or host resources ?
  in other words are there some inodes on filesystems that allow to access
  host resources, if so then maybe these inodes either should be marked only
  safe in init_user_ns or get the appropriate capable() in init_user_ns if
  missing. Needs investigation.

* Add XFS support.



References:
===
[1] https://www.redhat.com/archives/dm-devel/2016-April/msg00368.html
[2] https://lkml.org/lkml/2015/3/15/10
[3] https://raw.githubusercontent.com/OpenDZ/research/master/kernel/mountns-uidshift.c

Thanks!

Patches:
[RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
[RFC v2 PATCH 1/8] VFS: add CLONE_MNTNS_SHIFT_UIDGID flag to allow mounts to shift their UIDs/GIDs
[RFC v2 PATCH 2/8] VFS:uidshift: add flags and helpers to shift UIDs and GIDs to virtual view
[RFC v2 PATCH 3/8] fs: Treat foreign mounts as nosuid
[RFC v2 PATCH 4/8] VFS:userns: shift UID/GID to virtual view during permission access
[RFC v2 PATCH 5/8] VFS:userns: add helpers to shift UIDs and GIDs into on-disk view
[RFC v2 PATCH 6/8] VFS:userns: shift UID/GID to on-disk view before any write to disk
[RFC v2 PATCH 7/8] ext4: add support for vfs_shift_uids and vfs_shift_gids mount options
[RFC v2 PATCH 8/8] btrfs: add support for vfs_shift_uids and vfs_shift_gids mount options


Diffstat for this RFC
 fs/attr.c                      |  44 +++
 fs/btrfs/super.c               |  15 ++-
 fs/exec.c                      |   2 +-
 fs/ext4/super.c                |  14 ++
 fs/inode.c                     |   9 ---
 fs/mount.h                     |   1 +
 fs/namei.c                     |   6 +++--
 fs/namespace.c                 | 190 ++
 fs/stat.c                      |   4 +--
 include/linux/fs.h             |  14 ++
 include/linux/mount.h          |   1 +
 include/linux/user_namespace.h |   8 ++
 include/uapi/linux/sched.h     |   1 +
 kernel/capability.c            |  14 --
 kernel/fork.c                  |   4 +++
 kernel/user_namespace.c        |  13 ++
 security/commoncap.c           |   2 +-
 security/selinux/hooks.c       |   2 +-
 18 files changed, 319 insertions(+), 25 deletions(-)

