Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-22 Thread James Bottomley
On Mon, 2017-02-20 at 17:24 +1300, Eric W. Biederman wrote:
> James Bottomley  writes:
> 
> > On Fri, 2017-02-17 at 14:57 +1300, Eric W. Biederman wrote:
> > > I think I am missing something but I completely do not understand
> > > that subthread that says use file marks and perform the work in 
> > > the vfs. The problem is that fundamentally we need multiple 
> > > mappings and I don't see a mark on a file (even an inherited 
> > > mark) providing the mapping so I don't see the point.
> > 
> > The point of the mark is that it's a statement by the system
> > administrator that the underlying subtree is safe to be mounted by 
> > an unprivileged container in the containers user view (i.e. with
> > current_user_ns() == s_user_ns).  For the unprivileged container
> > there's no real arbitrary s_user_ns use case because the 
> > unprivileged container must prove it can set up the mapping, so it 
> > would likely always be mounting from within a user_ns with the 
> > mapping it wanted.
> 
> As a statement that it is ok for the unprivileged mapping code to
> operate that seems reasonable.  I don't currently the need for such 
> an ok from the system adminstrator, but if you need it a flag that
> propagates to children and child directories seems reasonable.

The other way to do this is with an extended attribute.  I've played
around with that approach and quite like it: the advantage is that it's
sticky across system reboots; The down side is that it requires
additional VFS code to make sure you can't execute from the non-user_ns
view.

James



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-22 Thread James Bottomley
On Mon, 2017-02-20 at 17:24 +1300, Eric W. Biederman wrote:
> James Bottomley  writes:
> 
> > On Fri, 2017-02-17 at 14:57 +1300, Eric W. Biederman wrote:
> > > I think I am missing something but I completely do not understand
> > > that subthread that says use file marks and perform the work in 
> > > the vfs. The problem is that fundamentally we need multiple 
> > > mappings and I don't see a mark on a file (even an inherited 
> > > mark) providing the mapping so I don't see the point.
> > 
> > The point of the mark is that it's a statement by the system
> > administrator that the underlying subtree is safe to be mounted by 
> > an unprivileged container in the containers user view (i.e. with
> > current_user_ns() == s_user_ns).  For the unprivileged container
> > there's no real arbitrary s_user_ns use case because the 
> > unprivileged container must prove it can set up the mapping, so it 
> > would likely always be mounting from within a user_ns with the 
> > mapping it wanted.
> 
> As a statement that it is ok for the unprivileged mapping code to
> operate that seems reasonable.  I don't currently the need for such 
> an ok from the system adminstrator, but if you need it a flag that
> propagates to children and child directories seems reasonable.

The other way to do this is with an extended attribute.  I've played
around with that approach and quite like it: the advantage is that it's
sticky across system reboots; The down side is that it requires
additional VFS code to make sure you can't execute from the non-user_ns
view.

James



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-20 Thread J. R. Okajima
James Bottomley:
> With a different owner view, but that's irrelevant to the underlying
> inode.

Ok, the different ownership is limited within shitfs (or userns,
container). Good. I might forget that shiftfs wants to behave like
bind-mount.

And I noticed that shiftfs setattr() converts uid/gid before calling
backend fs' ->setattr(). It is good too.
But how about acl? Won't such conversion be necessary for acl too?


J. R. Okajima


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-20 Thread J. R. Okajima
James Bottomley:
> With a different owner view, but that's irrelevant to the underlying
> inode.

Ok, the different ownership is limited within shitfs (or userns,
container). Good. I might forget that shiftfs wants to behave like
bind-mount.

And I noticed that shiftfs setattr() converts uid/gid before calling
backend fs' ->setattr(). It is good too.
But how about acl? Won't such conversion be necessary for acl too?


J. R. Okajima


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-20 Thread James Bottomley
On Tue, 2017-02-21 at 11:57 +0900, J. R. Okajima wrote:
> James Bottomley:
> > I realised as I was trimming down the vestigial inode properties in 
> > the patch that actually shiftfs does use the i_ino from the 
> > underlying for userspace.  The reason why is that it comes from the 
> > getattr call in stat and that's fully what the underlying 
> > filesystem returns (including the inode number).
> 
> Let me make sure.
> - shiftfs has its own inode, but it will never be visible to 
> userspace. - the inode attr visible to users are equivalent to the 
> underlying one,   includeing dev:ino pair.
> right?

Yes, it behaves like a bind mount.

> If so, I am afraid it will make users confused. The dev:ino pair is a
> system-wide identity,

I don't believe it will, otherwise they'd have the same confusion over
a real bind mount.  The dev:inum pair identifies an inode.  An inode
may have many paths and shiftfs just adds a path.

>  but shiftfs creates the same dev:ino pair with different owner.

With a different owner view, but that's irrelevant to the underlying
inode.

>  Though I don't know whether the actual application or LSM exists or
> not who will be damaged by this situation.
> For git-status case which I wrote previously, it might not be a 
> problem as long as dev:ino is unchanged from git index.
> But such filesystem looks weird.

It behaves as much as possible like a bind mount and the user view is
standard behaviour, so it can't really be classified as "weird".  What
won't work like a classic bind mount in this scenario is NFS exporting,
but that's about the only thing.

James



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-20 Thread James Bottomley
On Tue, 2017-02-21 at 11:57 +0900, J. R. Okajima wrote:
> James Bottomley:
> > I realised as I was trimming down the vestigial inode properties in 
> > the patch that actually shiftfs does use the i_ino from the 
> > underlying for userspace.  The reason why is that it comes from the 
> > getattr call in stat and that's fully what the underlying 
> > filesystem returns (including the inode number).
> 
> Let me make sure.
> - shiftfs has its own inode, but it will never be visible to 
> userspace. - the inode attr visible to users are equivalent to the 
> underlying one,   includeing dev:ino pair.
> right?

Yes, it behaves like a bind mount.

> If so, I am afraid it will make users confused. The dev:ino pair is a
> system-wide identity,

I don't believe it will, otherwise they'd have the same confusion over
a real bind mount.  The dev:inum pair identifies an inode.  An inode
may have many paths and shiftfs just adds a path.

>  but shiftfs creates the same dev:ino pair with different owner.

With a different owner view, but that's irrelevant to the underlying
inode.

>  Though I don't know whether the actual application or LSM exists or
> not who will be damaged by this situation.
> For git-status case which I wrote previously, it might not be a 
> problem as long as dev:ino is unchanged from git index.
> But such filesystem looks weird.

It behaves as much as possible like a bind mount and the user view is
standard behaviour, so it can't really be classified as "weird".  What
won't work like a classic bind mount in this scenario is NFS exporting,
but that's about the only thing.

James



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-20 Thread J. R. Okajima
James Bottomley:
> I realised as I was trimming down the vestigial inode properties in the
> patch that actually shiftfs does use the i_ino from the underlying for
> userspace.  The reason why is that it comes from the getattr call in
> stat and that's fully what the underlying filesystem returns (including
> the inode number).

Let me make sure.
- shiftfs has its own inode, but it will never be visible to userspace.
- the inode attr visible to users are equivalent to the underlying one,
  includeing dev:ino pair.
right?
If so, I am afraid it will make users confused. The dev:ino pair is a
system-wide identity, but shiftfs creates the same dev:ino pair with
different owner. Though I don't know whether the actual application or
LSM exists or not who will be damaged by this situation.
For git-status case which I wrote previously, it might not be a problem
as long as dev:ino is unchanged from git index.
But such filesystem looks weird.


J. R. Okajima


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-20 Thread J. R. Okajima
James Bottomley:
> I realised as I was trimming down the vestigial inode properties in the
> patch that actually shiftfs does use the i_ino from the underlying for
> userspace.  The reason why is that it comes from the getattr call in
> stat and that's fully what the underlying filesystem returns (including
> the inode number).

Let me make sure.
- shiftfs has its own inode, but it will never be visible to userspace.
- the inode attr visible to users are equivalent to the underlying one,
  includeing dev:ino pair.
right?
If so, I am afraid it will make users confused. The dev:ino pair is a
system-wide identity, but shiftfs creates the same dev:ino pair with
different owner. Though I don't know whether the actual application or
LSM exists or not who will be damaged by this situation.
For git-status case which I wrote previously, it might not be a problem
as long as dev:ino is unchanged from git index.
But such filesystem looks weird.


J. R. Okajima


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-20 Thread James Bottomley
On Tue, 2017-02-07 at 01:24 +0900, J. R. Okajima wrote:
> James Bottomley:
> > Yes, I know the problem.  However, I believe most current linux
> > filesystems no longer guarantee stable, for the lifetime of the 
> > file, inode numbers.  The usual docker container root is overlayfs,
> > which, similarly doesn't support stable inode numbers.  I see the 
> > odd complaint about docker with overlayfs having unstable inode
> > numbers, but none seems to have any serious repercussions.
> 
> I think it serious.
> Reusing the backend fs' inum is a good approach which Amir wrote.
> Based on this, I'd suggest you to support the hardlinks.

I realised as I was trimming down the vestigial inode properties in the
patch that actually shiftfs does use the i_ino from the underlying for
userspace.  The reason why is that it comes from the getattr call in
stat and that's fully what the underlying filesystem returns (including
the inode number).

James



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-20 Thread James Bottomley
On Tue, 2017-02-07 at 01:24 +0900, J. R. Okajima wrote:
> James Bottomley:
> > Yes, I know the problem.  However, I believe most current linux
> > filesystems no longer guarantee stable, for the lifetime of the 
> > file, inode numbers.  The usual docker container root is overlayfs,
> > which, similarly doesn't support stable inode numbers.  I see the 
> > odd complaint about docker with overlayfs having unstable inode
> > numbers, but none seems to have any serious repercussions.
> 
> I think it serious.
> Reusing the backend fs' inum is a good approach which Amir wrote.
> Based on this, I'd suggest you to support the hardlinks.

I realised as I was trimming down the vestigial inode properties in the
patch that actually shiftfs does use the i_ino from the underlying for
userspace.  The reason why is that it comes from the getattr call in
stat and that's fully what the underlying filesystem returns (including
the inode number).

James



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-20 Thread James Bottomley
On Mon, 2017-02-20 at 14:26 -0500, Vivek Goyal wrote:
> On Sat, Feb 18, 2017 at 07:24:38PM -0800, James Bottomley wrote:
> 
> [..]
> > > > Yes, this is a known characteristic of stacked filesystems.  Is
> > > > there some magic I don't know about that would make it easier
> > > > to 
> > > > reflect hard links as aliases?
> > > 
> > > I think overlayfs had the same issue in the beginning and miklos
> > > fixed it.
> > > 
> > > commit 51f7e52dc943468c6929fa0a82d4afac3c8e9636
> > > Author: Miklos Szeredi 
> > > Date:   Fri Jul 29 12:05:24 2016 +0200
> > > 
> > > ovl: share inode for hard link
> > 
> > That's rather complex, but the principle is simple: use the inode
> > hash
> > for all upper inodes that may have aliases.  Aliasable means the
> > underlying inode isn't a directory and has i_nlink > 1, so all I
> > have
> > to do is perform a lookup through the hash if the underlying is
> > aliasable, invalidate the dentry in d_revalidate if the aliasing
> > conditions to the underlying change and manually handle hard links
> > and
> > it should all work.
> > 
> > Like this?
> 
> Sounds reasonable to me. I did basic testing and this seems to work
> for me.
> 
> In general, I am having random crashes. I just get following on 
> serial console
> 
> --[Cut Here]--
> 
> And nothing after that.

That's indicative of some hard lockup.  I don't see this, but I'm also
using a second laptop for testing, which is suboptimal.  I'm going to
try moving to xfstests inside a VM tomorrow (that's what long aeroplane
flights are for).

> Still trying to narrow down.

Thanks.  There've been a lot of patches flying around, so I'll do a
collected repost under a v2 header to make sure we're all in sync.

James



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-20 Thread James Bottomley
On Mon, 2017-02-20 at 14:26 -0500, Vivek Goyal wrote:
> On Sat, Feb 18, 2017 at 07:24:38PM -0800, James Bottomley wrote:
> 
> [..]
> > > > Yes, this is a known characteristic of stacked filesystems.  Is
> > > > there some magic I don't know about that would make it easier
> > > > to 
> > > > reflect hard links as aliases?
> > > 
> > > I think overlayfs had the same issue in the beginning and miklos
> > > fixed it.
> > > 
> > > commit 51f7e52dc943468c6929fa0a82d4afac3c8e9636
> > > Author: Miklos Szeredi 
> > > Date:   Fri Jul 29 12:05:24 2016 +0200
> > > 
> > > ovl: share inode for hard link
> > 
> > That's rather complex, but the principle is simple: use the inode
> > hash
> > for all upper inodes that may have aliases.  Aliasable means the
> > underlying inode isn't a directory and has i_nlink > 1, so all I
> > have
> > to do is perform a lookup through the hash if the underlying is
> > aliasable, invalidate the dentry in d_revalidate if the aliasing
> > conditions to the underlying change and manually handle hard links
> > and
> > it should all work.
> > 
> > Like this?
> 
> Sounds reasonable to me. I did basic testing and this seems to work
> for me.
> 
> In general, I am having random crashes. I just get following on 
> serial console
> 
> --[Cut Here]--
> 
> And nothing after that.

That's indicative of some hard lockup.  I don't see this, but I'm also
using a second laptop for testing, which is suboptimal.  I'm going to
try moving to xfstests inside a VM tomorrow (that's what long aeroplane
flights are for).

> Still trying to narrow down.

Thanks.  There've been a lot of patches flying around, so I'll do a
collected repost under a v2 header to make sure we're all in sync.

James



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-20 Thread Vivek Goyal
On Sat, Feb 18, 2017 at 07:24:38PM -0800, James Bottomley wrote:

[..]
> > > Yes, this is a known characteristic of stacked filesystems.  Is 
> > > there some magic I don't know about that would make it easier to 
> > > reflect hard links as aliases?
> > 
> > I think overlayfs had the same issue in the beginning and miklos
> > fixed it.
> > 
> > commit 51f7e52dc943468c6929fa0a82d4afac3c8e9636
> > Author: Miklos Szeredi 
> > Date:   Fri Jul 29 12:05:24 2016 +0200
> > 
> > ovl: share inode for hard link
> 
> That's rather complex, but the principle is simple: use the inode hash
> for all upper inodes that may have aliases.  Aliasable means the
> underlying inode isn't a directory and has i_nlink > 1, so all I have
> to do is perform a lookup through the hash if the underlying is
> aliasable, invalidate the dentry in d_revalidate if the aliasing
> conditions to the underlying change and manually handle hard links and
> it should all work.
> 
> Like this?

Sounds reasonable to me. I did basic testing and this seems to work for me.

In general, I am having random crashes. I just get following on serial
console

--[Cut Here]--

And nothing after that.

Still trying to narrow down.

Vivek


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-20 Thread Vivek Goyal
On Sat, Feb 18, 2017 at 07:24:38PM -0800, James Bottomley wrote:

[..]
> > > Yes, this is a known characteristic of stacked filesystems.  Is 
> > > there some magic I don't know about that would make it easier to 
> > > reflect hard links as aliases?
> > 
> > I think overlayfs had the same issue in the beginning and miklos
> > fixed it.
> > 
> > commit 51f7e52dc943468c6929fa0a82d4afac3c8e9636
> > Author: Miklos Szeredi 
> > Date:   Fri Jul 29 12:05:24 2016 +0200
> > 
> > ovl: share inode for hard link
> 
> That's rather complex, but the principle is simple: use the inode hash
> for all upper inodes that may have aliases.  Aliasable means the
> underlying inode isn't a directory and has i_nlink > 1, so all I have
> to do is perform a lookup through the hash if the underlying is
> aliasable, invalidate the dentry in d_revalidate if the aliasing
> conditions to the underlying change and manually handle hard links and
> it should all work.
> 
> Like this?

Sounds reasonable to me. I did basic testing and this seems to work for me.

In general, I am having random crashes. I just get following on serial
console

--[Cut Here]--

And nothing after that.

Still trying to narrow down.

Vivek


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-19 Thread Eric W. Biederman
James Bottomley  writes:

> On Fri, 2017-02-17 at 14:57 +1300, Eric W. Biederman wrote:
>> I think I am missing something but I completely do not understand 
>> that subthread that says use file marks and perform the work in the 
>> vfs. The problem is that fundamentally we need multiple mappings and 
>> I don't see a mark on a file (even an inherited mark) providing the 
>> mapping so I don't see the point.
>
> The point of the mark is that it's a statement by the system
> administrator that the underlying subtree is safe to be mounted by an
> unprivileged container in the containers user view (i.e. with
> current_user_ns() == s_user_ns).  For the unprivileged container
> there's no real arbitrary s_user_ns use case because the unprivileged
> container must prove it can set up the mapping, so it would likely
> always be mounting from within a user_ns with the mapping it wanted.

As a statement that it is ok for the unprivileged mapping code to
operate that seems reasonable.  I don't currently the need for such an
ok from the system adminstrator, but if you need it a flag that
propagates to children and child directories seems reasonable.

Eric


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-19 Thread Eric W. Biederman
James Bottomley  writes:

> On Fri, 2017-02-17 at 14:57 +1300, Eric W. Biederman wrote:
>> I think I am missing something but I completely do not understand 
>> that subthread that says use file marks and perform the work in the 
>> vfs. The problem is that fundamentally we need multiple mappings and 
>> I don't see a mark on a file (even an inherited mark) providing the 
>> mapping so I don't see the point.
>
> The point of the mark is that it's a statement by the system
> administrator that the underlying subtree is safe to be mounted by an
> unprivileged container in the containers user view (i.e. with
> current_user_ns() == s_user_ns).  For the unprivileged container
> there's no real arbitrary s_user_ns use case because the unprivileged
> container must prove it can set up the mapping, so it would likely
> always be mounting from within a user_ns with the mapping it wanted.

As a statement that it is ok for the unprivileged mapping code to
operate that seems reasonable.  I don't currently the need for such an
ok from the system adminstrator, but if you need it a flag that
propagates to children and child directories seems reasonable.

Eric


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-18 Thread James Bottomley
On Fri, 2017-02-17 at 15:35 -0500, Vivek Goyal wrote:
> On Fri, Feb 17, 2017 at 09:34:07AM -0800, James Bottomley wrote:
> > On Fri, 2017-02-17 at 02:55 +, Al Viro wrote:
> > > On Thu, Feb 16, 2017 at 07:56:30AM -0800, James Bottomley wrote:
> > > 
> > > > > Hi James,
> > > > > 
> > > > > Should it be "return d_splice_alias()" so that if we find an 
> > > > > alias it is returned back to caller and passed in dentry can 
> > > > > be freed. Though I don't know in what cases alias can be 
> > > > > found. And if alias is found how do we make sure alias_dentry
> > > > > ->d_fsdata is pointing to new (real dentry).
> > > > 
> > > > It probably should be for the sake of the pattern.  In our case 
> > > > I don't think we can have any root aliases because the root
> > > > dentry is always pinned in the cache, so cache lookup should 
> > > > always find it.
> > > 
> > > What does that have to do with root dentry?  The real reason why 
> > > that code works (FVerySVO) is that the damn thing allocates a new
> > > inode every time. Including the hardlinks, BTW.
> > 
> > Yes, this is a known characteristic of stacked filesystems.  Is 
> > there some magic I don't know about that would make it easier to 
> > reflect hard links as aliases?
> 
> I think overlayfs had the same issue in the beginning and miklos
> fixed it.
> 
> commit 51f7e52dc943468c6929fa0a82d4afac3c8e9636
> Author: Miklos Szeredi 
> Date:   Fri Jul 29 12:05:24 2016 +0200
> 
> ovl: share inode for hard link

That's rather complex, but the principle is simple: use the inode hash
for all upper inodes that may have aliases.  Aliasable means the
underlying inode isn't a directory and has i_nlink > 1, so all I have
to do is perform a lookup through the hash if the underlying is
aliasable, invalidate the dentry in d_revalidate if the aliasing
conditions to the underlying change and manually handle hard links and
it should all work.

Like this?

James

---

diff --git a/fs/shiftfs.c b/fs/shiftfs.c
index 5b50447..c659812 100644
--- a/fs/shiftfs.c
+++ b/fs/shiftfs.c
@@ -134,6 +134,7 @@ static int shiftfs_d_weak_revalidate(struct dentry *dentry, 
unsigned int flags)
 static int shiftfs_d_revalidate(struct dentry *dentry, unsigned int flags)
 {
struct dentry *real = dentry->d_fsdata;
+   struct inode *reali = d_inode(real), *inode = d_inode(dentry);
int ret;
 
if (d_unhashed(real))
@@ -146,6 +147,15 @@ static int shiftfs_d_revalidate(struct dentry *dentry, 
unsigned int flags)
if (d_is_negative(real) != d_is_negative(dentry))
return 0;
 
+   /*
+* non dir link count is > 1 and our inode is currently not in
+* the inode hash => need to drop and reget our dentry to make
+* sure we're aliasing it correctly.
+*/
+   if (reali &&!S_ISDIR(reali->i_mode) && reali->i_nlink > 1 &&
+   (!inode || inode_unhashed(inode)))
+   return 0;
+
if (!(real->d_flags & DCACHE_OP_REVALIDATE))
return 1;
 
@@ -285,7 +295,8 @@ static int shiftfs_make_object(struct inode *dir, struct 
dentry *dentry,
   umode_t mode, const char *symlink,
   struct dentry *hardlink, bool excl)
 {
-   struct dentry *real = dir->i_private, *new = dentry->d_fsdata;
+   struct dentry *real = dir->i_private, *new = dentry->d_fsdata,
+   *realhardlink = NULL;
struct inode *reali = real->d_inode, *newi;
const struct inode_operations *iop = reali->i_op;
int err;
@@ -293,6 +304,7 @@ static int shiftfs_make_object(struct inode *dir, struct 
dentry *dentry,
bool op_ok = false;
 
if (hardlink) {
+   realhardlink = hardlink->d_fsdata;
op_ok = iop->link;
} else {
switch (mode & S_IFMT) {
@@ -310,7 +322,7 @@ static int shiftfs_make_object(struct inode *dir, struct 
dentry *dentry,
return -EINVAL;
 
 
-   newi = shiftfs_new_inode(dentry->d_sb, mode, NULL);
+   newi = shiftfs_new_inode(dentry->d_sb, mode, realhardlink);
if (!newi)
return -ENOMEM;
 
@@ -320,8 +332,6 @@ static int shiftfs_make_object(struct inode *dir, struct 
dentry *dentry,
 
err = -EINVAL;  /* shut gcc up about uninit var */
if (hardlink) {
-   struct dentry *realhardlink = hardlink->d_fsdata;
-
err = vfs_link(realhardlink, reali, new, NULL);
} else {
switch (mode & S_IFMT) {
@@ -341,7 +351,16 @@ static int shiftfs_make_object(struct inode *dir, struct 
dentry *dentry,
if (err)
goto out_dput;
 
-   shiftfs_fill_inode(newi, new);
+   if (!hardlink)
+   shiftfs_fill_inode(newi, new);
+   else if (inode_unhashed(newi) && !S_ISDIR(newi->i_mode))
+   /*
+* although dentry and hardlink now each point to
+* newi, the link 

Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-18 Thread James Bottomley
On Fri, 2017-02-17 at 15:35 -0500, Vivek Goyal wrote:
> On Fri, Feb 17, 2017 at 09:34:07AM -0800, James Bottomley wrote:
> > On Fri, 2017-02-17 at 02:55 +, Al Viro wrote:
> > > On Thu, Feb 16, 2017 at 07:56:30AM -0800, James Bottomley wrote:
> > > 
> > > > > Hi James,
> > > > > 
> > > > > Should it be "return d_splice_alias()" so that if we find an 
> > > > > alias it is returned back to caller and passed in dentry can 
> > > > > be freed. Though I don't know in what cases alias can be 
> > > > > found. And if alias is found how do we make sure alias_dentry
> > > > > ->d_fsdata is pointing to new (real dentry).
> > > > 
> > > > It probably should be for the sake of the pattern.  In our case 
> > > > I don't think we can have any root aliases because the root
> > > > dentry is always pinned in the cache, so cache lookup should 
> > > > always find it.
> > > 
> > > What does that have to do with root dentry?  The real reason why 
> > > that code works (FVerySVO) is that the damn thing allocates a new
> > > inode every time. Including the hardlinks, BTW.
> > 
> > Yes, this is a known characteristic of stacked filesystems.  Is 
> > there some magic I don't know about that would make it easier to 
> > reflect hard links as aliases?
> 
> I think overlayfs had the same issue in the beginning and miklos
> fixed it.
> 
> commit 51f7e52dc943468c6929fa0a82d4afac3c8e9636
> Author: Miklos Szeredi 
> Date:   Fri Jul 29 12:05:24 2016 +0200
> 
> ovl: share inode for hard link

That's rather complex, but the principle is simple: use the inode hash
for all upper inodes that may have aliases.  Aliasable means the
underlying inode isn't a directory and has i_nlink > 1, so all I have
to do is perform a lookup through the hash if the underlying is
aliasable, invalidate the dentry in d_revalidate if the aliasing
conditions to the underlying change and manually handle hard links and
it should all work.

Like this?

James

---

diff --git a/fs/shiftfs.c b/fs/shiftfs.c
index 5b50447..c659812 100644
--- a/fs/shiftfs.c
+++ b/fs/shiftfs.c
@@ -134,6 +134,7 @@ static int shiftfs_d_weak_revalidate(struct dentry *dentry, 
unsigned int flags)
 static int shiftfs_d_revalidate(struct dentry *dentry, unsigned int flags)
 {
struct dentry *real = dentry->d_fsdata;
+   struct inode *reali = d_inode(real), *inode = d_inode(dentry);
int ret;
 
if (d_unhashed(real))
@@ -146,6 +147,15 @@ static int shiftfs_d_revalidate(struct dentry *dentry, 
unsigned int flags)
if (d_is_negative(real) != d_is_negative(dentry))
return 0;
 
+   /*
+* non dir link count is > 1 and our inode is currently not in
+* the inode hash => need to drop and reget our dentry to make
+* sure we're aliasing it correctly.
+*/
+   if (reali &&!S_ISDIR(reali->i_mode) && reali->i_nlink > 1 &&
+   (!inode || inode_unhashed(inode)))
+   return 0;
+
if (!(real->d_flags & DCACHE_OP_REVALIDATE))
return 1;
 
@@ -285,7 +295,8 @@ static int shiftfs_make_object(struct inode *dir, struct 
dentry *dentry,
   umode_t mode, const char *symlink,
   struct dentry *hardlink, bool excl)
 {
-   struct dentry *real = dir->i_private, *new = dentry->d_fsdata;
+   struct dentry *real = dir->i_private, *new = dentry->d_fsdata,
+   *realhardlink = NULL;
struct inode *reali = real->d_inode, *newi;
const struct inode_operations *iop = reali->i_op;
int err;
@@ -293,6 +304,7 @@ static int shiftfs_make_object(struct inode *dir, struct 
dentry *dentry,
bool op_ok = false;
 
if (hardlink) {
+   realhardlink = hardlink->d_fsdata;
op_ok = iop->link;
} else {
switch (mode & S_IFMT) {
@@ -310,7 +322,7 @@ static int shiftfs_make_object(struct inode *dir, struct 
dentry *dentry,
return -EINVAL;
 
 
-   newi = shiftfs_new_inode(dentry->d_sb, mode, NULL);
+   newi = shiftfs_new_inode(dentry->d_sb, mode, realhardlink);
if (!newi)
return -ENOMEM;
 
@@ -320,8 +332,6 @@ static int shiftfs_make_object(struct inode *dir, struct 
dentry *dentry,
 
err = -EINVAL;  /* shut gcc up about uninit var */
if (hardlink) {
-   struct dentry *realhardlink = hardlink->d_fsdata;
-
err = vfs_link(realhardlink, reali, new, NULL);
} else {
switch (mode & S_IFMT) {
@@ -341,7 +351,16 @@ static int shiftfs_make_object(struct inode *dir, struct 
dentry *dentry,
if (err)
goto out_dput;
 
-   shiftfs_fill_inode(newi, new);
+   if (!hardlink)
+   shiftfs_fill_inode(newi, new);
+   else if (inode_unhashed(newi) && !S_ISDIR(newi->i_mode))
+   /*
+* although dentry and hardlink now each point to
+* newi, the link count was 1 when they 

Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-17 Thread James Bottomley
On Fri, 2017-02-17 at 17:51 +, Al Viro wrote:
> On Fri, Feb 17, 2017 at 09:24:40AM -0800, James Bottomley wrote:
> 
> > > What happens when somebody comes along and creates the damn thing
> > > on 
> > > the underlying fs?  _Not_ via your code, that is - using the 
> > > underlying fs mounted elsewhere.
> > 
> > Point taken.  This, I think fixes the dcache revalidation issue.
> 
> No, it doesn't.  Consider a local filesystem.  Those do not have any
> ->d_revalidate() - the kernel bloody well knows what happens to
> directories.  If e.g. a previously absent file gets created, it's
> been done by the kernel itself and dentry has been made positive; if
> a previously existing file has been removed, dentry has either become
> negative or, if it had been pinned (e.g. file was opened at the time,
> or your code had been holding a reference to it, etc.) it will be 
> unhashed so that new lookups won't find it, etc.  No need to 
> revalidate anything.
> 
> Now, consider your code.  You've done a lookup in the underlying fs.
> It has, at the time, come negative, so you have your (negative) 
> dentry pointing to that on the underlying fs.  If somebody comes and 
> does e.g. mkdir() via your fs, it will call vfs_mkdir() on the 
> underlying sucker, hopefully turning it positive and associate a new 
> in-core inode with your previously negative dentry.  But what happens 
> if mkdir is done via underlying fs, or via another instance of yours 
> over the same tree?
> Underlying dentry goes positive; yours is still negative.  The 
> underlying fs either doesn't have ->d_revalidate() or, if there is 
> one it says that the underlying dentry is valid, thank you very much, 
> no need to invalidate anything.
> 
> In other words, your patch does nothing for object getting created.

Right at the moment, the upper layer doesn't cache negative dentries,
but that's only a partial solution.  I assume you'd now like me to
cache negative dentries (principle of least surprise) and handle the
underlying negative to positive transition in d_revalidate?

I can do that.

James

---

diff --git a/fs/shiftfs.c b/fs/shiftfs.c
index a4a1f98..5b50447 100644
--- a/fs/shiftfs.c
+++ b/fs/shiftfs.c
@@ -118,9 +118,50 @@ static struct dentry *shiftfs_d_real(struct dentry *dentry,
return real;
 }
 
+static int shiftfs_d_weak_revalidate(struct dentry *dentry, unsigned int flags)
+{
+   struct dentry *real = dentry->d_fsdata;
+
+   if (d_unhashed(real))
+   return 0;
+
+   if (!(real->d_flags & DCACHE_OP_WEAK_REVALIDATE))
+   return 1;
+
+   return real->d_op->d_weak_revalidate(real, flags);
+}
+
+static int shiftfs_d_revalidate(struct dentry *dentry, unsigned int flags)
+{
+   struct dentry *real = dentry->d_fsdata;
+   int ret;
+
+   if (d_unhashed(real))
+   return 0;
+
+   /*
+* inode state of underlying changed from positive to negative
+* or vice versa; force a lookup to update our view
+*/
+   if (d_is_negative(real) != d_is_negative(dentry))
+   return 0;
+
+   if (!(real->d_flags & DCACHE_OP_REVALIDATE))
+   return 1;
+
+   ret = real->d_op->d_revalidate(real, flags);
+
+   if (ret == 0 && !(flags & LOOKUP_RCU))
+   d_invalidate(real);
+
+   return ret;
+}
+
 static const struct dentry_operations shiftfs_dentry_ops = {
.d_release  = shiftfs_d_release,
.d_real = shiftfs_d_real,
+   .d_revalidate   = shiftfs_d_revalidate,
+   .d_weak_revalidate = shiftfs_d_weak_revalidate,
 };
 
 static int shiftfs_readlink(struct dentry *dentry, char __user *data,
@@ -423,7 +464,7 @@ static struct dentry *shiftfs_lookup(struct inode *dir, 
struct dentry *dentry,
dentry->d_fsdata = new;
 
if (!new->d_inode)
-   return NULL;
+   goto out;
 
newi = shiftfs_new_inode(dentry->d_sb, new->d_inode->i_mode, new);
if (!newi) {
@@ -431,9 +472,8 @@ static struct dentry *shiftfs_lookup(struct inode *dir, 
struct dentry *dentry,
return ERR_PTR(-ENOMEM);
}
 
-   d_splice_alias(newi, dentry);
-
-   return NULL;
+ out:
+   return d_splice_alias(newi, dentry);
 }
 
 static int shiftfs_permission(struct inode *inode, int mask)



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-17 Thread James Bottomley
On Fri, 2017-02-17 at 17:51 +, Al Viro wrote:
> On Fri, Feb 17, 2017 at 09:24:40AM -0800, James Bottomley wrote:
> 
> > > What happens when somebody comes along and creates the damn thing
> > > on 
> > > the underlying fs?  _Not_ via your code, that is - using the 
> > > underlying fs mounted elsewhere.
> > 
> > Point taken.  This, I think fixes the dcache revalidation issue.
> 
> No, it doesn't.  Consider a local filesystem.  Those do not have any
> ->d_revalidate() - the kernel bloody well knows what happens to
> directories.  If e.g. a previously absent file gets created, it's
> been done by the kernel itself and dentry has been made positive; if
> a previously existing file has been removed, dentry has either become
> negative or, if it had been pinned (e.g. file was opened at the time,
> or your code had been holding a reference to it, etc.) it will be 
> unhashed so that new lookups won't find it, etc.  No need to 
> revalidate anything.
> 
> Now, consider your code.  You've done a lookup in the underlying fs.
> It has, at the time, come negative, so you have your (negative) 
> dentry pointing to that on the underlying fs.  If somebody comes and 
> does e.g. mkdir() via your fs, it will call vfs_mkdir() on the 
> underlying sucker, hopefully turning it positive and associate a new 
> in-core inode with your previously negative dentry.  But what happens 
> if mkdir is done via underlying fs, or via another instance of yours 
> over the same tree?
> Underlying dentry goes positive; yours is still negative.  The 
> underlying fs either doesn't have ->d_revalidate() or, if there is 
> one it says that the underlying dentry is valid, thank you very much, 
> no need to invalidate anything.
> 
> In other words, your patch does nothing for object getting created.

Right at the moment, the upper layer doesn't cache negative dentries,
but that's only a partial solution.  I assume you'd now like me to
cache negative dentries (principle of least surprise) and handle the
underlying negative to positive transition in d_revalidate?

I can do that.

James

---

diff --git a/fs/shiftfs.c b/fs/shiftfs.c
index a4a1f98..5b50447 100644
--- a/fs/shiftfs.c
+++ b/fs/shiftfs.c
@@ -118,9 +118,50 @@ static struct dentry *shiftfs_d_real(struct dentry *dentry,
return real;
 }
 
+static int shiftfs_d_weak_revalidate(struct dentry *dentry, unsigned int flags)
+{
+   struct dentry *real = dentry->d_fsdata;
+
+   if (d_unhashed(real))
+   return 0;
+
+   if (!(real->d_flags & DCACHE_OP_WEAK_REVALIDATE))
+   return 1;
+
+   return real->d_op->d_weak_revalidate(real, flags);
+}
+
+static int shiftfs_d_revalidate(struct dentry *dentry, unsigned int flags)
+{
+   struct dentry *real = dentry->d_fsdata;
+   int ret;
+
+   if (d_unhashed(real))
+   return 0;
+
+   /*
+* inode state of underlying changed from positive to negative
+* or vice versa; force a lookup to update our view
+*/
+   if (d_is_negative(real) != d_is_negative(dentry))
+   return 0;
+
+   if (!(real->d_flags & DCACHE_OP_REVALIDATE))
+   return 1;
+
+   ret = real->d_op->d_revalidate(real, flags);
+
+   if (ret == 0 && !(flags & LOOKUP_RCU))
+   d_invalidate(real);
+
+   return ret;
+}
+
 static const struct dentry_operations shiftfs_dentry_ops = {
.d_release  = shiftfs_d_release,
.d_real = shiftfs_d_real,
+   .d_revalidate   = shiftfs_d_revalidate,
+   .d_weak_revalidate = shiftfs_d_weak_revalidate,
 };
 
 static int shiftfs_readlink(struct dentry *dentry, char __user *data,
@@ -423,7 +464,7 @@ static struct dentry *shiftfs_lookup(struct inode *dir, 
struct dentry *dentry,
dentry->d_fsdata = new;
 
if (!new->d_inode)
-   return NULL;
+   goto out;
 
newi = shiftfs_new_inode(dentry->d_sb, new->d_inode->i_mode, new);
if (!newi) {
@@ -431,9 +472,8 @@ static struct dentry *shiftfs_lookup(struct inode *dir, 
struct dentry *dentry,
return ERR_PTR(-ENOMEM);
}
 
-   d_splice_alias(newi, dentry);
-
-   return NULL;
+ out:
+   return d_splice_alias(newi, dentry);
 }
 
 static int shiftfs_permission(struct inode *inode, int mask)



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-17 Thread Vivek Goyal
On Fri, Feb 17, 2017 at 09:34:07AM -0800, James Bottomley wrote:
> On Fri, 2017-02-17 at 02:55 +, Al Viro wrote:
> > On Thu, Feb 16, 2017 at 07:56:30AM -0800, James Bottomley wrote:
> > 
> > > > Hi James,
> > > > 
> > > > Should it be "return d_splice_alias()" so that if we find an 
> > > > alias it is returned back to caller and passed in dentry can be 
> > > > freed. Though I don't know in what cases alias can be found. And 
> > > > if alias is found how do we make sure alias_dentry->d_fsdata is 
> > > > pointing to new (real dentry).
> > > 
> > > It probably should be for the sake of the pattern.  In our case I 
> > > don't think we can have any root aliases because the root dentry is
> > > always pinned in the cache, so cache lookup should always find it.
> > 
> > What does that have to do with root dentry?  The real reason why that 
> > code works (FVerySVO) is that the damn thing allocates a new inode 
> > every time. Including the hardlinks, BTW.
> 
> Yes, this is a known characteristic of stacked filesystems.  Is there
> some magic I don't know about that would make it easier to reflect hard
> links as aliases?

I think overlayfs had the same issue in the beginning and miklos fixed it.

commit 51f7e52dc943468c6929fa0a82d4afac3c8e9636
Author: Miklos Szeredi 
Date:   Fri Jul 29 12:05:24 2016 +0200

ovl: share inode for hard link

Vivek


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-17 Thread Vivek Goyal
On Fri, Feb 17, 2017 at 09:34:07AM -0800, James Bottomley wrote:
> On Fri, 2017-02-17 at 02:55 +, Al Viro wrote:
> > On Thu, Feb 16, 2017 at 07:56:30AM -0800, James Bottomley wrote:
> > 
> > > > Hi James,
> > > > 
> > > > Should it be "return d_splice_alias()" so that if we find an 
> > > > alias it is returned back to caller and passed in dentry can be 
> > > > freed. Though I don't know in what cases alias can be found. And 
> > > > if alias is found how do we make sure alias_dentry->d_fsdata is 
> > > > pointing to new (real dentry).
> > > 
> > > It probably should be for the sake of the pattern.  In our case I 
> > > don't think we can have any root aliases because the root dentry is
> > > always pinned in the cache, so cache lookup should always find it.
> > 
> > What does that have to do with root dentry?  The real reason why that 
> > code works (FVerySVO) is that the damn thing allocates a new inode 
> > every time. Including the hardlinks, BTW.
> 
> Yes, this is a known characteristic of stacked filesystems.  Is there
> some magic I don't know about that would make it easier to reflect hard
> links as aliases?

I think overlayfs had the same issue in the beginning and miklos fixed it.

commit 51f7e52dc943468c6929fa0a82d4afac3c8e9636
Author: Miklos Szeredi 
Date:   Fri Jul 29 12:05:24 2016 +0200

ovl: share inode for hard link

Vivek


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-17 Thread Vivek Goyal
On Fri, Feb 17, 2017 at 05:51:18PM +, Al Viro wrote:
> On Fri, Feb 17, 2017 at 09:24:40AM -0800, James Bottomley wrote:
> 
> > > What happens when somebody comes along and creates the damn thing on 
> > > the underlying fs?  _Not_ via your code, that is - using the 
> > > underlying fs mounted elsewhere.
> > 
> > Point taken.  This, I think fixes the dcache revalidation issue.
> 
> No, it doesn't.  Consider a local filesystem.  Those do not have any
> ->d_revalidate() - the kernel bloody well knows what happens to
> directories.  If e.g. a previously absent file gets created, it's
> been done by the kernel itself and dentry has been made positive; if
> a previously existing file has been removed, dentry has either become
> negative or, if it had been pinned (e.g. file was opened at the time,
> or your code had been holding a reference to it, etc.) it will be unhashed
> so that new lookups won't find it, etc.  No need to revalidate anything.
> 
> Now, consider your code.  You've done a lookup in the underlying fs.
> It has, at the time, come negative, so you have your (negative) dentry
> pointing to that on the underlying fs.  If somebody comes and does
> e.g. mkdir() via your fs, it will call vfs_mkdir() on the underlying
> sucker, hopefully turning it positive and associate a new in-core inode
> with your previously negative dentry.  But what happens if mkdir is done
> via underlying fs, or via another instance of yours over the same tree?
> Underlying dentry goes positive; yours is still negative.  The underlying
> fs either doesn't have ->d_revalidate() or, if there is one it says that
> the underlying dentry is valid, thank you very much, no need to invalidate
> anything.
> 
> In other words, your patch does nothing for object getting created.

I thought assumption here is that underlying subtree is not changed
outside of shiftfs. IIUC, overlayfs has the same assumption.

Two shiftfs instances writing to same dir will be a problem though.

Vivek


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-17 Thread Vivek Goyal
On Fri, Feb 17, 2017 at 05:51:18PM +, Al Viro wrote:
> On Fri, Feb 17, 2017 at 09:24:40AM -0800, James Bottomley wrote:
> 
> > > What happens when somebody comes along and creates the damn thing on 
> > > the underlying fs?  _Not_ via your code, that is - using the 
> > > underlying fs mounted elsewhere.
> > 
> > Point taken.  This, I think fixes the dcache revalidation issue.
> 
> No, it doesn't.  Consider a local filesystem.  Those do not have any
> ->d_revalidate() - the kernel bloody well knows what happens to
> directories.  If e.g. a previously absent file gets created, it's
> been done by the kernel itself and dentry has been made positive; if
> a previously existing file has been removed, dentry has either become
> negative or, if it had been pinned (e.g. file was opened at the time,
> or your code had been holding a reference to it, etc.) it will be unhashed
> so that new lookups won't find it, etc.  No need to revalidate anything.
> 
> Now, consider your code.  You've done a lookup in the underlying fs.
> It has, at the time, come negative, so you have your (negative) dentry
> pointing to that on the underlying fs.  If somebody comes and does
> e.g. mkdir() via your fs, it will call vfs_mkdir() on the underlying
> sucker, hopefully turning it positive and associate a new in-core inode
> with your previously negative dentry.  But what happens if mkdir is done
> via underlying fs, or via another instance of yours over the same tree?
> Underlying dentry goes positive; yours is still negative.  The underlying
> fs either doesn't have ->d_revalidate() or, if there is one it says that
> the underlying dentry is valid, thank you very much, no need to invalidate
> anything.
> 
> In other words, your patch does nothing for object getting created.

I thought assumption here is that underlying subtree is not changed
outside of shiftfs. IIUC, overlayfs has the same assumption.

Two shiftfs instances writing to same dir will be a problem though.

Vivek


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-17 Thread Al Viro
On Fri, Feb 17, 2017 at 09:24:40AM -0800, James Bottomley wrote:

> > What happens when somebody comes along and creates the damn thing on 
> > the underlying fs?  _Not_ via your code, that is - using the 
> > underlying fs mounted elsewhere.
> 
> Point taken.  This, I think fixes the dcache revalidation issue.

No, it doesn't.  Consider a local filesystem.  Those do not have any
->d_revalidate() - the kernel bloody well knows what happens to
directories.  If e.g. a previously absent file gets created, it's
been done by the kernel itself and dentry has been made positive; if
a previously existing file has been removed, dentry has either become
negative or, if it had been pinned (e.g. file was opened at the time,
or your code had been holding a reference to it, etc.) it will be unhashed
so that new lookups won't find it, etc.  No need to revalidate anything.

Now, consider your code.  You've done a lookup in the underlying fs.
It has, at the time, come negative, so you have your (negative) dentry
pointing to that on the underlying fs.  If somebody comes and does
e.g. mkdir() via your fs, it will call vfs_mkdir() on the underlying
sucker, hopefully turning it positive and associate a new in-core inode
with your previously negative dentry.  But what happens if mkdir is done
via underlying fs, or via another instance of yours over the same tree?
Underlying dentry goes positive; yours is still negative.  The underlying
fs either doesn't have ->d_revalidate() or, if there is one it says that
the underlying dentry is valid, thank you very much, no need to invalidate
anything.

In other words, your patch does nothing for object getting created.


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-17 Thread Al Viro
On Fri, Feb 17, 2017 at 09:24:40AM -0800, James Bottomley wrote:

> > What happens when somebody comes along and creates the damn thing on 
> > the underlying fs?  _Not_ via your code, that is - using the 
> > underlying fs mounted elsewhere.
> 
> Point taken.  This, I think fixes the dcache revalidation issue.

No, it doesn't.  Consider a local filesystem.  Those do not have any
->d_revalidate() - the kernel bloody well knows what happens to
directories.  If e.g. a previously absent file gets created, it's
been done by the kernel itself and dentry has been made positive; if
a previously existing file has been removed, dentry has either become
negative or, if it had been pinned (e.g. file was opened at the time,
or your code had been holding a reference to it, etc.) it will be unhashed
so that new lookups won't find it, etc.  No need to revalidate anything.

Now, consider your code.  You've done a lookup in the underlying fs.
It has, at the time, come negative, so you have your (negative) dentry
pointing to that on the underlying fs.  If somebody comes and does
e.g. mkdir() via your fs, it will call vfs_mkdir() on the underlying
sucker, hopefully turning it positive and associate a new in-core inode
with your previously negative dentry.  But what happens if mkdir is done
via underlying fs, or via another instance of yours over the same tree?
Underlying dentry goes positive; yours is still negative.  The underlying
fs either doesn't have ->d_revalidate() or, if there is one it says that
the underlying dentry is valid, thank you very much, no need to invalidate
anything.

In other words, your patch does nothing for object getting created.


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-17 Thread James Bottomley
On Fri, 2017-02-17 at 02:55 +, Al Viro wrote:
> On Thu, Feb 16, 2017 at 07:56:30AM -0800, James Bottomley wrote:
> 
> > > Hi James,
> > > 
> > > Should it be "return d_splice_alias()" so that if we find an 
> > > alias it is returned back to caller and passed in dentry can be 
> > > freed. Though I don't know in what cases alias can be found. And 
> > > if alias is found how do we make sure alias_dentry->d_fsdata is 
> > > pointing to new (real dentry).
> > 
> > It probably should be for the sake of the pattern.  In our case I 
> > don't think we can have any root aliases because the root dentry is
> > always pinned in the cache, so cache lookup should always find it.
> 
> What does that have to do with root dentry?  The real reason why that 
> code works (FVerySVO) is that the damn thing allocates a new inode 
> every time. Including the hardlinks, BTW.

Yes, this is a known characteristic of stacked filesystems.  Is there
some magic I don't know about that would make it easier to reflect hard
links as aliases?

>   So d_splice_alias() will always return NULL - there's no way for 
> any dentries to be pointing to in-core struct inode you've
> just allocated.  Short of a use-after-free, that is...
> 
> Unless I'm missing something subtle, the whole thing is fucked
> in head wrt cache coherency - its dentries are blindly assumed to be
> forever valid, no matter what's happening with the underlying 
> filesystem.

Hopefully the patch in the previous email fixes this.

James



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-17 Thread James Bottomley
On Fri, 2017-02-17 at 02:55 +, Al Viro wrote:
> On Thu, Feb 16, 2017 at 07:56:30AM -0800, James Bottomley wrote:
> 
> > > Hi James,
> > > 
> > > Should it be "return d_splice_alias()" so that if we find an 
> > > alias it is returned back to caller and passed in dentry can be 
> > > freed. Though I don't know in what cases alias can be found. And 
> > > if alias is found how do we make sure alias_dentry->d_fsdata is 
> > > pointing to new (real dentry).
> > 
> > It probably should be for the sake of the pattern.  In our case I 
> > don't think we can have any root aliases because the root dentry is
> > always pinned in the cache, so cache lookup should always find it.
> 
> What does that have to do with root dentry?  The real reason why that 
> code works (FVerySVO) is that the damn thing allocates a new inode 
> every time. Including the hardlinks, BTW.

Yes, this is a known characteristic of stacked filesystems.  Is there
some magic I don't know about that would make it easier to reflect hard
links as aliases?

>   So d_splice_alias() will always return NULL - there's no way for 
> any dentries to be pointing to in-core struct inode you've
> just allocated.  Short of a use-after-free, that is...
> 
> Unless I'm missing something subtle, the whole thing is fucked
> in head wrt cache coherency - its dentries are blindly assumed to be
> forever valid, no matter what's happening with the underlying 
> filesystem.

Hopefully the patch in the previous email fixes this.

James



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-17 Thread James Bottomley
On Fri, 2017-02-17 at 02:29 +, Al Viro wrote:
> On Sat, Feb 04, 2017 at 11:19:32AM -0800, James Bottomley wrote:
> 
> > +static const struct dentry_operations shiftfs_dentry_ops = {
> > +   .d_release  = shiftfs_d_release,
> > +   .d_real = shiftfs_d_real,
> > +};
> 
> In other words, those dentries are *never* revalidated.  Nevermind 
> that underlying fs might be mounted elsewhere and be actively 
> modified under you.
> 
> > +static struct dentry *shiftfs_lookup(struct inode *dir, struct
> > dentry *dentry,
> > +unsigned int flags)
> > +{
> > +   struct dentry *real = dir->i_private, *new;
> > +   struct inode *reali = real->d_inode, *newi;
> > +   const struct cred *oldcred, *newcred;
> > +
> > +   inode_lock(reali);
> > +   oldcred = shiftfs_new_creds(, dentry->d_sb);
> > +   new = lookup_one_len(dentry->d_name.name, real, dentry
> > ->d_name.len);
> > +   shiftfs_old_creds(oldcred, );
> > +   inode_unlock(reali);
> > +
> > +   if (IS_ERR(new))
> > +   return new;
> > +
> > +   dentry->d_fsdata = new;
> > +
> > +   if (!new->d_inode)
> > +   return NULL;
> 
> What happens when somebody comes along and creates the damn thing on 
> the underlying fs?  _Not_ via your code, that is - using the 
> underlying fs mounted elsewhere.

Point taken.  This, I think fixes the dcache revalidation issue.

James

---

diff --git a/fs/shiftfs.c b/fs/shiftfs.c
index a4a1f98..1e71efe 100644
--- a/fs/shiftfs.c
+++ b/fs/shiftfs.c
@@ -118,9 +118,43 @@ static struct dentry *shiftfs_d_real(struct dentry *dentry,
return real;
 }
 
+static int shiftfs_d_weak_revalidate(struct dentry *dentry, unsigned int flags)
+{
+   struct dentry *real = dentry->d_fsdata;
+
+   if (d_unhashed(real))
+   return 0;
+
+   if (!(real->d_flags & DCACHE_OP_WEAK_REVALIDATE))
+   return 1;
+
+   return real->d_op->d_weak_revalidate(real, flags);
+}
+
+static int shiftfs_d_revalidate(struct dentry *dentry, unsigned int flags)
+{
+   struct dentry *real = dentry->d_fsdata;
+   int ret;
+
+   if (d_unhashed(real))
+   return 0;
+
+   if (!(real->d_flags & DCACHE_OP_REVALIDATE))
+   return 1;
+
+   ret = real->d_op->d_revalidate(real, flags);
+
+   if (ret == 0 && !(flags & LOOKUP_RCU))
+   d_invalidate(real);
+
+   return ret;
+}
+
 static const struct dentry_operations shiftfs_dentry_ops = {
.d_release  = shiftfs_d_release,
.d_real = shiftfs_d_real,
+   .d_revalidate   = shiftfs_d_revalidate,
+   .d_weak_revalidate = shiftfs_d_weak_revalidate,
 };
 
 static int shiftfs_readlink(struct dentry *dentry, char __user *data,
@@ -431,9 +465,7 @@ static struct dentry *shiftfs_lookup(struct inode *dir, 
struct dentry *dentry,
return ERR_PTR(-ENOMEM);
}
 
-   d_splice_alias(newi, dentry);
-
-   return NULL;
+   return d_splice_alias(newi, dentry);
 }
 
 static int shiftfs_permission(struct inode *inode, int mask)


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-17 Thread James Bottomley
On Fri, 2017-02-17 at 02:29 +, Al Viro wrote:
> On Sat, Feb 04, 2017 at 11:19:32AM -0800, James Bottomley wrote:
> 
> > +static const struct dentry_operations shiftfs_dentry_ops = {
> > +   .d_release  = shiftfs_d_release,
> > +   .d_real = shiftfs_d_real,
> > +};
> 
> In other words, those dentries are *never* revalidated.  Nevermind 
> that underlying fs might be mounted elsewhere and be actively 
> modified under you.
> 
> > +static struct dentry *shiftfs_lookup(struct inode *dir, struct
> > dentry *dentry,
> > +unsigned int flags)
> > +{
> > +   struct dentry *real = dir->i_private, *new;
> > +   struct inode *reali = real->d_inode, *newi;
> > +   const struct cred *oldcred, *newcred;
> > +
> > +   inode_lock(reali);
> > +   oldcred = shiftfs_new_creds(, dentry->d_sb);
> > +   new = lookup_one_len(dentry->d_name.name, real, dentry
> > ->d_name.len);
> > +   shiftfs_old_creds(oldcred, );
> > +   inode_unlock(reali);
> > +
> > +   if (IS_ERR(new))
> > +   return new;
> > +
> > +   dentry->d_fsdata = new;
> > +
> > +   if (!new->d_inode)
> > +   return NULL;
> 
> What happens when somebody comes along and creates the damn thing on 
> the underlying fs?  _Not_ via your code, that is - using the 
> underlying fs mounted elsewhere.

Point taken.  This, I think fixes the dcache revalidation issue.

James

---

diff --git a/fs/shiftfs.c b/fs/shiftfs.c
index a4a1f98..1e71efe 100644
--- a/fs/shiftfs.c
+++ b/fs/shiftfs.c
@@ -118,9 +118,43 @@ static struct dentry *shiftfs_d_real(struct dentry *dentry,
return real;
 }
 
+static int shiftfs_d_weak_revalidate(struct dentry *dentry, unsigned int flags)
+{
+   struct dentry *real = dentry->d_fsdata;
+
+   if (d_unhashed(real))
+   return 0;
+
+   if (!(real->d_flags & DCACHE_OP_WEAK_REVALIDATE))
+   return 1;
+
+   return real->d_op->d_weak_revalidate(real, flags);
+}
+
+static int shiftfs_d_revalidate(struct dentry *dentry, unsigned int flags)
+{
+   struct dentry *real = dentry->d_fsdata;
+   int ret;
+
+   if (d_unhashed(real))
+   return 0;
+
+   if (!(real->d_flags & DCACHE_OP_REVALIDATE))
+   return 1;
+
+   ret = real->d_op->d_revalidate(real, flags);
+
+   if (ret == 0 && !(flags & LOOKUP_RCU))
+   d_invalidate(real);
+
+   return ret;
+}
+
 static const struct dentry_operations shiftfs_dentry_ops = {
.d_release  = shiftfs_d_release,
.d_real = shiftfs_d_real,
+   .d_revalidate   = shiftfs_d_revalidate,
+   .d_weak_revalidate = shiftfs_d_weak_revalidate,
 };
 
 static int shiftfs_readlink(struct dentry *dentry, char __user *data,
@@ -431,9 +465,7 @@ static struct dentry *shiftfs_lookup(struct inode *dir, 
struct dentry *dentry,
return ERR_PTR(-ENOMEM);
}
 
-   d_splice_alias(newi, dentry);
-
-   return NULL;
+   return d_splice_alias(newi, dentry);
 }
 
 static int shiftfs_permission(struct inode *inode, int mask)


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-17 Thread James Bottomley
On Fri, 2017-02-17 at 14:57 +1300, Eric W. Biederman wrote:
> I think I am missing something but I completely do not understand 
> that subthread that says use file marks and perform the work in the 
> vfs. The problem is that fundamentally we need multiple mappings and 
> I don't see a mark on a file (even an inherited mark) providing the 
> mapping so I don't see the point.

The point of the mark is that it's a statement by the system
administrator that the underlying subtree is safe to be mounted by an
unprivileged container in the containers user view (i.e. with
current_user_ns() == s_user_ns).  For the unprivileged container
there's no real arbitrary s_user_ns use case because the unprivileged
container must prove it can set up the mapping, so it would likely
always be mounting from within a user_ns with the mapping it wanted.

James



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-17 Thread James Bottomley
On Fri, 2017-02-17 at 14:57 +1300, Eric W. Biederman wrote:
> I think I am missing something but I completely do not understand 
> that subthread that says use file marks and perform the work in the 
> vfs. The problem is that fundamentally we need multiple mappings and 
> I don't see a mark on a file (even an inherited mark) providing the 
> mapping so I don't see the point.

The point of the mark is that it's a statement by the system
administrator that the underlying subtree is safe to be mounted by an
unprivileged container in the containers user view (i.e. with
current_user_ns() == s_user_ns).  For the unprivileged container
there's no real arbitrary s_user_ns use case because the unprivileged
container must prove it can set up the mapping, so it would likely
always be mounting from within a user_ns with the mapping it wanted.

James



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-17 Thread Djalal Harouni
On Fri, Feb 17, 2017 at 2:57 AM, Eric W. Biederman
 wrote:
> James Bottomley  writes:
>
>> On Thu, 2017-02-16 at 11:42 -0500, Vivek Goyal wrote:
>>> On Thu, Feb 16, 2017 at 07:51:58AM -0800, James Bottomley wrote:
>>>
>>> [..]
>>> > > Two levels of checks will simplify this a bit. Top level inode
>>> > > will belong to the user namespace of caller and checks should
>>> > > pass. And mounter's creds will have ownership over the real inode
>>> > > so no additional namespace shifting required there.
>>> >
>>> > That's the problem: for a marked mount, they don't.
>>>
>>> In this new model it does not fit directly.
>>>
>>> I was playing with a slightly different approach and modified patches
>>> so that real root still does the mounting and takes an mount option
>>> which specifies which user namespace we want to shift into. Thanks to
>>> Eric for the idea.
>>>
>>> mount -t shiftfs -o userns_fd= source shifted-fs
>
>> This is a non-starter because it doesn't work for the unprivileged use
>> case, which is what I'm really interested in.
>
> But I believe it does.  It just requires a bit more work for in the
> shiftfs filesystem above.  It should be perfectly possible with the help
> of newuidmap to create a user namespace with the desired mappings.
>
> My understanding is that Vivek started with requiring root to mount the
> filesystem only as a simplification during development, and that the
> plan is to get the basic use case working and then allow unprivileged
> mounting.
>
>> For fully unprivileged
>> containers you don't have an orchestration system to ask to build the
>> container.  You can get init scripts to set stuff up for you, like the
>> marks, but ideally it should just work even without that (so an inode
>> flag following project semantics seems really appealing), but after
>> that the unprivileged user should be able to build their own
>> containers.
>>
>> As you saw from the reply to Eric, this approach (which I have tried)
>> also opens up a whole can of worms for non-FS_USERNS_MOUNT filesystems.
>>
>
> From what I can see we have two cases we care about.
> A) A non-default mapping from the filesystem to the rest of the system
>and roughly s_user_ns provides that but we need a review of the
>filesystems to make certain something has not been forgotten.
>
> B) A filesystem image sitting around in a directory somewhere that
>we want to map differently into different user namespaces while
>using the same files as backing store.
>
> For the second case what is interesting technically is that we want
> multiple mappings.  A user namespace appears adequate to specify those
> extra mappings (effectively from kuids to kuids).
>
> So we need something to associate the additional mapping with a
> directory tree.  A stackable filesystem with it's own s_user_ns field
> appears a very straight forward way to do that.  Especially if it can
> figure out how to assert that the underlying filesystem image is
> read-only (doesn't overlayfs require that?).  Making the entire stack
> read-only.
>
> I don't see a problem with that for unprivileged use (except possibly
> the read-only enforcement on the unerlying directory tree).
>
> What Vivek is talking about appears to be perfectly correct.  Performing
> the underlying filesystem permission checks as the possibly unprivileged
> user who mounted shiftfs.  After performing a set of permission checks
> (at the shiftfs level) as the user who is accessing the files.
>
>
> . . .
>
>
> I think I am missing something but I completely do not understand that
> subthread that says use file marks and perform the work in the vfs.
> The problem is that fundamentally we need multiple mappings and I don't
> see a mark on a file (even an inherited mark) providing the mapping so I
> don't see the point.
>
> Which leaves two possible places to store the extra mapping.  In the
> struct mount.  Or in a stacked filesystem super_block.  For a stacked
> filesystem I can see where to store the extra translation.  In the upper
> filesystems upper inode.   And we can perform the practical permission
> check on the upper inode as well.
>
> For a vfs level solution it looks like we would have to change all of
> the permission checking code in the kernel to have a special case for
> this kind of mapping.  Which does not sound maintainable.

Facts: for basic permissions: 3 files changed, 19 insertions(+), 6
deletions(-)  https://lkml.org/lkml/2016/5/4/417

That made permissions work for basically *all* filesystems. However
yes it does not handle xattr acls...

> So at the moment I don't think a vfs level solution makes any sense.
>
The permissions change was already done when userns were merged. What
you may need is VFS accessors, instead of working directly on
inode->i_uid ask the VFS to give you the right i_uid (which can also
be the case of projectid proposed by Christoph iff I got it right...)
you need it for 

Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-17 Thread Djalal Harouni
On Fri, Feb 17, 2017 at 2:57 AM, Eric W. Biederman
 wrote:
> James Bottomley  writes:
>
>> On Thu, 2017-02-16 at 11:42 -0500, Vivek Goyal wrote:
>>> On Thu, Feb 16, 2017 at 07:51:58AM -0800, James Bottomley wrote:
>>>
>>> [..]
>>> > > Two levels of checks will simplify this a bit. Top level inode
>>> > > will belong to the user namespace of caller and checks should
>>> > > pass. And mounter's creds will have ownership over the real inode
>>> > > so no additional namespace shifting required there.
>>> >
>>> > That's the problem: for a marked mount, they don't.
>>>
>>> In this new model it does not fit directly.
>>>
>>> I was playing with a slightly different approach and modified patches
>>> so that real root still does the mounting and takes an mount option
>>> which specifies which user namespace we want to shift into. Thanks to
>>> Eric for the idea.
>>>
>>> mount -t shiftfs -o userns_fd= source shifted-fs
>
>> This is a non-starter because it doesn't work for the unprivileged use
>> case, which is what I'm really interested in.
>
> But I believe it does.  It just requires a bit more work for in the
> shiftfs filesystem above.  It should be perfectly possible with the help
> of newuidmap to create a user namespace with the desired mappings.
>
> My understanding is that Vivek started with requiring root to mount the
> filesystem only as a simplification during development, and that the
> plan is to get the basic use case working and then allow unprivileged
> mounting.
>
>> For fully unprivileged
>> containers you don't have an orchestration system to ask to build the
>> container.  You can get init scripts to set stuff up for you, like the
>> marks, but ideally it should just work even without that (so an inode
>> flag following project semantics seems really appealing), but after
>> that the unprivileged user should be able to build their own
>> containers.
>>
>> As you saw from the reply to Eric, this approach (which I have tried)
>> also opens up a whole can of worms for non-FS_USERNS_MOUNT filesystems.
>>
>
> From what I can see we have two cases we care about.
> A) A non-default mapping from the filesystem to the rest of the system
>and roughly s_user_ns provides that but we need a review of the
>filesystems to make certain something has not been forgotten.
>
> B) A filesystem image sitting around in a directory somewhere that
>we want to map differently into different user namespaces while
>using the same files as backing store.
>
> For the second case what is interesting technically is that we want
> multiple mappings.  A user namespace appears adequate to specify those
> extra mappings (effectively from kuids to kuids).
>
> So we need something to associate the additional mapping with a
> directory tree.  A stackable filesystem with it's own s_user_ns field
> appears a very straight forward way to do that.  Especially if it can
> figure out how to assert that the underlying filesystem image is
> read-only (doesn't overlayfs require that?).  Making the entire stack
> read-only.
>
> I don't see a problem with that for unprivileged use (except possibly
> the read-only enforcement on the unerlying directory tree).
>
> What Vivek is talking about appears to be perfectly correct.  Performing
> the underlying filesystem permission checks as the possibly unprivileged
> user who mounted shiftfs.  After performing a set of permission checks
> (at the shiftfs level) as the user who is accessing the files.
>
>
> . . .
>
>
> I think I am missing something but I completely do not understand that
> subthread that says use file marks and perform the work in the vfs.
> The problem is that fundamentally we need multiple mappings and I don't
> see a mark on a file (even an inherited mark) providing the mapping so I
> don't see the point.
>
> Which leaves two possible places to store the extra mapping.  In the
> struct mount.  Or in a stacked filesystem super_block.  For a stacked
> filesystem I can see where to store the extra translation.  In the upper
> filesystems upper inode.   And we can perform the practical permission
> check on the upper inode as well.
>
> For a vfs level solution it looks like we would have to change all of
> the permission checking code in the kernel to have a special case for
> this kind of mapping.  Which does not sound maintainable.

Facts: for basic permissions: 3 files changed, 19 insertions(+), 6
deletions(-)  https://lkml.org/lkml/2016/5/4/417

That made permissions work for basically *all* filesystems. However
yes it does not handle xattr acls...

> So at the moment I don't think a vfs level solution makes any sense.
>
The permissions change was already done when userns were merged. What
you may need is VFS accessors, instead of working directly on
inode->i_uid ask the VFS to give you the right i_uid (which can also
be the case of projectid proposed by Christoph iff I got it right...)
you need it for both ways: to report to userspace and the other way to
pass it to 

Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-16 Thread Al Viro
On Thu, Feb 16, 2017 at 07:56:30AM -0800, James Bottomley wrote:

> > Hi James,
> > 
> > Should it be "return d_splice_alias()" so that if we find an alias it 
> > is returned back to caller and passed in dentry can be freed. Though 
> > I don't know in what cases alias can be found. And if alias is found 
> > how do we make sure alias_dentry->d_fsdata is pointing to new (real
> > dentry).
> 
> It probably should be for the sake of the pattern.  In our case I don't
> think we can have any root aliases because the root dentry is always
> pinned in the cache, so cache lookup should always find it.

What does that have to do with root dentry?  The real reason why that code
works (FVerySVO) is that the damn thing allocates a new inode every time.
Including the hardlinks, BTW.  So d_splice_alias() will always return NULL -
there's no way for any dentries to be pointing to in-core struct inode you've
just allocated.  Short of a use-after-free, that is...

Unless I'm missing something subtle, the whole thing is fucked
in head wrt cache coherency - its dentries are blindly assumed to be
forever valid, no matter what's happening with the underlying filesystem.


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-16 Thread Al Viro
On Thu, Feb 16, 2017 at 07:56:30AM -0800, James Bottomley wrote:

> > Hi James,
> > 
> > Should it be "return d_splice_alias()" so that if we find an alias it 
> > is returned back to caller and passed in dentry can be freed. Though 
> > I don't know in what cases alias can be found. And if alias is found 
> > how do we make sure alias_dentry->d_fsdata is pointing to new (real
> > dentry).
> 
> It probably should be for the sake of the pattern.  In our case I don't
> think we can have any root aliases because the root dentry is always
> pinned in the cache, so cache lookup should always find it.

What does that have to do with root dentry?  The real reason why that code
works (FVerySVO) is that the damn thing allocates a new inode every time.
Including the hardlinks, BTW.  So d_splice_alias() will always return NULL -
there's no way for any dentries to be pointing to in-core struct inode you've
just allocated.  Short of a use-after-free, that is...

Unless I'm missing something subtle, the whole thing is fucked
in head wrt cache coherency - its dentries are blindly assumed to be
forever valid, no matter what's happening with the underlying filesystem.


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-16 Thread Al Viro
On Sat, Feb 04, 2017 at 11:19:32AM -0800, James Bottomley wrote:

> +static const struct dentry_operations shiftfs_dentry_ops = {
> + .d_release  = shiftfs_d_release,
> + .d_real = shiftfs_d_real,
> +};

In other words, those dentries are *never* revalidated.  Nevermind that
underlying fs might be mounted elsewhere and be actively modified under
you.

> +static struct dentry *shiftfs_lookup(struct inode *dir, struct dentry 
> *dentry,
> +  unsigned int flags)
> +{
> + struct dentry *real = dir->i_private, *new;
> + struct inode *reali = real->d_inode, *newi;
> + const struct cred *oldcred, *newcred;
> +
> + inode_lock(reali);
> + oldcred = shiftfs_new_creds(, dentry->d_sb);
> + new = lookup_one_len(dentry->d_name.name, real, dentry->d_name.len);
> + shiftfs_old_creds(oldcred, );
> + inode_unlock(reali);
> +
> + if (IS_ERR(new))
> + return new;
> +
> + dentry->d_fsdata = new;
> +
> + if (!new->d_inode)
> + return NULL;

What happens when somebody comes along and creates the damn thing on the
underlying fs?  _Not_ via your code, that is - using the underlying fs
mounted elsewhere.


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-16 Thread Al Viro
On Sat, Feb 04, 2017 at 11:19:32AM -0800, James Bottomley wrote:

> +static const struct dentry_operations shiftfs_dentry_ops = {
> + .d_release  = shiftfs_d_release,
> + .d_real = shiftfs_d_real,
> +};

In other words, those dentries are *never* revalidated.  Nevermind that
underlying fs might be mounted elsewhere and be actively modified under
you.

> +static struct dentry *shiftfs_lookup(struct inode *dir, struct dentry 
> *dentry,
> +  unsigned int flags)
> +{
> + struct dentry *real = dir->i_private, *new;
> + struct inode *reali = real->d_inode, *newi;
> + const struct cred *oldcred, *newcred;
> +
> + inode_lock(reali);
> + oldcred = shiftfs_new_creds(, dentry->d_sb);
> + new = lookup_one_len(dentry->d_name.name, real, dentry->d_name.len);
> + shiftfs_old_creds(oldcred, );
> + inode_unlock(reali);
> +
> + if (IS_ERR(new))
> + return new;
> +
> + dentry->d_fsdata = new;
> +
> + if (!new->d_inode)
> + return NULL;

What happens when somebody comes along and creates the damn thing on the
underlying fs?  _Not_ via your code, that is - using the underlying fs
mounted elsewhere.


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-16 Thread Eric W. Biederman
James Bottomley  writes:

> On Thu, 2017-02-16 at 11:42 -0500, Vivek Goyal wrote:
>> On Thu, Feb 16, 2017 at 07:51:58AM -0800, James Bottomley wrote:
>> 
>> [..]
>> > > Two levels of checks will simplify this a bit. Top level inode 
>> > > will belong to the user namespace of caller and checks should 
>> > > pass. And mounter's creds will have ownership over the real inode 
>> > > so no additional namespace shifting required there.
>> > 
>> > That's the problem: for a marked mount, they don't.
>> 
>> In this new model it does not fit directly. 
>> 
>> I was playing with a slightly different approach and modified patches 
>> so that real root still does the mounting and takes an mount option
>> which specifies which user namespace we want to shift into. Thanks to 
>> Eric for the idea.
>> 
>> mount -t shiftfs -o userns_fd= source shifted-fs

> This is a non-starter because it doesn't work for the unprivileged use
> case, which is what I'm really interested in.

But I believe it does.  It just requires a bit more work for in the
shiftfs filesystem above.  It should be perfectly possible with the help
of newuidmap to create a user namespace with the desired mappings.

My understanding is that Vivek started with requiring root to mount the
filesystem only as a simplification during development, and that the
plan is to get the basic use case working and then allow unprivileged
mounting.

> For fully unprivileged
> containers you don't have an orchestration system to ask to build the
> container.  You can get init scripts to set stuff up for you, like the
> marks, but ideally it should just work even without that (so an inode
> flag following project semantics seems really appealing), but after
> that the unprivileged user should be able to build their own
> containers.
>
> As you saw from the reply to Eric, this approach (which I have tried)
> also opens up a whole can of worms for non-FS_USERNS_MOUNT filesystems.
>

>From what I can see we have two cases we care about.
A) A non-default mapping from the filesystem to the rest of the system
   and roughly s_user_ns provides that but we need a review of the
   filesystems to make certain something has not been forgotten.

B) A filesystem image sitting around in a directory somewhere that
   we want to map differently into different user namespaces while
   using the same files as backing store.

For the second case what is interesting technically is that we want
multiple mappings.  A user namespace appears adequate to specify those
extra mappings (effectively from kuids to kuids).

So we need something to associate the additional mapping with a
directory tree.  A stackable filesystem with it's own s_user_ns field
appears a very straight forward way to do that.  Especially if it can
figure out how to assert that the underlying filesystem image is
read-only (doesn't overlayfs require that?).  Making the entire stack
read-only.

I don't see a problem with that for unprivileged use (except possibly
the read-only enforcement on the unerlying directory tree).

What Vivek is talking about appears to be perfectly correct.  Performing
the underlying filesystem permission checks as the possibly unprivileged
user who mounted shiftfs.  After performing a set of permission checks
(at the shiftfs level) as the user who is accessing the files.


. . .


I think I am missing something but I completely do not understand that
subthread that says use file marks and perform the work in the vfs.
The problem is that fundamentally we need multiple mappings and I don't
see a mark on a file (even an inherited mark) providing the mapping so I
don't see the point.

Which leaves two possible places to store the extra mapping.  In the
struct mount.  Or in a stacked filesystem super_block.  For a stacked
filesystem I can see where to store the extra translation.  In the upper
filesystems upper inode.   And we can perform the practical permission
check on the upper inode as well.

For a vfs level solution it looks like we would have to change all of
the permission checking code in the kernel to have a special case for
this kind of mapping.  Which does not sound maintainable.

So at the moment I don't think a vfs level solution makes any sense.

And then if you have a stacked filesystem with FS_USERNS_MOUNT set so it
can be mounted by an unprivileged user.   I think it makes sense to
check the mounters creds agains the real inode.  To verify the user that
mounted the filesystem has the permission to perform the desired access.

Which makes only allows the mounter as much permisison as the mounter
would have if they performed the work with fuse instead of a special
in-kernel filesystem.

In a DAC model of the world that makes lots of sense.  I don't know what
actually makes sense in a MAC world.  But I am certain that is something
that can be worked through.

Eric






Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-16 Thread Eric W. Biederman
James Bottomley  writes:

> On Thu, 2017-02-16 at 11:42 -0500, Vivek Goyal wrote:
>> On Thu, Feb 16, 2017 at 07:51:58AM -0800, James Bottomley wrote:
>> 
>> [..]
>> > > Two levels of checks will simplify this a bit. Top level inode 
>> > > will belong to the user namespace of caller and checks should 
>> > > pass. And mounter's creds will have ownership over the real inode 
>> > > so no additional namespace shifting required there.
>> > 
>> > That's the problem: for a marked mount, they don't.
>> 
>> In this new model it does not fit directly. 
>> 
>> I was playing with a slightly different approach and modified patches 
>> so that real root still does the mounting and takes an mount option
>> which specifies which user namespace we want to shift into. Thanks to 
>> Eric for the idea.
>> 
>> mount -t shiftfs -o userns_fd= source shifted-fs

> This is a non-starter because it doesn't work for the unprivileged use
> case, which is what I'm really interested in.

But I believe it does.  It just requires a bit more work for in the
shiftfs filesystem above.  It should be perfectly possible with the help
of newuidmap to create a user namespace with the desired mappings.

My understanding is that Vivek started with requiring root to mount the
filesystem only as a simplification during development, and that the
plan is to get the basic use case working and then allow unprivileged
mounting.

> For fully unprivileged
> containers you don't have an orchestration system to ask to build the
> container.  You can get init scripts to set stuff up for you, like the
> marks, but ideally it should just work even without that (so an inode
> flag following project semantics seems really appealing), but after
> that the unprivileged user should be able to build their own
> containers.
>
> As you saw from the reply to Eric, this approach (which I have tried)
> also opens up a whole can of worms for non-FS_USERNS_MOUNT filesystems.
>

>From what I can see we have two cases we care about.
A) A non-default mapping from the filesystem to the rest of the system
   and roughly s_user_ns provides that but we need a review of the
   filesystems to make certain something has not been forgotten.

B) A filesystem image sitting around in a directory somewhere that
   we want to map differently into different user namespaces while
   using the same files as backing store.

For the second case what is interesting technically is that we want
multiple mappings.  A user namespace appears adequate to specify those
extra mappings (effectively from kuids to kuids).

So we need something to associate the additional mapping with a
directory tree.  A stackable filesystem with it's own s_user_ns field
appears a very straight forward way to do that.  Especially if it can
figure out how to assert that the underlying filesystem image is
read-only (doesn't overlayfs require that?).  Making the entire stack
read-only.

I don't see a problem with that for unprivileged use (except possibly
the read-only enforcement on the unerlying directory tree).

What Vivek is talking about appears to be perfectly correct.  Performing
the underlying filesystem permission checks as the possibly unprivileged
user who mounted shiftfs.  After performing a set of permission checks
(at the shiftfs level) as the user who is accessing the files.


. . .


I think I am missing something but I completely do not understand that
subthread that says use file marks and perform the work in the vfs.
The problem is that fundamentally we need multiple mappings and I don't
see a mark on a file (even an inherited mark) providing the mapping so I
don't see the point.

Which leaves two possible places to store the extra mapping.  In the
struct mount.  Or in a stacked filesystem super_block.  For a stacked
filesystem I can see where to store the extra translation.  In the upper
filesystems upper inode.   And we can perform the practical permission
check on the upper inode as well.

For a vfs level solution it looks like we would have to change all of
the permission checking code in the kernel to have a special case for
this kind of mapping.  Which does not sound maintainable.

So at the moment I don't think a vfs level solution makes any sense.

And then if you have a stacked filesystem with FS_USERNS_MOUNT set so it
can be mounted by an unprivileged user.   I think it makes sense to
check the mounters creds agains the real inode.  To verify the user that
mounted the filesystem has the permission to perform the desired access.

Which makes only allows the mounter as much permisison as the mounter
would have if they performed the work with fuse instead of a special
in-kernel filesystem.

In a DAC model of the world that makes lots of sense.  I don't know what
actually makes sense in a MAC world.  But I am certain that is something
that can be worked through.

Eric






Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-16 Thread James Bottomley
On Thu, 2017-02-16 at 11:42 -0500, Vivek Goyal wrote:
> On Thu, Feb 16, 2017 at 07:51:58AM -0800, James Bottomley wrote:
> 
> [..]
> > > Two levels of checks will simplify this a bit. Top level inode 
> > > will belong to the user namespace of caller and checks should 
> > > pass. And mounter's creds will have ownership over the real inode 
> > > so no additional namespace shifting required there.
> > 
> > That's the problem: for a marked mount, they don't.
> 
> In this new model it does not fit directly. 
> 
> I was playing with a slightly different approach and modified patches 
> so that real root still does the mounting and takes an mount option
> which specifies which user namespace we want to shift into. Thanks to 
> Eric for the idea.
> 
> mount -t shiftfs -o userns_fd= source shifted-fs

This is a non-starter because it doesn't work for the unprivileged use
case, which is what I'm really interested in.  For fully unprivileged
containers you don't have an orchestration system to ask to build the
container.  You can get init scripts to set stuff up for you, like the
marks, but ideally it should just work even without that (so an inode
flag following project semantics seems really appealing), but after
that the unprivileged user should be able to build their own
containers.

As you saw from the reply to Eric, this approach (which I have tried)
also opens up a whole can of worms for non-FS_USERNS_MOUNT filesystems.

James


> In this case real-root is mounter and notion of using mounter's creds 
> on real-inode works. 
> This requires a user namespace to be created before shiftfs can be 
> mounted and then container admin should be able to bind mount shifted
> -fs.
> 
> In this model, intervention of real-root is still required to setup
> container and shiftfs. I guess that might not satisfy your needs 
> where unprivileged user should be able to launch container and be 
> able to make use of shiftfs, IIUC.
> 
> Vivek
> 



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-16 Thread James Bottomley
On Thu, 2017-02-16 at 11:42 -0500, Vivek Goyal wrote:
> On Thu, Feb 16, 2017 at 07:51:58AM -0800, James Bottomley wrote:
> 
> [..]
> > > Two levels of checks will simplify this a bit. Top level inode 
> > > will belong to the user namespace of caller and checks should 
> > > pass. And mounter's creds will have ownership over the real inode 
> > > so no additional namespace shifting required there.
> > 
> > That's the problem: for a marked mount, they don't.
> 
> In this new model it does not fit directly. 
> 
> I was playing with a slightly different approach and modified patches 
> so that real root still does the mounting and takes an mount option
> which specifies which user namespace we want to shift into. Thanks to 
> Eric for the idea.
> 
> mount -t shiftfs -o userns_fd= source shifted-fs

This is a non-starter because it doesn't work for the unprivileged use
case, which is what I'm really interested in.  For fully unprivileged
containers you don't have an orchestration system to ask to build the
container.  You can get init scripts to set stuff up for you, like the
marks, but ideally it should just work even without that (so an inode
flag following project semantics seems really appealing), but after
that the unprivileged user should be able to build their own
containers.

As you saw from the reply to Eric, this approach (which I have tried)
also opens up a whole can of worms for non-FS_USERNS_MOUNT filesystems.

James


> In this case real-root is mounter and notion of using mounter's creds 
> on real-inode works. 
> This requires a user namespace to be created before shiftfs can be 
> mounted and then container admin should be able to bind mount shifted
> -fs.
> 
> In this model, intervention of real-root is still required to setup
> container and shiftfs. I guess that might not satisfy your needs 
> where unprivileged user should be able to launch container and be 
> able to make use of shiftfs, IIUC.
> 
> Vivek
> 



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-16 Thread Vivek Goyal
On Thu, Feb 16, 2017 at 07:51:58AM -0800, James Bottomley wrote:

[..]
> > Two levels of checks will simplify this a bit. Top level inode will 
> > belong to the user namespace of caller and checks should pass. And 
> > mounter's creds will have ownership over the real inode so no 
> > additional namespace shifting required there.
> 
> That's the problem: for a marked mount, they don't.

In this new model it does not fit directly. 

I was playing with a slightly different approach and modified patches so
that real root still does the mounting and takes an mount option which
specifies which user namespace we want to shift into. Thanks to Eric for
the idea.

mount -t shiftfs -o userns_fd= source shifted-fs

In this case real-root is mounter and notion of using mounter's creds on
real-inode works.

This requires a user namespace to be created before shiftfs can be mounted
and then container admin should be able to bind mount shifted-fs.

In this model, intervention of real-root is still required to setup
container and shiftfs. I guess that might not satisfy your needs where
unprivileged user should be able to launch container and be able to
make use of shiftfs, IIUC.

Vivek


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-16 Thread Vivek Goyal
On Thu, Feb 16, 2017 at 07:51:58AM -0800, James Bottomley wrote:

[..]
> > Two levels of checks will simplify this a bit. Top level inode will 
> > belong to the user namespace of caller and checks should pass. And 
> > mounter's creds will have ownership over the real inode so no 
> > additional namespace shifting required there.
> 
> That's the problem: for a marked mount, they don't.

In this new model it does not fit directly. 

I was playing with a slightly different approach and modified patches so
that real root still does the mounting and takes an mount option which
specifies which user namespace we want to shift into. Thanks to Eric for
the idea.

mount -t shiftfs -o userns_fd= source shifted-fs

In this case real-root is mounter and notion of using mounter's creds on
real-inode works.

This requires a user namespace to be created before shiftfs can be mounted
and then container admin should be able to bind mount shifted-fs.

In this model, intervention of real-root is still required to setup
container and shiftfs. I guess that might not satisfy your needs where
unprivileged user should be able to launch container and be able to
make use of shiftfs, IIUC.

Vivek


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-16 Thread James Bottomley
On Wed, 2017-02-15 at 15:34 -0500, Vivek Goyal wrote:
> On Sat, Feb 04, 2017 at 11:19:32AM -0800, James Bottomley wrote:
> 
> [..]
> > +static struct dentry *shiftfs_lookup(struct inode *dir, struct
> > dentry *dentry,
> > +unsigned int flags)
> > +{
> > +   struct dentry *real = dir->i_private, *new;
> > +   struct inode *reali = real->d_inode, *newi;
> > +   const struct cred *oldcred, *newcred;
> > +
> > +   inode_lock(reali);
> > +   oldcred = shiftfs_new_creds(, dentry->d_sb);
> > +   new = lookup_one_len(dentry->d_name.name, real, dentry
> > ->d_name.len);
> > +   shiftfs_old_creds(oldcred, );
> > +   inode_unlock(reali);
> > +
> > +   if (IS_ERR(new))
> > +   return new;
> > +
> > +   dentry->d_fsdata = new;
> > +
> > +   if (!new->d_inode)
> > +   return NULL;
> > +
> > +   newi = shiftfs_new_inode(dentry->d_sb, new->d_inode
> > ->i_mode, new);
> > +   if (!newi) {
> > +   dput(new);
> > +   return ERR_PTR(-ENOMEM);
> > +   }
> > +
> > +   d_splice_alias(newi, dentry);
> 
> Hi James,
> 
> Should it be "return d_splice_alias()" so that if we find an alias it 
> is returned back to caller and passed in dentry can be freed. Though 
> I don't know in what cases alias can be found. And if alias is found 
> how do we make sure alias_dentry->d_fsdata is pointing to new (real
> dentry).

It probably should be for the sake of the pattern.  In our case I don't
think we can have any root aliases because the root dentry is always
pinned in the cache, so cache lookup should always find it.

James



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-16 Thread James Bottomley
On Wed, 2017-02-15 at 15:34 -0500, Vivek Goyal wrote:
> On Sat, Feb 04, 2017 at 11:19:32AM -0800, James Bottomley wrote:
> 
> [..]
> > +static struct dentry *shiftfs_lookup(struct inode *dir, struct
> > dentry *dentry,
> > +unsigned int flags)
> > +{
> > +   struct dentry *real = dir->i_private, *new;
> > +   struct inode *reali = real->d_inode, *newi;
> > +   const struct cred *oldcred, *newcred;
> > +
> > +   inode_lock(reali);
> > +   oldcred = shiftfs_new_creds(, dentry->d_sb);
> > +   new = lookup_one_len(dentry->d_name.name, real, dentry
> > ->d_name.len);
> > +   shiftfs_old_creds(oldcred, );
> > +   inode_unlock(reali);
> > +
> > +   if (IS_ERR(new))
> > +   return new;
> > +
> > +   dentry->d_fsdata = new;
> > +
> > +   if (!new->d_inode)
> > +   return NULL;
> > +
> > +   newi = shiftfs_new_inode(dentry->d_sb, new->d_inode
> > ->i_mode, new);
> > +   if (!newi) {
> > +   dput(new);
> > +   return ERR_PTR(-ENOMEM);
> > +   }
> > +
> > +   d_splice_alias(newi, dentry);
> 
> Hi James,
> 
> Should it be "return d_splice_alias()" so that if we find an alias it 
> is returned back to caller and passed in dentry can be freed. Though 
> I don't know in what cases alias can be found. And if alias is found 
> how do we make sure alias_dentry->d_fsdata is pointing to new (real
> dentry).

It probably should be for the sake of the pattern.  In our case I don't
think we can have any root aliases because the root dentry is always
pinned in the cache, so cache lookup should always find it.

James



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-16 Thread James Bottomley
On Wed, 2017-02-15 at 09:17 -0500, Vivek Goyal wrote:
> On Tue, Feb 14, 2017 at 03:45:55PM -0800, James Bottomley wrote:
> > On Tue, 2017-02-14 at 18:03 -0500, Vivek Goyal wrote:

[...]
> > > Given we have already shifted the uid/gid for shiftfs inode, I am
> > > wondering that why can't we simply call
> > > generic_permission(shiftfs_inode, mask) directly in the context
> > > of caller. Something like..
> > > 
> > > shiftfs_permission() {
> > >   err = generic_permission(inode, mask);
> > >   if (err)
> > >   return err;
> > > 
> > >   switch_to_mounter_creds;
> > >   err = inode_permission(reali, mask);
> > >   revert_creds();
> > > 
> > >   return err;
> > > }
> > 
> > Because if the reali->d_iop->permission exists, you should use it. 
> >  You could argue shiftfs_permission should be
> > 
> > if (iop->permission) {
> > oldcred = shiftfs_new_creds(, inode->i_sb);
> > err = iop->permission(reali, mask);
> > shiftfs_old_creds(oldcred, );
> > } else
> > err = generic_permission(inode, mask);
> > 
> > But really that's a small optimisation.
> 
> ok. I thought using mounter's creds for real inode checks, will 
> probably do away with need of modifying caller's user namespace in
> shiftfs_get_up_creds().

Well, no ... the mounter of a marked superblock is container admin, but
the owner in the filesystem view is real root.  The unprivileged
mounter's credentials aren't sufficient, therefore. 

> cred->fsuid = KUIDT_INIT(from_kuid(sb->s_user_ns, cred->fsuid));
> cred->fsgid = KGIDT_INIT(from_kgid(sb->s_user_ns, cred->fsgid));
> cred->user_ns = ssi->userns;
> 
> IIUC, we are shifting caller's fsuid and fsgid into caller's user
> namespace but at the same time using the user_ns of reali->sb
> ->sb_user_ns. Feels little twisted to me. May be I am
> misunderstanding it.

Actually what we're doing is shifting the credentials into the
underlying mount's filesystem view.

> Two levels of checks will simplify this a bit. Top level inode will 
> belong to the user namespace of caller and checks should pass. And 
> mounter's creds will have ownership over the real inode so no 
> additional namespace shifting required there.

That's the problem: for a marked mount, they don't.

>  We could also save these creds at mount time once and don't have to
> prepare it for every call. (not sure if it has significant
> performance issue or not).  Just thinking aloud...

If it proves to be an issue, I suppose the shift could be cached, but I
really don't think it matters that much.

James



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-16 Thread James Bottomley
On Wed, 2017-02-15 at 09:17 -0500, Vivek Goyal wrote:
> On Tue, Feb 14, 2017 at 03:45:55PM -0800, James Bottomley wrote:
> > On Tue, 2017-02-14 at 18:03 -0500, Vivek Goyal wrote:

[...]
> > > Given we have already shifted the uid/gid for shiftfs inode, I am
> > > wondering that why can't we simply call
> > > generic_permission(shiftfs_inode, mask) directly in the context
> > > of caller. Something like..
> > > 
> > > shiftfs_permission() {
> > >   err = generic_permission(inode, mask);
> > >   if (err)
> > >   return err;
> > > 
> > >   switch_to_mounter_creds;
> > >   err = inode_permission(reali, mask);
> > >   revert_creds();
> > > 
> > >   return err;
> > > }
> > 
> > Because if the reali->d_iop->permission exists, you should use it. 
> >  You could argue shiftfs_permission should be
> > 
> > if (iop->permission) {
> > oldcred = shiftfs_new_creds(, inode->i_sb);
> > err = iop->permission(reali, mask);
> > shiftfs_old_creds(oldcred, );
> > } else
> > err = generic_permission(inode, mask);
> > 
> > But really that's a small optimisation.
> 
> ok. I thought using mounter's creds for real inode checks, will 
> probably do away with need of modifying caller's user namespace in
> shiftfs_get_up_creds().

Well, no ... the mounter of a marked superblock is container admin, but
the owner in the filesystem view is real root.  The unprivileged
mounter's credentials aren't sufficient, therefore. 

> cred->fsuid = KUIDT_INIT(from_kuid(sb->s_user_ns, cred->fsuid));
> cred->fsgid = KGIDT_INIT(from_kgid(sb->s_user_ns, cred->fsgid));
> cred->user_ns = ssi->userns;
> 
> IIUC, we are shifting caller's fsuid and fsgid into caller's user
> namespace but at the same time using the user_ns of reali->sb
> ->sb_user_ns. Feels little twisted to me. May be I am
> misunderstanding it.

Actually what we're doing is shifting the credentials into the
underlying mount's filesystem view.

> Two levels of checks will simplify this a bit. Top level inode will 
> belong to the user namespace of caller and checks should pass. And 
> mounter's creds will have ownership over the real inode so no 
> additional namespace shifting required there.

That's the problem: for a marked mount, they don't.

>  We could also save these creds at mount time once and don't have to
> prepare it for every call. (not sure if it has significant
> performance issue or not).  Just thinking aloud...

If it proves to be an issue, I suppose the shift could be cached, but I
really don't think it matters that much.

James



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-15 Thread Vivek Goyal
On Sat, Feb 04, 2017 at 11:19:32AM -0800, James Bottomley wrote:

[..]
> +static struct dentry *shiftfs_lookup(struct inode *dir, struct dentry 
> *dentry,
> +  unsigned int flags)
> +{
> + struct dentry *real = dir->i_private, *new;
> + struct inode *reali = real->d_inode, *newi;
> + const struct cred *oldcred, *newcred;
> +
> + inode_lock(reali);
> + oldcred = shiftfs_new_creds(, dentry->d_sb);
> + new = lookup_one_len(dentry->d_name.name, real, dentry->d_name.len);
> + shiftfs_old_creds(oldcred, );
> + inode_unlock(reali);
> +
> + if (IS_ERR(new))
> + return new;
> +
> + dentry->d_fsdata = new;
> +
> + if (!new->d_inode)
> + return NULL;
> +
> + newi = shiftfs_new_inode(dentry->d_sb, new->d_inode->i_mode, new);
> + if (!newi) {
> + dput(new);
> + return ERR_PTR(-ENOMEM);
> + }
> +
> + d_splice_alias(newi, dentry);

Hi James,

Should it be "return d_splice_alias()" so that if we find an alias it is
returned back to caller and passed in dentry can be freed. Though I don't
know in what cases alias can be found. And if alias is found how do we
make sure alias_dentry->d_fsdata is pointing to new (real dentry).

> +
> + return NULL;
> +}

Vivek


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-15 Thread Vivek Goyal
On Sat, Feb 04, 2017 at 11:19:32AM -0800, James Bottomley wrote:

[..]
> +static struct dentry *shiftfs_lookup(struct inode *dir, struct dentry 
> *dentry,
> +  unsigned int flags)
> +{
> + struct dentry *real = dir->i_private, *new;
> + struct inode *reali = real->d_inode, *newi;
> + const struct cred *oldcred, *newcred;
> +
> + inode_lock(reali);
> + oldcred = shiftfs_new_creds(, dentry->d_sb);
> + new = lookup_one_len(dentry->d_name.name, real, dentry->d_name.len);
> + shiftfs_old_creds(oldcred, );
> + inode_unlock(reali);
> +
> + if (IS_ERR(new))
> + return new;
> +
> + dentry->d_fsdata = new;
> +
> + if (!new->d_inode)
> + return NULL;
> +
> + newi = shiftfs_new_inode(dentry->d_sb, new->d_inode->i_mode, new);
> + if (!newi) {
> + dput(new);
> + return ERR_PTR(-ENOMEM);
> + }
> +
> + d_splice_alias(newi, dentry);

Hi James,

Should it be "return d_splice_alias()" so that if we find an alias it is
returned back to caller and passed in dentry can be freed. Though I don't
know in what cases alias can be found. And if alias is found how do we
make sure alias_dentry->d_fsdata is pointing to new (real dentry).

> +
> + return NULL;
> +}

Vivek


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-15 Thread Vivek Goyal
On Tue, Feb 14, 2017 at 03:45:55PM -0800, James Bottomley wrote:
> On Tue, 2017-02-14 at 18:03 -0500, Vivek Goyal wrote:
> > On Sun, Feb 05, 2017 at 05:18:11PM -0800, James Bottomley wrote:
> > 
> > [..]
> > > >  shiftfs is going to miss out on overlayfs bug fixes related to
> > > > user 
> > > > credentials differ from mounter credentials, like fd3220d ("ovl: 
> > > > update S_ISGID when setting posix ACLs"). I am not sure that this
> > > > specific case is relevant to shiftfs, but there could be other.
> > > 
> > > OK, so shiftfs doesn't have this bug and the reason why is
> > > illustrative: basically shiftfs does three things
> > > 
> > >1. lookups via a uid/gid shifted dentry cache
> > >2. shifted credential inode operations permission checks on the
> > >   underlying filesystem
> > >3. location marking for unprivileged mount
> > > 
> > > I think we've already seen that 1. isn't from overlayfs but the
> > > functionality could be added to overlayfs, I suppose.  The big 
> > > problem is 2.  The overlayfs code emulates the permission checks, 
> > > which makes it rather complex (this is where you get your bugs like 
> > > the above from).  I did actually look at adding 2. to overlayfs on 
> > > the theory that a single layer overlay might be closest to what 
> > > this is, but eventually concluded I'd have to take the special 
> > > cases and add a whole lot more to them ... it really would increase 
> > > the maintenance burden substantially and make the code an
> > > unreadable rats nest.
> > 
> > Hi James,
> > 
> > If we merge this functionality in overlayfs, then we could avoid 
> > extra copy of dentry/inode and that might be a significant advantage.
> 
> I made that argument to Viro originally when I tried to do all lookups
> via the underlying cache.  In the end, it's 192 bytes per dentry and
> 584 per inode, all of which are reclaimable, so it's not much of an
> advantage and it is a great simplification to the code.  In general if
> you have a natural separation, you should make the layers reflect it.

ok.

> 
> My container use case doesn't use overlayfs currently, so to me it
> wouldn't provide any advantage whatsoever.

In docker and other use cases, this probably will be used in conjunction
with overlayfs as containers would like to write data and that should not
go back to image dir and should be sent to container specific dir.

> 
> > W.r.t permission checks, I am wondering will it make sense to do what
> > overlayfs is doing for shiftfs. That is permission is checked on
> > two inodes. We use creds of task for checking permission on
> > shiftfs/overlay inode and mounter's creds on real inode.
> 
> The mounter's creds for overlay are usually admin ones, so it's local
> permission check asks should I? and the later one asks can I? (as in
> would my original admin creds allow this).  In many ways, overlayfs is
> ignoring the fact that the underlying ->permissions() call might have
> failed for some good reason on the current creds.  I don't think any
> serious trouble results from this but it strikes me as icky.

So we do call ->permission() of underlying inode but with the creds of
mounter (as you noted). Given we don't call reali->permission() with
the creds of task, it resulted in issues with disk quota. mounter
had CAP_SYS_RESOURCE so disk quota was being ignored. But that's easily
fixable by taking away CAP_SYS_RESOURCE from mounter's creds if caller
does not have CAP_SYS_RESOURCE.

> 
> > Given we have already shifted the uid/gid for shiftfs inode, I am 
> > wondering that why can't we simply call generic_permission(shiftfs_in
> > ode, mask) directly in the context of caller. Something like..
> > 
> > shiftfs_permission() {
> >   err = generic_permission(inode, mask);
> >   if (err)
> > return err;
> > 
> >   switch_to_mounter_creds;
> >   err = inode_permission(reali, mask);
> >   revert_creds();
> > 
> >   return err;
> > }
> 
> Because if the reali->d_iop->permission exists, you should use it.  You
> could argue shiftfs_permission should be
> 
>   if (iop->permission) {
>   oldcred = shiftfs_new_creds(, inode->i_sb);
>   err = iop->permission(reali, mask);
>   shiftfs_old_creds(oldcred, );
>   } else
>   err = generic_permission(inode, mask);
> 
> But really that's a small optimisation.

ok. I thought using mounter's creds for real inode checks, will probably
do away with need of modifying caller's user namespace in
shiftfs_get_up_creds().

cred->fsuid = KUIDT_INIT(from_kuid(sb->s_user_ns, cred->fsuid));
cred->fsgid = KGIDT_INIT(from_kgid(sb->s_user_ns, cred->fsgid));
cred->user_ns = ssi->userns;

IIUC, we are shifting caller's fsuid and fsgid into caller's user
namespace but at the same time using the user_ns of reali->sb->sb_user_ns.
Feels little twisted to me. May be I am misunderstanding it.

Two levels of checks will simplify this a bit. Top level inode will belong
to the user namespace of caller and 

Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-15 Thread Vivek Goyal
On Tue, Feb 14, 2017 at 03:45:55PM -0800, James Bottomley wrote:
> On Tue, 2017-02-14 at 18:03 -0500, Vivek Goyal wrote:
> > On Sun, Feb 05, 2017 at 05:18:11PM -0800, James Bottomley wrote:
> > 
> > [..]
> > > >  shiftfs is going to miss out on overlayfs bug fixes related to
> > > > user 
> > > > credentials differ from mounter credentials, like fd3220d ("ovl: 
> > > > update S_ISGID when setting posix ACLs"). I am not sure that this
> > > > specific case is relevant to shiftfs, but there could be other.
> > > 
> > > OK, so shiftfs doesn't have this bug and the reason why is
> > > illustrative: basically shiftfs does three things
> > > 
> > >1. lookups via a uid/gid shifted dentry cache
> > >2. shifted credential inode operations permission checks on the
> > >   underlying filesystem
> > >3. location marking for unprivileged mount
> > > 
> > > I think we've already seen that 1. isn't from overlayfs but the
> > > functionality could be added to overlayfs, I suppose.  The big 
> > > problem is 2.  The overlayfs code emulates the permission checks, 
> > > which makes it rather complex (this is where you get your bugs like 
> > > the above from).  I did actually look at adding 2. to overlayfs on 
> > > the theory that a single layer overlay might be closest to what 
> > > this is, but eventually concluded I'd have to take the special 
> > > cases and add a whole lot more to them ... it really would increase 
> > > the maintenance burden substantially and make the code an
> > > unreadable rats nest.
> > 
> > Hi James,
> > 
> > If we merge this functionality in overlayfs, then we could avoid 
> > extra copy of dentry/inode and that might be a significant advantage.
> 
> I made that argument to Viro originally when I tried to do all lookups
> via the underlying cache.  In the end, it's 192 bytes per dentry and
> 584 per inode, all of which are reclaimable, so it's not much of an
> advantage and it is a great simplification to the code.  In general if
> you have a natural separation, you should make the layers reflect it.

ok.

> 
> My container use case doesn't use overlayfs currently, so to me it
> wouldn't provide any advantage whatsoever.

In docker and other use cases, this probably will be used in conjunction
with overlayfs as containers would like to write data and that should not
go back to image dir and should be sent to container specific dir.

> 
> > W.r.t permission checks, I am wondering will it make sense to do what
> > overlayfs is doing for shiftfs. That is permission is checked on
> > two inodes. We use creds of task for checking permission on
> > shiftfs/overlay inode and mounter's creds on real inode.
> 
> The mounter's creds for overlay are usually admin ones, so it's local
> permission check asks should I? and the later one asks can I? (as in
> would my original admin creds allow this).  In many ways, overlayfs is
> ignoring the fact that the underlying ->permissions() call might have
> failed for some good reason on the current creds.  I don't think any
> serious trouble results from this but it strikes me as icky.

So we do call ->permission() of underlying inode but with the creds of
mounter (as you noted). Given we don't call reali->permission() with
the creds of task, it resulted in issues with disk quota. mounter
had CAP_SYS_RESOURCE so disk quota was being ignored. But that's easily
fixable by taking away CAP_SYS_RESOURCE from mounter's creds if caller
does not have CAP_SYS_RESOURCE.

> 
> > Given we have already shifted the uid/gid for shiftfs inode, I am 
> > wondering that why can't we simply call generic_permission(shiftfs_in
> > ode, mask) directly in the context of caller. Something like..
> > 
> > shiftfs_permission() {
> >   err = generic_permission(inode, mask);
> >   if (err)
> > return err;
> > 
> >   switch_to_mounter_creds;
> >   err = inode_permission(reali, mask);
> >   revert_creds();
> > 
> >   return err;
> > }
> 
> Because if the reali->d_iop->permission exists, you should use it.  You
> could argue shiftfs_permission should be
> 
>   if (iop->permission) {
>   oldcred = shiftfs_new_creds(, inode->i_sb);
>   err = iop->permission(reali, mask);
>   shiftfs_old_creds(oldcred, );
>   } else
>   err = generic_permission(inode, mask);
> 
> But really that's a small optimisation.

ok. I thought using mounter's creds for real inode checks, will probably
do away with need of modifying caller's user namespace in
shiftfs_get_up_creds().

cred->fsuid = KUIDT_INIT(from_kuid(sb->s_user_ns, cred->fsuid));
cred->fsgid = KGIDT_INIT(from_kgid(sb->s_user_ns, cred->fsgid));
cred->user_ns = ssi->userns;

IIUC, we are shifting caller's fsuid and fsgid into caller's user
namespace but at the same time using the user_ns of reali->sb->sb_user_ns.
Feels little twisted to me. May be I am misunderstanding it.

Two levels of checks will simplify this a bit. Top level inode will belong
to the user namespace of caller and 

Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-15 Thread Djalal Harouni
On Wed, Feb 15, 2017 at 10:37 AM, Eric W. Biederman
 wrote:
> Djalal Harouni  writes:
>
>> On Mon, Feb 13, 2017 at 11:15 AM, Eric W. Biederman
>>  wrote:
>>> James Bottomley  writes:
>
 So is this.  Basically anything that begins by mounting gets a super
 block and can use the s_user_ns to map from the filesystem view to the
 kernel view of ids.  Apart from greater sophistication in the
 parametrisation, it sounds like we have all the machinery you need.
  I'm sure the containers people will consider reasonable patches to
 change this.
>>>
>>> Yes.
>>>
>>> And to be clear we have all of that merged now and mostly present and
>>> hooked up in all filesystems without any shiftfs like changes needed.
>>>
>>> To use this with a filesystem a last pass needs to be had to verify that
>>> the cases where something does not map are handled cleanly.
>>
>> Still this does not answer the question how to dynamically
>> *attach/share* data or read-only volumes as defined by
>> orchestration/container tools into several containers. Am I missing
>> something or is the plan to have per superblock mount for each one ?
>
> Agreed.  That is a related problem and the problem that shiftfs
> is working to solve.
>
> If you only need a single mapping the infrastructure is basically done
> in the kernel today.  If you need multiple mappings we need something
> more.

Yes, I'm asking since there is that vfs+userns proposed approach that
I linked in this thread, that deals with this particular problem: in
which mount namespace<->container the volume appears, maybe that can
be used on top of the s_user_ns ...

Thanks!


-- 
tixxdz


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-15 Thread Djalal Harouni
On Wed, Feb 15, 2017 at 10:37 AM, Eric W. Biederman
 wrote:
> Djalal Harouni  writes:
>
>> On Mon, Feb 13, 2017 at 11:15 AM, Eric W. Biederman
>>  wrote:
>>> James Bottomley  writes:
>
 So is this.  Basically anything that begins by mounting gets a super
 block and can use the s_user_ns to map from the filesystem view to the
 kernel view of ids.  Apart from greater sophistication in the
 parametrisation, it sounds like we have all the machinery you need.
  I'm sure the containers people will consider reasonable patches to
 change this.
>>>
>>> Yes.
>>>
>>> And to be clear we have all of that merged now and mostly present and
>>> hooked up in all filesystems without any shiftfs like changes needed.
>>>
>>> To use this with a filesystem a last pass needs to be had to verify that
>>> the cases where something does not map are handled cleanly.
>>
>> Still this does not answer the question how to dynamically
>> *attach/share* data or read-only volumes as defined by
>> orchestration/container tools into several containers. Am I missing
>> something or is the plan to have per superblock mount for each one ?
>
> Agreed.  That is a related problem and the problem that shiftfs
> is working to solve.
>
> If you only need a single mapping the infrastructure is basically done
> in the kernel today.  If you need multiple mappings we need something
> more.

Yes, I'm asking since there is that vfs+userns proposed approach that
I linked in this thread, that deals with this particular problem: in
which mount namespace<->container the volume appears, maybe that can
be used on top of the s_user_ns ...

Thanks!


-- 
tixxdz


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-15 Thread Eric W. Biederman
Djalal Harouni  writes:

> On Mon, Feb 13, 2017 at 11:15 AM, Eric W. Biederman
>  wrote:
>> James Bottomley  writes:

>>> So is this.  Basically anything that begins by mounting gets a super
>>> block and can use the s_user_ns to map from the filesystem view to the
>>> kernel view of ids.  Apart from greater sophistication in the
>>> parametrisation, it sounds like we have all the machinery you need.
>>>  I'm sure the containers people will consider reasonable patches to
>>> change this.
>>
>> Yes.
>>
>> And to be clear we have all of that merged now and mostly present and
>> hooked up in all filesystems without any shiftfs like changes needed.
>>
>> To use this with a filesystem a last pass needs to be had to verify that
>> the cases where something does not map are handled cleanly.
>
> Still this does not answer the question how to dynamically
> *attach/share* data or read-only volumes as defined by
> orchestration/container tools into several containers. Am I missing
> something or is the plan to have per superblock mount for each one ?

Agreed.  That is a related problem and the problem that shiftfs
is working to solve.

If you only need a single mapping the infrastructure is basically done
in the kernel today.  If you need multiple mappings we need something
more.

Eric



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-15 Thread Eric W. Biederman
Djalal Harouni  writes:

> On Mon, Feb 13, 2017 at 11:15 AM, Eric W. Biederman
>  wrote:
>> James Bottomley  writes:

>>> So is this.  Basically anything that begins by mounting gets a super
>>> block and can use the s_user_ns to map from the filesystem view to the
>>> kernel view of ids.  Apart from greater sophistication in the
>>> parametrisation, it sounds like we have all the machinery you need.
>>>  I'm sure the containers people will consider reasonable patches to
>>> change this.
>>
>> Yes.
>>
>> And to be clear we have all of that merged now and mostly present and
>> hooked up in all filesystems without any shiftfs like changes needed.
>>
>> To use this with a filesystem a last pass needs to be had to verify that
>> the cases where something does not map are handled cleanly.
>
> Still this does not answer the question how to dynamically
> *attach/share* data or read-only volumes as defined by
> orchestration/container tools into several containers. Am I missing
> something or is the plan to have per superblock mount for each one ?

Agreed.  That is a related problem and the problem that shiftfs
is working to solve.

If you only need a single mapping the infrastructure is basically done
in the kernel today.  If you need multiple mappings we need something
more.

Eric



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-15 Thread Djalal Harouni
On Mon, Feb 13, 2017 at 11:15 AM, Eric W. Biederman
 wrote:
> James Bottomley  writes:
>
>> On Thu, 2017-02-09 at 02:36 -0800, Josh Triplett wrote:
>>> On Wed, Feb 08, 2017 at 07:22:45AM -0800, James Bottomley wrote:
>>> > On Tue, 2017-02-07 at 17:54 -0800, Josh Triplett wrote:
>>> > > On Tue, Feb 07, 2017 at 11:49:33AM -0800, Christoph Hellwig
>>> > > wrote:
>>> > > > On Tue, Feb 07, 2017 at 11:02:03AM -0800, James Bottomley
>>> > > > wrote:
>>> > > > > >   Another option would be to require something like a
>>> > > > > > project as used for project quotas as the root.  This would
>>> > > > > > also be conveniant as it could storge the used remapping
>>> > > > > > tables.
>>> > > > >
>>> > > > > So this would be like the current project quota except set on
>>> > > > > a subtree?  I could see it being done that way but I don't
>>> > > > > see what advantage it has over using flags in the subtree
>>> > > > > itself (the mapping is known based on the mount namespace, so
>>> > > > > there's really only a single bit of information to store).
>>> > > >
>>> > > > projects (which are the underling concept for project quotas)
>>> > > > are per-subtree in practice - the flag is set on an inode and
>>> > > > then all directories and files underneath inherit the project
>>> > > > ID, hardlinking outside a project is prohinited.
>>> > >
>>> > > I'm interested in having a VFS-level way to do more than just a
>>> > > shift; I'd like to be able to arbitrarily remap IDs between
>>> > > what's on disk and the system IDs.
>>> >
>>> > OK, so the shift is effectively an arbitrary remap because it
>>> > allows multiple ranges to be mapped (althought the userns currently
>>> > imposes a maximum number of five extents but that limit is a bit
>>> > arbitrary just to try to limit the amount of space the
>>> > parametrisation takes).  See
>>> > kernel/user_namespace.c:map_id_up/down()
>>> >
>>> > >   If we're talking about developing a VFS-level solution for
>>> > > this, I'd like to avoid limiting it to just a shift.  (A
>>> > > shift/range would definitely be the simplest solution for many
>>> > > common container cases, but not all.)
>>> >
>>> > I assume the above satisfies you on this point, but raises the
>>> > question: do you want an arbitrary shift not parametrised by a user
>>> > namespace?  If so how many such shifts do you want ... giving some
>>> > details of the use case would be helpful.
>>>
>>> The limit of five extents means this may not work in the most general
>>> case, no.
>>
>> That's not an API limit, so it can be changed if there's a need.  The
>> problem was merely how to parametrise a mapping without taking too much
>> space.
>>
>>> One use case: given an on-disk filesystem, its name-to-number
>>> mapping, and your host name-to-number mapping, mount the filesystem
>>> with all the UIDs bidirectionally mapped to those on your host
>>> system.
>>
>> This is pretty much what the s_user_ns does.
>>
>>> Another use case: given an on-disk filesystem with potentially
>>> arbitrary UIDs (not necessarily in a clean contiguous block), and a
>>> pile of unprivileged UIDs, mount the filesystem such that every on
>>> -disk UID gets a unique unprivileged UID.
>>
>> So is this.  Basically anything that begins by mounting gets a super
>> block and can use the s_user_ns to map from the filesystem view to the
>> kernel view of ids.  Apart from greater sophistication in the
>> parametrisation, it sounds like we have all the machinery you need.
>>  I'm sure the containers people will consider reasonable patches to
>> change this.
>
> Yes.
>
> And to be clear we have all of that merged now and mostly present and
> hooked up in all filesystems without any shiftfs like changes needed.
>
> To use this with a filesystem a last pass needs to be had to verify that
> the cases where something does not map are handled cleanly.

Still this does not answer the question how to dynamically
*attach/share* data or read-only volumes as defined by
orchestration/container tools into several containers. Am I missing
something or is the plan to have per superblock mount for each one ?

-- 
tixxdz


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-15 Thread Djalal Harouni
On Mon, Feb 13, 2017 at 11:15 AM, Eric W. Biederman
 wrote:
> James Bottomley  writes:
>
>> On Thu, 2017-02-09 at 02:36 -0800, Josh Triplett wrote:
>>> On Wed, Feb 08, 2017 at 07:22:45AM -0800, James Bottomley wrote:
>>> > On Tue, 2017-02-07 at 17:54 -0800, Josh Triplett wrote:
>>> > > On Tue, Feb 07, 2017 at 11:49:33AM -0800, Christoph Hellwig
>>> > > wrote:
>>> > > > On Tue, Feb 07, 2017 at 11:02:03AM -0800, James Bottomley
>>> > > > wrote:
>>> > > > > >   Another option would be to require something like a
>>> > > > > > project as used for project quotas as the root.  This would
>>> > > > > > also be conveniant as it could storge the used remapping
>>> > > > > > tables.
>>> > > > >
>>> > > > > So this would be like the current project quota except set on
>>> > > > > a subtree?  I could see it being done that way but I don't
>>> > > > > see what advantage it has over using flags in the subtree
>>> > > > > itself (the mapping is known based on the mount namespace, so
>>> > > > > there's really only a single bit of information to store).
>>> > > >
>>> > > > projects (which are the underling concept for project quotas)
>>> > > > are per-subtree in practice - the flag is set on an inode and
>>> > > > then all directories and files underneath inherit the project
>>> > > > ID, hardlinking outside a project is prohinited.
>>> > >
>>> > > I'm interested in having a VFS-level way to do more than just a
>>> > > shift; I'd like to be able to arbitrarily remap IDs between
>>> > > what's on disk and the system IDs.
>>> >
>>> > OK, so the shift is effectively an arbitrary remap because it
>>> > allows multiple ranges to be mapped (althought the userns currently
>>> > imposes a maximum number of five extents but that limit is a bit
>>> > arbitrary just to try to limit the amount of space the
>>> > parametrisation takes).  See
>>> > kernel/user_namespace.c:map_id_up/down()
>>> >
>>> > >   If we're talking about developing a VFS-level solution for
>>> > > this, I'd like to avoid limiting it to just a shift.  (A
>>> > > shift/range would definitely be the simplest solution for many
>>> > > common container cases, but not all.)
>>> >
>>> > I assume the above satisfies you on this point, but raises the
>>> > question: do you want an arbitrary shift not parametrised by a user
>>> > namespace?  If so how many such shifts do you want ... giving some
>>> > details of the use case would be helpful.
>>>
>>> The limit of five extents means this may not work in the most general
>>> case, no.
>>
>> That's not an API limit, so it can be changed if there's a need.  The
>> problem was merely how to parametrise a mapping without taking too much
>> space.
>>
>>> One use case: given an on-disk filesystem, its name-to-number
>>> mapping, and your host name-to-number mapping, mount the filesystem
>>> with all the UIDs bidirectionally mapped to those on your host
>>> system.
>>
>> This is pretty much what the s_user_ns does.
>>
>>> Another use case: given an on-disk filesystem with potentially
>>> arbitrary UIDs (not necessarily in a clean contiguous block), and a
>>> pile of unprivileged UIDs, mount the filesystem such that every on
>>> -disk UID gets a unique unprivileged UID.
>>
>> So is this.  Basically anything that begins by mounting gets a super
>> block and can use the s_user_ns to map from the filesystem view to the
>> kernel view of ids.  Apart from greater sophistication in the
>> parametrisation, it sounds like we have all the machinery you need.
>>  I'm sure the containers people will consider reasonable patches to
>> change this.
>
> Yes.
>
> And to be clear we have all of that merged now and mostly present and
> hooked up in all filesystems without any shiftfs like changes needed.
>
> To use this with a filesystem a last pass needs to be had to verify that
> the cases where something does not map are handled cleanly.

Still this does not answer the question how to dynamically
*attach/share* data or read-only volumes as defined by
orchestration/container tools into several containers. Am I missing
something or is the plan to have per superblock mount for each one ?

-- 
tixxdz


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-14 Thread James Bottomley
On Tue, 2017-02-14 at 18:03 -0500, Vivek Goyal wrote:
> On Sun, Feb 05, 2017 at 05:18:11PM -0800, James Bottomley wrote:
> 
> [..]
> > >  shiftfs is going to miss out on overlayfs bug fixes related to
> > > user 
> > > credentials differ from mounter credentials, like fd3220d ("ovl: 
> > > update S_ISGID when setting posix ACLs"). I am not sure that this
> > > specific case is relevant to shiftfs, but there could be other.
> > 
> > OK, so shiftfs doesn't have this bug and the reason why is
> > illustrative: basically shiftfs does three things
> > 
> >1. lookups via a uid/gid shifted dentry cache
> >2. shifted credential inode operations permission checks on the
> >   underlying filesystem
> >3. location marking for unprivileged mount
> > 
> > I think we've already seen that 1. isn't from overlayfs but the
> > functionality could be added to overlayfs, I suppose.  The big 
> > problem is 2.  The overlayfs code emulates the permission checks, 
> > which makes it rather complex (this is where you get your bugs like 
> > the above from).  I did actually look at adding 2. to overlayfs on 
> > the theory that a single layer overlay might be closest to what 
> > this is, but eventually concluded I'd have to take the special 
> > cases and add a whole lot more to them ... it really would increase 
> > the maintenance burden substantially and make the code an
> > unreadable rats nest.
> 
> Hi James,
> 
> If we merge this functionality in overlayfs, then we could avoid 
> extra copy of dentry/inode and that might be a significant advantage.

I made that argument to Viro originally when I tried to do all lookups
via the underlying cache.  In the end, it's 192 bytes per dentry and
584 per inode, all of which are reclaimable, so it's not much of an
advantage and it is a great simplification to the code.  In general if
you have a natural separation, you should make the layers reflect it.

My container use case doesn't use overlayfs currently, so to me it
wouldn't provide any advantage whatsoever.

> W.r.t permission checks, I am wondering will it make sense to do what
> overlayfs is doing for shiftfs. That is permission is checked on
> two inodes. We use creds of task for checking permission on
> shiftfs/overlay inode and mounter's creds on real inode.

The mounter's creds for overlay are usually admin ones, so it's local
permission check asks should I? and the later one asks can I? (as in
would my original admin creds allow this).  In many ways, overlayfs is
ignoring the fact that the underlying ->permissions() call might have
failed for some good reason on the current creds.  I don't think any
serious trouble results from this but it strikes me as icky.

> Given we have already shifted the uid/gid for shiftfs inode, I am 
> wondering that why can't we simply call generic_permission(shiftfs_in
> ode, mask) directly in the context of caller. Something like..
> 
> shiftfs_permission() {
>   err = generic_permission(inode, mask);
>   if (err)
>   return err;
> 
>   switch_to_mounter_creds;
>   err = inode_permission(reali, mask);
>   revert_creds();
> 
>   return err;
> }

Because if the reali->d_iop->permission exists, you should use it.  You
could argue shiftfs_permission should be

if (iop->permission) {
oldcred = shiftfs_new_creds(, inode->i_sb);
err = iop->permission(reali, mask);
shiftfs_old_creds(oldcred, );
} else
err = generic_permission(inode, mask);

But really that's a small optimisation.

James



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-14 Thread James Bottomley
On Tue, 2017-02-14 at 18:03 -0500, Vivek Goyal wrote:
> On Sun, Feb 05, 2017 at 05:18:11PM -0800, James Bottomley wrote:
> 
> [..]
> > >  shiftfs is going to miss out on overlayfs bug fixes related to
> > > user 
> > > credentials differ from mounter credentials, like fd3220d ("ovl: 
> > > update S_ISGID when setting posix ACLs"). I am not sure that this
> > > specific case is relevant to shiftfs, but there could be other.
> > 
> > OK, so shiftfs doesn't have this bug and the reason why is
> > illustrative: basically shiftfs does three things
> > 
> >1. lookups via a uid/gid shifted dentry cache
> >2. shifted credential inode operations permission checks on the
> >   underlying filesystem
> >3. location marking for unprivileged mount
> > 
> > I think we've already seen that 1. isn't from overlayfs but the
> > functionality could be added to overlayfs, I suppose.  The big 
> > problem is 2.  The overlayfs code emulates the permission checks, 
> > which makes it rather complex (this is where you get your bugs like 
> > the above from).  I did actually look at adding 2. to overlayfs on 
> > the theory that a single layer overlay might be closest to what 
> > this is, but eventually concluded I'd have to take the special 
> > cases and add a whole lot more to them ... it really would increase 
> > the maintenance burden substantially and make the code an
> > unreadable rats nest.
> 
> Hi James,
> 
> If we merge this functionality in overlayfs, then we could avoid 
> extra copy of dentry/inode and that might be a significant advantage.

I made that argument to Viro originally when I tried to do all lookups
via the underlying cache.  In the end, it's 192 bytes per dentry and
584 per inode, all of which are reclaimable, so it's not much of an
advantage and it is a great simplification to the code.  In general if
you have a natural separation, you should make the layers reflect it.

My container use case doesn't use overlayfs currently, so to me it
wouldn't provide any advantage whatsoever.

> W.r.t permission checks, I am wondering will it make sense to do what
> overlayfs is doing for shiftfs. That is permission is checked on
> two inodes. We use creds of task for checking permission on
> shiftfs/overlay inode and mounter's creds on real inode.

The mounter's creds for overlay are usually admin ones, so it's local
permission check asks should I? and the later one asks can I? (as in
would my original admin creds allow this).  In many ways, overlayfs is
ignoring the fact that the underlying ->permissions() call might have
failed for some good reason on the current creds.  I don't think any
serious trouble results from this but it strikes me as icky.

> Given we have already shifted the uid/gid for shiftfs inode, I am 
> wondering that why can't we simply call generic_permission(shiftfs_in
> ode, mask) directly in the context of caller. Something like..
> 
> shiftfs_permission() {
>   err = generic_permission(inode, mask);
>   if (err)
>   return err;
> 
>   switch_to_mounter_creds;
>   err = inode_permission(reali, mask);
>   revert_creds();
> 
>   return err;
> }

Because if the reali->d_iop->permission exists, you should use it.  You
could argue shiftfs_permission should be

if (iop->permission) {
oldcred = shiftfs_new_creds(, inode->i_sb);
err = iop->permission(reali, mask);
shiftfs_old_creds(oldcred, );
} else
err = generic_permission(inode, mask);

But really that's a small optimisation.

James



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-14 Thread Vivek Goyal
On Sun, Feb 05, 2017 at 05:18:11PM -0800, James Bottomley wrote:

[..]
> >  shiftfs is going to miss out on overlayfs bug fixes related to user 
> > credentials differ from mounter credentials, like fd3220d ("ovl: 
> > update S_ISGID when setting posix ACLs"). I am not sure that this 
> > specific case is relevant to shiftfs, but there could be other.
> 
> OK, so shiftfs doesn't have this bug and the reason why is
> illustrative: basically shiftfs does three things
> 
>1. lookups via a uid/gid shifted dentry cache
>2. shifted credential inode operations permission checks on the
>   underlying filesystem
>3. location marking for unprivileged mount
> 
> I think we've already seen that 1. isn't from overlayfs but the
> functionality could be added to overlayfs, I suppose.  The big problem
> is 2.  The overlayfs code emulates the permission checks, which makes
> it rather complex (this is where you get your bugs like the above
> from).  I did actually look at adding 2. to overlayfs on the theory
> that a single layer overlay might be closest to what this is, but
> eventually concluded I'd have to take the special cases and add a whole
> lot more to them ... it really would increase the maintenance burden
> substantially and make the code an unreadable rats nest.

Hi James,

If we merge this functionality in overlayfs, then we could avoid extra
copy of dentry/inode and that might be a significant advantage.

W.r.t permission checks, I am wondering will it make sense to do what
overlayfs is doing for shiftfs. That is permission is checked on
two inodes. We use creds of task for checking permission on
shiftfs/overlay inode and mounter's creds on real inode.

Given we have already shifted the uid/gid for shiftfs inode, I am 
wondering that why can't we simply call generic_permission(shiftfs_inode,
mask) directly in the context of caller. Something like..

shiftfs_permission() {
  err = generic_permission(inode, mask);
  if (err)
return err;

  switch_to_mounter_creds;
  err = inode_permission(reali, mask);
  revert_creds();

  return err;
}

Vivek


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-14 Thread Vivek Goyal
On Sun, Feb 05, 2017 at 05:18:11PM -0800, James Bottomley wrote:

[..]
> >  shiftfs is going to miss out on overlayfs bug fixes related to user 
> > credentials differ from mounter credentials, like fd3220d ("ovl: 
> > update S_ISGID when setting posix ACLs"). I am not sure that this 
> > specific case is relevant to shiftfs, but there could be other.
> 
> OK, so shiftfs doesn't have this bug and the reason why is
> illustrative: basically shiftfs does three things
> 
>1. lookups via a uid/gid shifted dentry cache
>2. shifted credential inode operations permission checks on the
>   underlying filesystem
>3. location marking for unprivileged mount
> 
> I think we've already seen that 1. isn't from overlayfs but the
> functionality could be added to overlayfs, I suppose.  The big problem
> is 2.  The overlayfs code emulates the permission checks, which makes
> it rather complex (this is where you get your bugs like the above
> from).  I did actually look at adding 2. to overlayfs on the theory
> that a single layer overlay might be closest to what this is, but
> eventually concluded I'd have to take the special cases and add a whole
> lot more to them ... it really would increase the maintenance burden
> substantially and make the code an unreadable rats nest.

Hi James,

If we merge this functionality in overlayfs, then we could avoid extra
copy of dentry/inode and that might be a significant advantage.

W.r.t permission checks, I am wondering will it make sense to do what
overlayfs is doing for shiftfs. That is permission is checked on
two inodes. We use creds of task for checking permission on
shiftfs/overlay inode and mounter's creds on real inode.

Given we have already shifted the uid/gid for shiftfs inode, I am 
wondering that why can't we simply call generic_permission(shiftfs_inode,
mask) directly in the context of caller. Something like..

shiftfs_permission() {
  err = generic_permission(inode, mask);
  if (err)
return err;

  switch_to_mounter_creds;
  err = inode_permission(reali, mask);
  revert_creds();

  return err;
}

Vivek


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-13 Thread Eric W. Biederman
James Bottomley  writes:

> On Thu, 2017-02-09 at 02:36 -0800, Josh Triplett wrote:
>> On Wed, Feb 08, 2017 at 07:22:45AM -0800, James Bottomley wrote:
>> > On Tue, 2017-02-07 at 17:54 -0800, Josh Triplett wrote:
>> > > On Tue, Feb 07, 2017 at 11:49:33AM -0800, Christoph Hellwig
>> > > wrote:
>> > > > On Tue, Feb 07, 2017 at 11:02:03AM -0800, James Bottomley
>> > > > wrote:
>> > > > > >   Another option would be to require something like a 
>> > > > > > project as used for project quotas as the root.  This would 
>> > > > > > also be conveniant as it could storge the used remapping
>> > > > > > tables.
>> > > > > 
>> > > > > So this would be like the current project quota except set on 
>> > > > > a subtree?  I could see it being done that way but I don't 
>> > > > > see what advantage it has over using flags in the subtree 
>> > > > > itself (the mapping is known based on the mount namespace, so 
>> > > > > there's really only a single bit of information to store).
>> > > > 
>> > > > projects (which are the underling concept for project quotas) 
>> > > > are per-subtree in practice - the flag is set on an inode and 
>> > > > then all directories and files underneath inherit the project 
>> > > > ID, hardlinking outside a project is prohinited.
>> > > 
>> > > I'm interested in having a VFS-level way to do more than just a 
>> > > shift; I'd like to be able to arbitrarily remap IDs between 
>> > > what's on disk and the system IDs.
>> > 
>> > OK, so the shift is effectively an arbitrary remap because it 
>> > allows multiple ranges to be mapped (althought the userns currently
>> > imposes a maximum number of five extents but that limit is a bit 
>> > arbitrary just to try to limit the amount of space the 
>> > parametrisation takes).  See
>> > kernel/user_namespace.c:map_id_up/down()
>> > 
>> > >   If we're talking about developing a VFS-level solution for 
>> > > this, I'd like to avoid limiting it to just a shift.  (A 
>> > > shift/range would definitely be the simplest solution for many 
>> > > common container cases, but not all.)
>> > 
>> > I assume the above satisfies you on this point, but raises the
>> > question: do you want an arbitrary shift not parametrised by a user
>> > namespace?  If so how many such shifts do you want ... giving some
>> > details of the use case would be helpful.
>> 
>> The limit of five extents means this may not work in the most general
>> case, no.
>
> That's not an API limit, so it can be changed if there's a need.  The
> problem was merely how to parametrise a mapping without taking too much
> space.
>
>> One use case: given an on-disk filesystem, its name-to-number 
>> mapping, and your host name-to-number mapping, mount the filesystem 
>> with all the UIDs bidirectionally mapped to those on your host
>> system.
>
> This is pretty much what the s_user_ns does.
>
>> Another use case: given an on-disk filesystem with potentially 
>> arbitrary UIDs (not necessarily in a clean contiguous block), and a 
>> pile of unprivileged UIDs, mount the filesystem such that every on
>> -disk UID gets a unique unprivileged UID.
>
> So is this.  Basically anything that begins by mounting gets a super
> block and can use the s_user_ns to map from the filesystem view to the
> kernel view of ids.  Apart from greater sophistication in the
> parametrisation, it sounds like we have all the machinery you need. 
>  I'm sure the containers people will consider reasonable patches to
> change this.

Yes.

And to be clear we have all of that merged now and mostly present and
hooked up in all filesystems without any shiftfs like changes needed.

To use this with a filesystem a last pass needs to be had to verify that
the cases where something does not map are handled cleanly.

Eric



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-13 Thread Eric W. Biederman
James Bottomley  writes:

> On Thu, 2017-02-09 at 02:36 -0800, Josh Triplett wrote:
>> On Wed, Feb 08, 2017 at 07:22:45AM -0800, James Bottomley wrote:
>> > On Tue, 2017-02-07 at 17:54 -0800, Josh Triplett wrote:
>> > > On Tue, Feb 07, 2017 at 11:49:33AM -0800, Christoph Hellwig
>> > > wrote:
>> > > > On Tue, Feb 07, 2017 at 11:02:03AM -0800, James Bottomley
>> > > > wrote:
>> > > > > >   Another option would be to require something like a 
>> > > > > > project as used for project quotas as the root.  This would 
>> > > > > > also be conveniant as it could storge the used remapping
>> > > > > > tables.
>> > > > > 
>> > > > > So this would be like the current project quota except set on 
>> > > > > a subtree?  I could see it being done that way but I don't 
>> > > > > see what advantage it has over using flags in the subtree 
>> > > > > itself (the mapping is known based on the mount namespace, so 
>> > > > > there's really only a single bit of information to store).
>> > > > 
>> > > > projects (which are the underling concept for project quotas) 
>> > > > are per-subtree in practice - the flag is set on an inode and 
>> > > > then all directories and files underneath inherit the project 
>> > > > ID, hardlinking outside a project is prohinited.
>> > > 
>> > > I'm interested in having a VFS-level way to do more than just a 
>> > > shift; I'd like to be able to arbitrarily remap IDs between 
>> > > what's on disk and the system IDs.
>> > 
>> > OK, so the shift is effectively an arbitrary remap because it 
>> > allows multiple ranges to be mapped (althought the userns currently
>> > imposes a maximum number of five extents but that limit is a bit 
>> > arbitrary just to try to limit the amount of space the 
>> > parametrisation takes).  See
>> > kernel/user_namespace.c:map_id_up/down()
>> > 
>> > >   If we're talking about developing a VFS-level solution for 
>> > > this, I'd like to avoid limiting it to just a shift.  (A 
>> > > shift/range would definitely be the simplest solution for many 
>> > > common container cases, but not all.)
>> > 
>> > I assume the above satisfies you on this point, but raises the
>> > question: do you want an arbitrary shift not parametrised by a user
>> > namespace?  If so how many such shifts do you want ... giving some
>> > details of the use case would be helpful.
>> 
>> The limit of five extents means this may not work in the most general
>> case, no.
>
> That's not an API limit, so it can be changed if there's a need.  The
> problem was merely how to parametrise a mapping without taking too much
> space.
>
>> One use case: given an on-disk filesystem, its name-to-number 
>> mapping, and your host name-to-number mapping, mount the filesystem 
>> with all the UIDs bidirectionally mapped to those on your host
>> system.
>
> This is pretty much what the s_user_ns does.
>
>> Another use case: given an on-disk filesystem with potentially 
>> arbitrary UIDs (not necessarily in a clean contiguous block), and a 
>> pile of unprivileged UIDs, mount the filesystem such that every on
>> -disk UID gets a unique unprivileged UID.
>
> So is this.  Basically anything that begins by mounting gets a super
> block and can use the s_user_ns to map from the filesystem view to the
> kernel view of ids.  Apart from greater sophistication in the
> parametrisation, it sounds like we have all the machinery you need. 
>  I'm sure the containers people will consider reasonable patches to
> change this.

Yes.

And to be clear we have all of that merged now and mostly present and
hooked up in all filesystems without any shiftfs like changes needed.

To use this with a filesystem a last pass needs to be had to verify that
the cases where something does not map are handled cleanly.

Eric



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-09 Thread James Bottomley
On Thu, 2017-02-09 at 02:36 -0800, Josh Triplett wrote:
> On Wed, Feb 08, 2017 at 07:22:45AM -0800, James Bottomley wrote:
> > On Tue, 2017-02-07 at 17:54 -0800, Josh Triplett wrote:
> > > On Tue, Feb 07, 2017 at 11:49:33AM -0800, Christoph Hellwig
> > > wrote:
> > > > On Tue, Feb 07, 2017 at 11:02:03AM -0800, James Bottomley
> > > > wrote:
> > > > > >   Another option would be to require something like a 
> > > > > > project as used for project quotas as the root.  This would 
> > > > > > also be conveniant as it could storge the used remapping
> > > > > > tables.
> > > > > 
> > > > > So this would be like the current project quota except set on 
> > > > > a subtree?  I could see it being done that way but I don't 
> > > > > see what advantage it has over using flags in the subtree 
> > > > > itself (the mapping is known based on the mount namespace, so 
> > > > > there's really only a single bit of information to store).
> > > > 
> > > > projects (which are the underling concept for project quotas) 
> > > > are per-subtree in practice - the flag is set on an inode and 
> > > > then all directories and files underneath inherit the project 
> > > > ID, hardlinking outside a project is prohinited.
> > > 
> > > I'm interested in having a VFS-level way to do more than just a 
> > > shift; I'd like to be able to arbitrarily remap IDs between 
> > > what's on disk and the system IDs.
> > 
> > OK, so the shift is effectively an arbitrary remap because it 
> > allows multiple ranges to be mapped (althought the userns currently
> > imposes a maximum number of five extents but that limit is a bit 
> > arbitrary just to try to limit the amount of space the 
> > parametrisation takes).  See
> > kernel/user_namespace.c:map_id_up/down()
> > 
> > >   If we're talking about developing a VFS-level solution for 
> > > this, I'd like to avoid limiting it to just a shift.  (A 
> > > shift/range would definitely be the simplest solution for many 
> > > common container cases, but not all.)
> > 
> > I assume the above satisfies you on this point, but raises the
> > question: do you want an arbitrary shift not parametrised by a user
> > namespace?  If so how many such shifts do you want ... giving some
> > details of the use case would be helpful.
> 
> The limit of five extents means this may not work in the most general
> case, no.

That's not an API limit, so it can be changed if there's a need.  The
problem was merely how to parametrise a mapping without taking too much
space.

> One use case: given an on-disk filesystem, its name-to-number 
> mapping, and your host name-to-number mapping, mount the filesystem 
> with all the UIDs bidirectionally mapped to those on your host
> system.

This is pretty much what the s_user_ns does.

> Another use case: given an on-disk filesystem with potentially 
> arbitrary UIDs (not necessarily in a clean contiguous block), and a 
> pile of unprivileged UIDs, mount the filesystem such that every on
> -disk UID gets a unique unprivileged UID.

So is this.  Basically anything that begins by mounting gets a super
block and can use the s_user_ns to map from the filesystem view to the
kernel view of ids.  Apart from greater sophistication in the
parametrisation, it sounds like we have all the machinery you need. 
 I'm sure the containers people will consider reasonable patches to
change this.

James

> (I have some additional use cases, but they would require the ability 
> to extend the mapping on the fly without remounting.)
> 



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-09 Thread James Bottomley
On Thu, 2017-02-09 at 02:36 -0800, Josh Triplett wrote:
> On Wed, Feb 08, 2017 at 07:22:45AM -0800, James Bottomley wrote:
> > On Tue, 2017-02-07 at 17:54 -0800, Josh Triplett wrote:
> > > On Tue, Feb 07, 2017 at 11:49:33AM -0800, Christoph Hellwig
> > > wrote:
> > > > On Tue, Feb 07, 2017 at 11:02:03AM -0800, James Bottomley
> > > > wrote:
> > > > > >   Another option would be to require something like a 
> > > > > > project as used for project quotas as the root.  This would 
> > > > > > also be conveniant as it could storge the used remapping
> > > > > > tables.
> > > > > 
> > > > > So this would be like the current project quota except set on 
> > > > > a subtree?  I could see it being done that way but I don't 
> > > > > see what advantage it has over using flags in the subtree 
> > > > > itself (the mapping is known based on the mount namespace, so 
> > > > > there's really only a single bit of information to store).
> > > > 
> > > > projects (which are the underling concept for project quotas) 
> > > > are per-subtree in practice - the flag is set on an inode and 
> > > > then all directories and files underneath inherit the project 
> > > > ID, hardlinking outside a project is prohinited.
> > > 
> > > I'm interested in having a VFS-level way to do more than just a 
> > > shift; I'd like to be able to arbitrarily remap IDs between 
> > > what's on disk and the system IDs.
> > 
> > OK, so the shift is effectively an arbitrary remap because it 
> > allows multiple ranges to be mapped (althought the userns currently
> > imposes a maximum number of five extents but that limit is a bit 
> > arbitrary just to try to limit the amount of space the 
> > parametrisation takes).  See
> > kernel/user_namespace.c:map_id_up/down()
> > 
> > >   If we're talking about developing a VFS-level solution for 
> > > this, I'd like to avoid limiting it to just a shift.  (A 
> > > shift/range would definitely be the simplest solution for many 
> > > common container cases, but not all.)
> > 
> > I assume the above satisfies you on this point, but raises the
> > question: do you want an arbitrary shift not parametrised by a user
> > namespace?  If so how many such shifts do you want ... giving some
> > details of the use case would be helpful.
> 
> The limit of five extents means this may not work in the most general
> case, no.

That's not an API limit, so it can be changed if there's a need.  The
problem was merely how to parametrise a mapping without taking too much
space.

> One use case: given an on-disk filesystem, its name-to-number 
> mapping, and your host name-to-number mapping, mount the filesystem 
> with all the UIDs bidirectionally mapped to those on your host
> system.

This is pretty much what the s_user_ns does.

> Another use case: given an on-disk filesystem with potentially 
> arbitrary UIDs (not necessarily in a clean contiguous block), and a 
> pile of unprivileged UIDs, mount the filesystem such that every on
> -disk UID gets a unique unprivileged UID.

So is this.  Basically anything that begins by mounting gets a super
block and can use the s_user_ns to map from the filesystem view to the
kernel view of ids.  Apart from greater sophistication in the
parametrisation, it sounds like we have all the machinery you need. 
 I'm sure the containers people will consider reasonable patches to
change this.

James

> (I have some additional use cases, but they would require the ability 
> to extend the mapping on the fly without remounting.)
> 



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-09 Thread Josh Triplett
On Wed, Feb 08, 2017 at 07:22:45AM -0800, James Bottomley wrote:
> On Tue, 2017-02-07 at 17:54 -0800, Josh Triplett wrote:
> > On Tue, Feb 07, 2017 at 11:49:33AM -0800, Christoph Hellwig wrote:
> > > On Tue, Feb 07, 2017 at 11:02:03AM -0800, James Bottomley wrote:
> > > > >   Another option would be to require something like a project
> > > > > as used 
> > > > > for project quotas as the root.  This would also be conveniant
> > > > > as it 
> > > > > could storge the used remapping tables.
> > > > 
> > > > So this would be like the current project quota except set on a
> > > > subtree?  I could see it being done that way but I don't see what
> > > > advantage it has over using flags in the subtree itself (the
> > > > mapping is
> > > > known based on the mount namespace, so there's really only a
> > > > single bit
> > > > of information to store).
> > > 
> > > projects (which are the underling concept for project quotas) are
> > > per-subtree in practice - the flag is set on an inode and then
> > > all directories and files underneath inherit the project ID,
> > > hardlinking outside a project is prohinited.
> > 
> > I'm interested in having a VFS-level way to do more than just a 
> > shift; I'd like to be able to arbitrarily remap IDs between what's on 
> > disk and the system IDs.
> 
> OK, so the shift is effectively an arbitrary remap because it allows
> multiple ranges to be mapped (althought the userns currently imposes a
> maximum number of five extents but that limit is a bit arbitrary just
> to try to limit the amount of space the parametrisation takes).  See
> kernel/user_namespace.c:map_id_up/down()
> 
> >   If we're talking about developing a VFS-level solution for this, 
> > I'd like to avoid limiting it to just a shift.  (A shift/range
> > would definitely be the simplest solution for many common container
> > cases, but not all.)
> 
> I assume the above satisfies you on this point, but raises the
> question: do you want an arbitrary shift not parametrised by a user
> namespace?  If so how many such shifts do you want ... giving some
> details of the use case would be helpful.

The limit of five extents means this may not work in the most general
case, no.

One use case: given an on-disk filesystem, its name-to-number mapping,
and your host name-to-number mapping, mount the filesystem with all the
UIDs bidirectionally mapped to those on your host system.

Another use case: given an on-disk filesystem with potentially arbitrary
UIDs (not necessarily in a clean contiguous block), and a pile of
unprivileged UIDs, mount the filesystem such that every on-disk UID gets
a unique unprivileged UID.

(I have some additional use cases, but they would require the ability to
extend the mapping on the fly without remounting.)


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-09 Thread Josh Triplett
On Wed, Feb 08, 2017 at 07:22:45AM -0800, James Bottomley wrote:
> On Tue, 2017-02-07 at 17:54 -0800, Josh Triplett wrote:
> > On Tue, Feb 07, 2017 at 11:49:33AM -0800, Christoph Hellwig wrote:
> > > On Tue, Feb 07, 2017 at 11:02:03AM -0800, James Bottomley wrote:
> > > > >   Another option would be to require something like a project
> > > > > as used 
> > > > > for project quotas as the root.  This would also be conveniant
> > > > > as it 
> > > > > could storge the used remapping tables.
> > > > 
> > > > So this would be like the current project quota except set on a
> > > > subtree?  I could see it being done that way but I don't see what
> > > > advantage it has over using flags in the subtree itself (the
> > > > mapping is
> > > > known based on the mount namespace, so there's really only a
> > > > single bit
> > > > of information to store).
> > > 
> > > projects (which are the underling concept for project quotas) are
> > > per-subtree in practice - the flag is set on an inode and then
> > > all directories and files underneath inherit the project ID,
> > > hardlinking outside a project is prohinited.
> > 
> > I'm interested in having a VFS-level way to do more than just a 
> > shift; I'd like to be able to arbitrarily remap IDs between what's on 
> > disk and the system IDs.
> 
> OK, so the shift is effectively an arbitrary remap because it allows
> multiple ranges to be mapped (althought the userns currently imposes a
> maximum number of five extents but that limit is a bit arbitrary just
> to try to limit the amount of space the parametrisation takes).  See
> kernel/user_namespace.c:map_id_up/down()
> 
> >   If we're talking about developing a VFS-level solution for this, 
> > I'd like to avoid limiting it to just a shift.  (A shift/range
> > would definitely be the simplest solution for many common container
> > cases, but not all.)
> 
> I assume the above satisfies you on this point, but raises the
> question: do you want an arbitrary shift not parametrised by a user
> namespace?  If so how many such shifts do you want ... giving some
> details of the use case would be helpful.

The limit of five extents means this may not work in the most general
case, no.

One use case: given an on-disk filesystem, its name-to-number mapping,
and your host name-to-number mapping, mount the filesystem with all the
UIDs bidirectionally mapped to those on your host system.

Another use case: given an on-disk filesystem with potentially arbitrary
UIDs (not necessarily in a clean contiguous block), and a pile of
unprivileged UIDs, mount the filesystem such that every on-disk UID gets
a unique unprivileged UID.

(I have some additional use cases, but they would require the ability to
extend the mapping on the fly without remounting.)


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-08 Thread James Bottomley
On Tue, 2017-02-07 at 17:54 -0800, Josh Triplett wrote:
> On Tue, Feb 07, 2017 at 11:49:33AM -0800, Christoph Hellwig wrote:
> > On Tue, Feb 07, 2017 at 11:02:03AM -0800, James Bottomley wrote:
> > > >   Another option would be to require something like a project
> > > > as used 
> > > > for project quotas as the root.  This would also be conveniant
> > > > as it 
> > > > could storge the used remapping tables.
> > > 
> > > So this would be like the current project quota except set on a
> > > subtree?  I could see it being done that way but I don't see what
> > > advantage it has over using flags in the subtree itself (the
> > > mapping is
> > > known based on the mount namespace, so there's really only a
> > > single bit
> > > of information to store).
> > 
> > projects (which are the underling concept for project quotas) are
> > per-subtree in practice - the flag is set on an inode and then
> > all directories and files underneath inherit the project ID,
> > hardlinking outside a project is prohinited.
> 
> I'm interested in having a VFS-level way to do more than just a 
> shift; I'd like to be able to arbitrarily remap IDs between what's on 
> disk and the system IDs.

OK, so the shift is effectively an arbitrary remap because it allows
multiple ranges to be mapped (althought the userns currently imposes a
maximum number of five extents but that limit is a bit arbitrary just
to try to limit the amount of space the parametrisation takes).  See
kernel/user_namespace.c:map_id_up/down()

>   If we're talking about developing a VFS-level solution for this, 
> I'd like to avoid limiting it to just a shift.  (A shift/range
> would definitely be the simplest solution for many common container
> cases, but not all.)

I assume the above satisfies you on this point, but raises the
question: do you want an arbitrary shift not parametrised by a user
namespace?  If so how many such shifts do you want ... giving some
details of the use case would be helpful.

James



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-08 Thread James Bottomley
On Tue, 2017-02-07 at 17:54 -0800, Josh Triplett wrote:
> On Tue, Feb 07, 2017 at 11:49:33AM -0800, Christoph Hellwig wrote:
> > On Tue, Feb 07, 2017 at 11:02:03AM -0800, James Bottomley wrote:
> > > >   Another option would be to require something like a project
> > > > as used 
> > > > for project quotas as the root.  This would also be conveniant
> > > > as it 
> > > > could storge the used remapping tables.
> > > 
> > > So this would be like the current project quota except set on a
> > > subtree?  I could see it being done that way but I don't see what
> > > advantage it has over using flags in the subtree itself (the
> > > mapping is
> > > known based on the mount namespace, so there's really only a
> > > single bit
> > > of information to store).
> > 
> > projects (which are the underling concept for project quotas) are
> > per-subtree in practice - the flag is set on an inode and then
> > all directories and files underneath inherit the project ID,
> > hardlinking outside a project is prohinited.
> 
> I'm interested in having a VFS-level way to do more than just a 
> shift; I'd like to be able to arbitrarily remap IDs between what's on 
> disk and the system IDs.

OK, so the shift is effectively an arbitrary remap because it allows
multiple ranges to be mapped (althought the userns currently imposes a
maximum number of five extents but that limit is a bit arbitrary just
to try to limit the amount of space the parametrisation takes).  See
kernel/user_namespace.c:map_id_up/down()

>   If we're talking about developing a VFS-level solution for this, 
> I'd like to avoid limiting it to just a shift.  (A shift/range
> would definitely be the simplest solution for many common container
> cases, but not all.)

I assume the above satisfies you on this point, but raises the
question: do you want an arbitrary shift not parametrised by a user
namespace?  If so how many such shifts do you want ... giving some
details of the use case would be helpful.

James



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-08 Thread James Bottomley
On Wed, 2017-02-08 at 08:44 +0200, Amir Goldstein wrote:
> On Wed, Feb 8, 2017 at 1:42 AM, James Bottomley
[...]
> > So I've been thinking about how to do this without subtree marking 
> > and yet retain the subtree properties similar to project id.  The
> > advantage would be that if it can be done using only inode 
> > properties, then none of the permission prototypes need change. 
> >  The only real subtree property we need is ability to bind into an 
> > unprivileged mount namespace, but we already have that.  The gotcha 
> > about marking inodes is that they're all or nothing, so every 
> > subtree that gets access to the inode inherits the mark.  This 
> > means that we cannot allow a user access to a marked inode without 
> > the cover of an unprivileged user namespace, but I think that's 
> > fixable in the permission check (basically if the inode is marked 
> > you *only* get access if you have a user_ns != init_user_ns and we 
> > do the permission shifts or you have user_ns == init_user_ns and
> > you are admin capable).
> > 
> 
> I didn't follow, but it sounds like your proposed solutions is only
> good for single level of userns nesting. Do you think you can
> redefine it in terms of "container root projid".

I don't quite understand what you're getting at.  user_ns mappings
nest, but what we see depends on where you're trying to look at it. 
 Let's take the kernel's view as the primary one.  That's the kuid_t. 
 The user has a different view, the uid_t and now we have the
filesystem view (no actual type for this).  The user view is produced
by from the kernel view by chaining up all the maps from the
current_user_ns and the filesystem view is produced by doing the same
thing for the s_user_ns.  So however many levels of user namespace
nesting we have operating, we only have three views of what an id is:
the user view, the kernel view and the filesystem view.  All nesting
does is change how those views are mapped but it doesn't alter the
number of views.

What the original shiftfs patches (not the ones that use s_user_ns) did
was to introduce effectively an inode view and map between the kernel
and the inode view using the shift mapping parameters; then the inode
view would get mapped through the s_user_ns to become the filesystem
view.  In the s_user_ns version of shiftfs (the current patches),
there's still an inode view, but we know that what we want to write to
disk is the user view, so effectively the user view and the inode view
become the same if the filesystem is marked otherwise the inode view
and the kernel view are the same if it isn't.  That's why I only need a
single bit to tell me if I'm mapping or not and there are two separate
regimes to check the permissions in: the user == inode view and the
kernel == inode view.

James



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-08 Thread James Bottomley
On Wed, 2017-02-08 at 08:44 +0200, Amir Goldstein wrote:
> On Wed, Feb 8, 2017 at 1:42 AM, James Bottomley
[...]
> > So I've been thinking about how to do this without subtree marking 
> > and yet retain the subtree properties similar to project id.  The
> > advantage would be that if it can be done using only inode 
> > properties, then none of the permission prototypes need change. 
> >  The only real subtree property we need is ability to bind into an 
> > unprivileged mount namespace, but we already have that.  The gotcha 
> > about marking inodes is that they're all or nothing, so every 
> > subtree that gets access to the inode inherits the mark.  This 
> > means that we cannot allow a user access to a marked inode without 
> > the cover of an unprivileged user namespace, but I think that's 
> > fixable in the permission check (basically if the inode is marked 
> > you *only* get access if you have a user_ns != init_user_ns and we 
> > do the permission shifts or you have user_ns == init_user_ns and
> > you are admin capable).
> > 
> 
> I didn't follow, but it sounds like your proposed solutions is only
> good for single level of userns nesting. Do you think you can
> redefine it in terms of "container root projid".

I don't quite understand what you're getting at.  user_ns mappings
nest, but what we see depends on where you're trying to look at it. 
 Let's take the kernel's view as the primary one.  That's the kuid_t. 
 The user has a different view, the uid_t and now we have the
filesystem view (no actual type for this).  The user view is produced
by from the kernel view by chaining up all the maps from the
current_user_ns and the filesystem view is produced by doing the same
thing for the s_user_ns.  So however many levels of user namespace
nesting we have operating, we only have three views of what an id is:
the user view, the kernel view and the filesystem view.  All nesting
does is change how those views are mapped but it doesn't alter the
number of views.

What the original shiftfs patches (not the ones that use s_user_ns) did
was to introduce effectively an inode view and map between the kernel
and the inode view using the shift mapping parameters; then the inode
view would get mapped through the s_user_ns to become the filesystem
view.  In the s_user_ns version of shiftfs (the current patches),
there's still an inode view, but we know that what we want to write to
disk is the user view, so effectively the user view and the inode view
become the same if the filesystem is marked otherwise the inode view
and the kernel view are the same if it isn't.  That's why I only need a
single bit to tell me if I'm mapping or not and there are two separate
regimes to check the permissions in: the user == inode view and the
kernel == inode view.

James



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-08 Thread James Bottomley
On Wed, 2017-02-08 at 08:44 +0200, Amir Goldstein wrote:
> On Wed, Feb 8, 2017 at 1:42 AM, James Bottomley
>  wrote:
> > On Tue, 2017-02-07 at 14:25 -0800, Christoph Hellwig wrote:
> > > On Tue, Feb 07, 2017 at 11:01:29PM +0200, Amir Goldstein wrote:
> > > > Project id's are not exactly "subtree" semantic, but 
> > > > inheritance semantics,
> > > > which is not the same when non empty directories get their
> > > > project
> > > > id changed.
> > > > Here is a recap:
> > > > https://lwn.net/Articles/623835/
> > > 
> > > Yes - but if we abuse them for containers we could refine the
> > > semantics to simply not allow change of project ids from inside
> > > containers based on say capabilities.
> > 
> 
> You mean something like this:
> https://lwn.net/Articles/632917/
> 
> With the suggested protected_projects, projid 0 (also inside 
> container) gets a special meaning, much like user 0, so we may do 
> interesting things with the projid that is mapped to 0.
> 
> > We can't really abuse projectid, it's part of the user namespace
> > mapping (for project quota).  What we can do is have a new id that
> > behaves like it.
> > 
> 
> Perhaps we *can* use projid without abusing it. userns already maps 
> projids, but there is no concept of "owning project" for a userns, 
> nor does it make a lot of sense, because projid is not part of the 
> credentials. But if we re-brand it as "container root projid", we can 
> try to use it for defining semantics to grant unprivileged access to
> a subtree.
> 
> The functionality you are trying to get with shiftfs mark does
> sounds a bit like "container root projid":
> - inodes with mapped projid MAY be uid/gid shifted
> - inodes with unmapped projid MAY NOT
> 
> I realize this may be very raw, but its a start. If you like this
> direction we can try to develop it.

So I don't think hijacking project id is the way to go.  If we do that
we interfere with using project quotas within containers.  Now that
project quotas work for both xfs and ext4, it's no longer really an xfs
specific feature.

I could see adding a shift on a per projectid basis, so project id
still had its quota meaning, but you could get the uid/gid shift from a
given project id.  However, the big kicker is that the only filesystems
you can actually set a projectid on (via the fsxattr) are ext4 and xfs.
 That's too few to make it work universally (we'd at least need btrfs
and possibly a few others).

However, that's just mechanism.  We can begin with a volatile mark and
work out how we want to store it later.  I think following projectid
properties is the important one, so the choice of whether to hijack, or
attach to projectid is preserved but not mandated.

James



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-08 Thread James Bottomley
On Wed, 2017-02-08 at 08:44 +0200, Amir Goldstein wrote:
> On Wed, Feb 8, 2017 at 1:42 AM, James Bottomley
>  wrote:
> > On Tue, 2017-02-07 at 14:25 -0800, Christoph Hellwig wrote:
> > > On Tue, Feb 07, 2017 at 11:01:29PM +0200, Amir Goldstein wrote:
> > > > Project id's are not exactly "subtree" semantic, but 
> > > > inheritance semantics,
> > > > which is not the same when non empty directories get their
> > > > project
> > > > id changed.
> > > > Here is a recap:
> > > > https://lwn.net/Articles/623835/
> > > 
> > > Yes - but if we abuse them for containers we could refine the
> > > semantics to simply not allow change of project ids from inside
> > > containers based on say capabilities.
> > 
> 
> You mean something like this:
> https://lwn.net/Articles/632917/
> 
> With the suggested protected_projects, projid 0 (also inside 
> container) gets a special meaning, much like user 0, so we may do 
> interesting things with the projid that is mapped to 0.
> 
> > We can't really abuse projectid, it's part of the user namespace
> > mapping (for project quota).  What we can do is have a new id that
> > behaves like it.
> > 
> 
> Perhaps we *can* use projid without abusing it. userns already maps 
> projids, but there is no concept of "owning project" for a userns, 
> nor does it make a lot of sense, because projid is not part of the 
> credentials. But if we re-brand it as "container root projid", we can 
> try to use it for defining semantics to grant unprivileged access to
> a subtree.
> 
> The functionality you are trying to get with shiftfs mark does
> sounds a bit like "container root projid":
> - inodes with mapped projid MAY be uid/gid shifted
> - inodes with unmapped projid MAY NOT
> 
> I realize this may be very raw, but its a start. If you like this
> direction we can try to develop it.

So I don't think hijacking project id is the way to go.  If we do that
we interfere with using project quotas within containers.  Now that
project quotas work for both xfs and ext4, it's no longer really an xfs
specific feature.

I could see adding a shift on a per projectid basis, so project id
still had its quota meaning, but you could get the uid/gid shift from a
given project id.  However, the big kicker is that the only filesystems
you can actually set a projectid on (via the fsxattr) are ext4 and xfs.
 That's too few to make it work universally (we'd at least need btrfs
and possibly a few others).

However, that's just mechanism.  We can begin with a volatile mark and
work out how we want to store it later.  I think following projectid
properties is the important one, so the choice of whether to hijack, or
attach to projectid is preserved but not mandated.

James



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-08 Thread Konstantin Khlebnikov

On 08.02.2017 09:44, Amir Goldstein wrote:

On Wed, Feb 8, 2017 at 1:42 AM, James Bottomley
 wrote:

On Tue, 2017-02-07 at 14:25 -0800, Christoph Hellwig wrote:

On Tue, Feb 07, 2017 at 11:01:29PM +0200, Amir Goldstein wrote:

Project id's are not exactly "subtree" semantic, but inheritance
semantics,
which is not the same when non empty directories get their project
id changed.
Here is a recap:
https://lwn.net/Articles/623835/


Yes - but if we abuse them for containers we could refine the
semantics to simply not allow change of project ids from inside
containers based on say capabilities.




You mean something like this:
https://lwn.net/Articles/632917/

With the suggested protected_projects, projid 0 (also inside container)
gets a special meaning, much like user 0, so we may do interesting
things with the projid that is mapped to 0.


We can't really abuse projectid, it's part of the user namespace
mapping (for project quota).  What we can do is have a new id that
behaves like it.



Perhaps we *can* use projid without abusing it.
userns already maps projids, but there is no concept of "owning project"
for a userns, nor does it make a lot of sense, because projid is not
part of the credentials.
But if we re-brand it as "container root projid", we can try to use it
for defining semantics to grant unprivileged access to a subtree.

The functionality you are trying to get with shiftfs mark does
sounds a bit like "container root projid":
- inodes with mapped projid MAY be uid/gid shifted
- inodes with unmapped projid MAY NOT

I realize this may be very raw, but its a start. If you like this
direction we can try to develop it.


But like I said, we don't really need a ful ID, it would basically just
be a single bit mark to say remap or not when doing permission checks
against this inode.  It would follow some of the project id semantics
(like inheritance from parent dir)



But a single bit would only work for single level of userns nesting won't it?



I guess we should define the semantics for the required sub-tree
marking, before we can talk about solutions.


Good plan.


So I've been thinking about how to do this without subtree marking and
yet retain the subtree properties similar to project id.  The advantage
would be that if it can be done using only inode properties, then none
of the permission prototypes need change.  The only real subtree
property we need is ability to bind into an unprivileged mount
namespace, but we already have that.  The gotcha about marking inodes
is that they're all or nothing, so every subtree that gets access to
the inode inherits the mark.  This means that we cannot allow a user
access to a marked inode without the cover of an unprivileged user
namespace, but I think that's fixable in the permission check
(basically if the inode is marked you *only* get access if you have a
user_ns != init_user_ns and we do the permission shifts or you have
user_ns == init_user_ns and you are admin capable).



I didn't follow, but it sounds like your proposed solutions is only
good for single level of userns nesting.
Do you think you can redefine it in terms of "container root projid".



Looks like all this started from mangling uid/gid or some other metadata.
As usual, I have to propose funny/insane solutions:
proxify filesystem with fuse and mangle everything in userspace.
Or add some kind of userspace-driver remapping/mangling into overlay,
for example using BPF script (I see it everywhere nowdays).


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-08 Thread Konstantin Khlebnikov

On 08.02.2017 09:44, Amir Goldstein wrote:

On Wed, Feb 8, 2017 at 1:42 AM, James Bottomley
 wrote:

On Tue, 2017-02-07 at 14:25 -0800, Christoph Hellwig wrote:

On Tue, Feb 07, 2017 at 11:01:29PM +0200, Amir Goldstein wrote:

Project id's are not exactly "subtree" semantic, but inheritance
semantics,
which is not the same when non empty directories get their project
id changed.
Here is a recap:
https://lwn.net/Articles/623835/


Yes - but if we abuse them for containers we could refine the
semantics to simply not allow change of project ids from inside
containers based on say capabilities.




You mean something like this:
https://lwn.net/Articles/632917/

With the suggested protected_projects, projid 0 (also inside container)
gets a special meaning, much like user 0, so we may do interesting
things with the projid that is mapped to 0.


We can't really abuse projectid, it's part of the user namespace
mapping (for project quota).  What we can do is have a new id that
behaves like it.



Perhaps we *can* use projid without abusing it.
userns already maps projids, but there is no concept of "owning project"
for a userns, nor does it make a lot of sense, because projid is not
part of the credentials.
But if we re-brand it as "container root projid", we can try to use it
for defining semantics to grant unprivileged access to a subtree.

The functionality you are trying to get with shiftfs mark does
sounds a bit like "container root projid":
- inodes with mapped projid MAY be uid/gid shifted
- inodes with unmapped projid MAY NOT

I realize this may be very raw, but its a start. If you like this
direction we can try to develop it.


But like I said, we don't really need a ful ID, it would basically just
be a single bit mark to say remap or not when doing permission checks
against this inode.  It would follow some of the project id semantics
(like inheritance from parent dir)



But a single bit would only work for single level of userns nesting won't it?



I guess we should define the semantics for the required sub-tree
marking, before we can talk about solutions.


Good plan.


So I've been thinking about how to do this without subtree marking and
yet retain the subtree properties similar to project id.  The advantage
would be that if it can be done using only inode properties, then none
of the permission prototypes need change.  The only real subtree
property we need is ability to bind into an unprivileged mount
namespace, but we already have that.  The gotcha about marking inodes
is that they're all or nothing, so every subtree that gets access to
the inode inherits the mark.  This means that we cannot allow a user
access to a marked inode without the cover of an unprivileged user
namespace, but I think that's fixable in the permission check
(basically if the inode is marked you *only* get access if you have a
user_ns != init_user_ns and we do the permission shifts or you have
user_ns == init_user_ns and you are admin capable).



I didn't follow, but it sounds like your proposed solutions is only
good for single level of userns nesting.
Do you think you can redefine it in terms of "container root projid".



Looks like all this started from mangling uid/gid or some other metadata.
As usual, I have to propose funny/insane solutions:
proxify filesystem with fuse and mangle everything in userspace.
Or add some kind of userspace-driver remapping/mangling into overlay,
for example using BPF script (I see it everywhere nowdays).


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-07 Thread Amir Goldstein
On Wed, Feb 8, 2017 at 1:42 AM, James Bottomley
 wrote:
> On Tue, 2017-02-07 at 14:25 -0800, Christoph Hellwig wrote:
>> On Tue, Feb 07, 2017 at 11:01:29PM +0200, Amir Goldstein wrote:
>> > Project id's are not exactly "subtree" semantic, but inheritance
>> > semantics,
>> > which is not the same when non empty directories get their project
>> > id changed.
>> > Here is a recap:
>> > https://lwn.net/Articles/623835/
>>
>> Yes - but if we abuse them for containers we could refine the
>> semantics to simply not allow change of project ids from inside
>> containers based on say capabilities.
>

You mean something like this:
https://lwn.net/Articles/632917/

With the suggested protected_projects, projid 0 (also inside container)
gets a special meaning, much like user 0, so we may do interesting
things with the projid that is mapped to 0.

> We can't really abuse projectid, it's part of the user namespace
> mapping (for project quota).  What we can do is have a new id that
> behaves like it.
>

Perhaps we *can* use projid without abusing it.
userns already maps projids, but there is no concept of "owning project"
for a userns, nor does it make a lot of sense, because projid is not
part of the credentials.
But if we re-brand it as "container root projid", we can try to use it
for defining semantics to grant unprivileged access to a subtree.

The functionality you are trying to get with shiftfs mark does
sounds a bit like "container root projid":
- inodes with mapped projid MAY be uid/gid shifted
- inodes with unmapped projid MAY NOT

I realize this may be very raw, but its a start. If you like this
direction we can try to develop it.

> But like I said, we don't really need a ful ID, it would basically just
> be a single bit mark to say remap or not when doing permission checks
> against this inode.  It would follow some of the project id semantics
> (like inheritance from parent dir)
>

But a single bit would only work for single level of userns nesting won't it?


>> > I guess we should define the semantics for the required sub-tree
>> > marking, before we can talk about solutions.
>>
>> Good plan.
>
> So I've been thinking about how to do this without subtree marking and
> yet retain the subtree properties similar to project id.  The advantage
> would be that if it can be done using only inode properties, then none
> of the permission prototypes need change.  The only real subtree
> property we need is ability to bind into an unprivileged mount
> namespace, but we already have that.  The gotcha about marking inodes
> is that they're all or nothing, so every subtree that gets access to
> the inode inherits the mark.  This means that we cannot allow a user
> access to a marked inode without the cover of an unprivileged user
> namespace, but I think that's fixable in the permission check
> (basically if the inode is marked you *only* get access if you have a
> user_ns != init_user_ns and we do the permission shifts or you have
> user_ns == init_user_ns and you are admin capable).
>

I didn't follow, but it sounds like your proposed solutions is only
good for single level of userns nesting.
Do you think you can redefine it in terms of "container root projid".


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-07 Thread Amir Goldstein
On Wed, Feb 8, 2017 at 1:42 AM, James Bottomley
 wrote:
> On Tue, 2017-02-07 at 14:25 -0800, Christoph Hellwig wrote:
>> On Tue, Feb 07, 2017 at 11:01:29PM +0200, Amir Goldstein wrote:
>> > Project id's are not exactly "subtree" semantic, but inheritance
>> > semantics,
>> > which is not the same when non empty directories get their project
>> > id changed.
>> > Here is a recap:
>> > https://lwn.net/Articles/623835/
>>
>> Yes - but if we abuse them for containers we could refine the
>> semantics to simply not allow change of project ids from inside
>> containers based on say capabilities.
>

You mean something like this:
https://lwn.net/Articles/632917/

With the suggested protected_projects, projid 0 (also inside container)
gets a special meaning, much like user 0, so we may do interesting
things with the projid that is mapped to 0.

> We can't really abuse projectid, it's part of the user namespace
> mapping (for project quota).  What we can do is have a new id that
> behaves like it.
>

Perhaps we *can* use projid without abusing it.
userns already maps projids, but there is no concept of "owning project"
for a userns, nor does it make a lot of sense, because projid is not
part of the credentials.
But if we re-brand it as "container root projid", we can try to use it
for defining semantics to grant unprivileged access to a subtree.

The functionality you are trying to get with shiftfs mark does
sounds a bit like "container root projid":
- inodes with mapped projid MAY be uid/gid shifted
- inodes with unmapped projid MAY NOT

I realize this may be very raw, but its a start. If you like this
direction we can try to develop it.

> But like I said, we don't really need a ful ID, it would basically just
> be a single bit mark to say remap or not when doing permission checks
> against this inode.  It would follow some of the project id semantics
> (like inheritance from parent dir)
>

But a single bit would only work for single level of userns nesting won't it?


>> > I guess we should define the semantics for the required sub-tree
>> > marking, before we can talk about solutions.
>>
>> Good plan.
>
> So I've been thinking about how to do this without subtree marking and
> yet retain the subtree properties similar to project id.  The advantage
> would be that if it can be done using only inode properties, then none
> of the permission prototypes need change.  The only real subtree
> property we need is ability to bind into an unprivileged mount
> namespace, but we already have that.  The gotcha about marking inodes
> is that they're all or nothing, so every subtree that gets access to
> the inode inherits the mark.  This means that we cannot allow a user
> access to a marked inode without the cover of an unprivileged user
> namespace, but I think that's fixable in the permission check
> (basically if the inode is marked you *only* get access if you have a
> user_ns != init_user_ns and we do the permission shifts or you have
> user_ns == init_user_ns and you are admin capable).
>

I didn't follow, but it sounds like your proposed solutions is only
good for single level of userns nesting.
Do you think you can redefine it in terms of "container root projid".


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-07 Thread Josh Triplett
On Tue, Feb 07, 2017 at 11:49:33AM -0800, Christoph Hellwig wrote:
> On Tue, Feb 07, 2017 at 11:02:03AM -0800, James Bottomley wrote:
> > >   Another option would be to require something like a project as used 
> > > for project quotas as the root.  This would also be conveniant as it 
> > > could storge the used remapping tables.
> > 
> > So this would be like the current project quota except set on a
> > subtree?  I could see it being done that way but I don't see what
> > advantage it has over using flags in the subtree itself (the mapping is
> > known based on the mount namespace, so there's really only a single bit
> > of information to store).
> 
> projects (which are the underling concept for project quotas) are
> per-subtree in practice - the flag is set on an inode and then
> all directories and files underneath inherit the project ID,
> hardlinking outside a project is prohinited.

I'm interested in having a VFS-level way to do more than just a shift;
I'd like to be able to arbitrarily remap IDs between what's on disk and
the system IDs.  If we're talking about developing a VFS-level solution
for this, I'd like to avoid limiting it to just a shift.  (A shift/range
would definitely be the simplest solution for many common container
cases, but not all.)


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-07 Thread Josh Triplett
On Tue, Feb 07, 2017 at 11:49:33AM -0800, Christoph Hellwig wrote:
> On Tue, Feb 07, 2017 at 11:02:03AM -0800, James Bottomley wrote:
> > >   Another option would be to require something like a project as used 
> > > for project quotas as the root.  This would also be conveniant as it 
> > > could storge the used remapping tables.
> > 
> > So this would be like the current project quota except set on a
> > subtree?  I could see it being done that way but I don't see what
> > advantage it has over using flags in the subtree itself (the mapping is
> > known based on the mount namespace, so there's really only a single bit
> > of information to store).
> 
> projects (which are the underling concept for project quotas) are
> per-subtree in practice - the flag is set on an inode and then
> all directories and files underneath inherit the project ID,
> hardlinking outside a project is prohinited.

I'm interested in having a VFS-level way to do more than just a shift;
I'd like to be able to arbitrarily remap IDs between what's on disk and
the system IDs.  If we're talking about developing a VFS-level solution
for this, I'd like to avoid limiting it to just a shift.  (A shift/range
would definitely be the simplest solution for many common container
cases, but not all.)


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-07 Thread James Bottomley
On Tue, 2017-02-07 at 14:25 -0800, Christoph Hellwig wrote:
> On Tue, Feb 07, 2017 at 11:01:29PM +0200, Amir Goldstein wrote:
> > Project id's are not exactly "subtree" semantic, but inheritance
> > semantics,
> > which is not the same when non empty directories get their project
> > id changed.
> > Here is a recap:
> > https://lwn.net/Articles/623835/
> 
> Yes - but if we abuse them for containers we could refine the 
> semantics to simply not allow change of project ids from inside 
> containers based on say capabilities.

We can't really abuse projectid, it's part of the user namespace
mapping (for project quota).  What we can do is have a new id that
behaves like it.

But like I said, we don't really need a ful ID, it would basically just
be a single bit mark to say remap or not when doing permission checks
against this inode.  It would follow some of the project id semantics
(like inheritance from parent dir)

> > I guess we should define the semantics for the required sub-tree 
> > marking, before we can talk about solutions.
> 
> Good plan.

So I've been thinking about how to do this without subtree marking and
yet retain the subtree properties similar to project id.  The advantage
would be that if it can be done using only inode properties, then none
of the permission prototypes need change.  The only real subtree
property we need is ability to bind into an unprivileged mount
namespace, but we already have that.  The gotcha about marking inodes
is that they're all or nothing, so every subtree that gets access to
the inode inherits the mark.  This means that we cannot allow a user
access to a marked inode without the cover of an unprivileged user
namespace, but I think that's fixable in the permission check
(basically if the inode is marked you *only* get access if you have a
user_ns != init_user_ns and we do the permission shifts or you have
user_ns == init_user_ns and you are admin capable).

James



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-07 Thread James Bottomley
On Tue, 2017-02-07 at 14:25 -0800, Christoph Hellwig wrote:
> On Tue, Feb 07, 2017 at 11:01:29PM +0200, Amir Goldstein wrote:
> > Project id's are not exactly "subtree" semantic, but inheritance
> > semantics,
> > which is not the same when non empty directories get their project
> > id changed.
> > Here is a recap:
> > https://lwn.net/Articles/623835/
> 
> Yes - but if we abuse them for containers we could refine the 
> semantics to simply not allow change of project ids from inside 
> containers based on say capabilities.

We can't really abuse projectid, it's part of the user namespace
mapping (for project quota).  What we can do is have a new id that
behaves like it.

But like I said, we don't really need a ful ID, it would basically just
be a single bit mark to say remap or not when doing permission checks
against this inode.  It would follow some of the project id semantics
(like inheritance from parent dir)

> > I guess we should define the semantics for the required sub-tree 
> > marking, before we can talk about solutions.
> 
> Good plan.

So I've been thinking about how to do this without subtree marking and
yet retain the subtree properties similar to project id.  The advantage
would be that if it can be done using only inode properties, then none
of the permission prototypes need change.  The only real subtree
property we need is ability to bind into an unprivileged mount
namespace, but we already have that.  The gotcha about marking inodes
is that they're all or nothing, so every subtree that gets access to
the inode inherits the mark.  This means that we cannot allow a user
access to a marked inode without the cover of an unprivileged user
namespace, but I think that's fixable in the permission check
(basically if the inode is marked you *only* get access if you have a
user_ns != init_user_ns and we do the permission shifts or you have
user_ns == init_user_ns and you are admin capable).

James



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-07 Thread Christoph Hellwig
On Tue, Feb 07, 2017 at 11:01:29PM +0200, Amir Goldstein wrote:
> Project id's are not exactly "subtree" semantic, but inheritance semantics,
> which is not the same when non empty directories get their project id changed.
> Here is a recap:
> https://lwn.net/Articles/623835/

Yes - but if we abuse them for containers we could refine the semantics
to simply not allow change of project ids from inside containers
based on say capabilities.

> I guess we should define the semantics for the required sub-tree marking,
> before we can talk about solutions.

Good plan.


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-07 Thread Christoph Hellwig
On Tue, Feb 07, 2017 at 11:01:29PM +0200, Amir Goldstein wrote:
> Project id's are not exactly "subtree" semantic, but inheritance semantics,
> which is not the same when non empty directories get their project id changed.
> Here is a recap:
> https://lwn.net/Articles/623835/

Yes - but if we abuse them for containers we could refine the semantics
to simply not allow change of project ids from inside containers
based on say capabilities.

> I guess we should define the semantics for the required sub-tree marking,
> before we can talk about solutions.

Good plan.


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-07 Thread Amir Goldstein
On Tue, Feb 7, 2017 at 10:05 PM, James Bottomley
 wrote:
> On Tue, 2017-02-07 at 11:49 -0800, Christoph Hellwig wrote:
>> On Tue, Feb 07, 2017 at 11:02:03AM -0800, James Bottomley wrote:
>> > >   Another option would be to require something like a project as
>> > > used
>> > > for project quotas as the root.  This would also be conveniant as
>> > > it
>> > > could storge the used remapping tables.
>> >
>> > So this would be like the current project quota except set on a
>> > subtree?  I could see it being done that way but I don't see what
>> > advantage it has over using flags in the subtree itself (the
>> > mapping is
>> > known based on the mount namespace, so there's really only a single
>> > bit
>> > of information to store).
>>
>> projects (which are the underling concept for project quotas) are
>> per-subtree in practice - the flag is set on an inode and then
>> all directories and files underneath inherit the project ID,
>> hardlinking outside a project is prohinited.
>
> OK, this is what I don't understand: how is something that's inode
> based limited to be per-subtree?  The way I've seen the VFS operate it
> seems that any given inode (and indeed dentry) can appear in many
> subtrees so how do I limit them to just one?
>

Project id's are not exactly "subtree" semantic, but inheritance semantics,
which is not the same when non empty directories get their project id changed.
Here is a recap:
https://lwn.net/Articles/623835/

So if you created an empty directory and "marked" it for shiftuid and all
descendants inherited this property you would be able to check that property
on a per inode basis. Not sure that is what you are looking for?
I guess we should define the semantics for the required sub-tree marking,
before we can talk about solutions.


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-07 Thread Amir Goldstein
On Tue, Feb 7, 2017 at 10:05 PM, James Bottomley
 wrote:
> On Tue, 2017-02-07 at 11:49 -0800, Christoph Hellwig wrote:
>> On Tue, Feb 07, 2017 at 11:02:03AM -0800, James Bottomley wrote:
>> > >   Another option would be to require something like a project as
>> > > used
>> > > for project quotas as the root.  This would also be conveniant as
>> > > it
>> > > could storge the used remapping tables.
>> >
>> > So this would be like the current project quota except set on a
>> > subtree?  I could see it being done that way but I don't see what
>> > advantage it has over using flags in the subtree itself (the
>> > mapping is
>> > known based on the mount namespace, so there's really only a single
>> > bit
>> > of information to store).
>>
>> projects (which are the underling concept for project quotas) are
>> per-subtree in practice - the flag is set on an inode and then
>> all directories and files underneath inherit the project ID,
>> hardlinking outside a project is prohinited.
>
> OK, this is what I don't understand: how is something that's inode
> based limited to be per-subtree?  The way I've seen the VFS operate it
> seems that any given inode (and indeed dentry) can appear in many
> subtrees so how do I limit them to just one?
>

Project id's are not exactly "subtree" semantic, but inheritance semantics,
which is not the same when non empty directories get their project id changed.
Here is a recap:
https://lwn.net/Articles/623835/

So if you created an empty directory and "marked" it for shiftuid and all
descendants inherited this property you would be able to check that property
on a per inode basis. Not sure that is what you are looking for?
I guess we should define the semantics for the required sub-tree marking,
before we can talk about solutions.


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-07 Thread James Bottomley
On Tue, 2017-02-07 at 11:49 -0800, Christoph Hellwig wrote:
> On Tue, Feb 07, 2017 at 11:02:03AM -0800, James Bottomley wrote:
> > >   Another option would be to require something like a project as
> > > used 
> > > for project quotas as the root.  This would also be conveniant as
> > > it 
> > > could storge the used remapping tables.
> > 
> > So this would be like the current project quota except set on a
> > subtree?  I could see it being done that way but I don't see what
> > advantage it has over using flags in the subtree itself (the
> > mapping is
> > known based on the mount namespace, so there's really only a single
> > bit
> > of information to store).
> 
> projects (which are the underling concept for project quotas) are
> per-subtree in practice - the flag is set on an inode and then
> all directories and files underneath inherit the project ID,
> hardlinking outside a project is prohinited.

OK, this is what I don't understand: how is something that's inode
based limited to be per-subtree?  The way I've seen the VFS operate it
seems that any given inode (and indeed dentry) can appear in many
subtrees so how do I limit them to just one?

James



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-07 Thread James Bottomley
On Tue, 2017-02-07 at 11:49 -0800, Christoph Hellwig wrote:
> On Tue, Feb 07, 2017 at 11:02:03AM -0800, James Bottomley wrote:
> > >   Another option would be to require something like a project as
> > > used 
> > > for project quotas as the root.  This would also be conveniant as
> > > it 
> > > could storge the used remapping tables.
> > 
> > So this would be like the current project quota except set on a
> > subtree?  I could see it being done that way but I don't see what
> > advantage it has over using flags in the subtree itself (the
> > mapping is
> > known based on the mount namespace, so there's really only a single
> > bit
> > of information to store).
> 
> projects (which are the underling concept for project quotas) are
> per-subtree in practice - the flag is set on an inode and then
> all directories and files underneath inherit the project ID,
> hardlinking outside a project is prohinited.

OK, this is what I don't understand: how is something that's inode
based limited to be per-subtree?  The way I've seen the VFS operate it
seems that any given inode (and indeed dentry) can appear in many
subtrees so how do I limit them to just one?

James



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-07 Thread Christoph Hellwig
On Tue, Feb 07, 2017 at 11:02:03AM -0800, James Bottomley wrote:
> >   Another option would be to require something like a project as used 
> > for project quotas as the root.  This would also be conveniant as it 
> > could storge the used remapping tables.
> 
> So this would be like the current project quota except set on a
> subtree?  I could see it being done that way but I don't see what
> advantage it has over using flags in the subtree itself (the mapping is
> known based on the mount namespace, so there's really only a single bit
> of information to store).

projects (which are the underling concept for project quotas) are
per-subtree in practice - the flag is set on an inode and then
all directories and files underneath inherit the project ID,
hardlinking outside a project is prohinited.


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-07 Thread Christoph Hellwig
On Tue, Feb 07, 2017 at 11:02:03AM -0800, James Bottomley wrote:
> >   Another option would be to require something like a project as used 
> > for project quotas as the root.  This would also be conveniant as it 
> > could storge the used remapping tables.
> 
> So this would be like the current project quota except set on a
> subtree?  I could see it being done that way but I don't see what
> advantage it has over using flags in the subtree itself (the mapping is
> known based on the mount namespace, so there's really only a single bit
> of information to store).

projects (which are the underling concept for project quotas) are
per-subtree in practice - the flag is set on an inode and then
all directories and files underneath inherit the project ID,
hardlinking outside a project is prohinited.


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-07 Thread Christoph Hellwig
On Tue, Feb 07, 2017 at 11:01:08AM -0800, James Bottomley wrote:
> I was talking about inode stability in the filesystem underlying the
> export.  I believe you're talking about inode number stability
> guarantees of the nfs client code itself, which are unrelated to the
> inode number guarantees of the exported filesystem?

They are 1:1 correlated for a Linux server at least.


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-07 Thread Christoph Hellwig
On Tue, Feb 07, 2017 at 11:01:08AM -0800, James Bottomley wrote:
> I was talking about inode stability in the filesystem underlying the
> export.  I believe you're talking about inode number stability
> guarantees of the nfs client code itself, which are unrelated to the
> inode number guarantees of the exported filesystem?

They are 1:1 correlated for a Linux server at least.


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-07 Thread Djalal Harouni
On Tue, Feb 7, 2017 at 7:20 PM, James Bottomley
 wrote:
> On Tue, 2017-02-07 at 19:59 +0200, Amir Goldstein wrote:
>> On Tue, Feb 7, 2017 at 6:37 PM, James Bottomley
>>  wrote:
>> > On Tue, 2017-02-07 at 01:19 -0800, Christoph Hellwig wrote:
>> > > On Sat, Feb 04, 2017 at 11:19:32AM -0800, James Bottomley wrote:
>> > > > This allows any subtree to be uid/gid shifted and bound
>> > > > elsewhere.
>> > > >  It does this by operating simlarly to overlayfs.  Its primary
>> > > > use
>> > > > is for shifting the underlying uids of filesystems used to
>> > > > support
>> > > > unpriviliged (uid shifted) containers.  The usual use case here
>> > > > is
>> > > > that the container is operating with an uid shifted
>> > > > unprivileged
>> > > > root but sometimes needs to make use of or work with a
>> > > > filesystem
>> > > > image that has root at real uid 0.
>> > > >
>> > > > The mechanism is to allow any subordinate mount namespace to
>> > > > mount
>> > > > a shiftfs filesystem (by marking it FS_USERNS_MOUNT) but only
>> > > > allowing it to mount marked subtrees (using the -o mark option
>> > > > as
>> > > > root).  Once mounted, the subtree is mapped via the super block
>> > > > user namespace so that the interior ids of the mounting user
>> > > > namespace are the ids written to the filesystem.
>> > >
>> > > Please move this into VFS instead of a stackable fs.  We might
>> > > need
>> > > addtional parameters to getattr/setattr to specify the ID
>> > > translation, but that's why better than a horrible hack like
>> > > this.
>> >
>> > I would need a lot more than that: getattr controls the cosmetic
>> > permission display to the user, but enforcement is done in the core
>> > permission checks which are inode based.  To make this a real bind
>> > mount, the core permission checks will have to become subtree aware
>> > because knowledge of whether we need a uid shift in the permission
>> > check becomes a subtree property.  Effectively inode_permission
>> > would
>> > become dentry_permission and generic_permission would take a dentry
>> > instead of an inode.  This will be a huge amount of VFS and
>> > underlying
>> > filesystem churn, since the permissions calls are threaded through
>> > a
>> > huge chunk of code.
>> >
>>
>> I am not even sure that would be enough.
>> dentry does not contain information about the mount user came from,
>> and sb contains only information about the user ns of the mounter of
>> the file system, not the mounter of the bind mount, right?
>> I think I am missing some big pieces of the big picture.
>> Would love to hear what Eric has to say.
>
> I'm not really sure until it gets prototyped, but I think the
> filesystem user namespace would also have to become a subtree property.

Sorry I don't want to derail the thread, but that was already prototyped

> The whole reason for shiftfs being a properly mounted filesystem is
> because it needs a super block to capture the namespace it's being
> mounted in.
>
> However, when you have a container that you want remapping inside, you
> must have a user namespace which owns a mount namespace, so we can
> deduce the information from the mount namespace.  All we probably need
> the subtree to tell us is if we're shifting or not.

That's one of the use cases that you will definitely end up with... if
anyone did read that incomplete VFS RFC proposal:

"2) The solution is based on VFS and mount namespaces, we use the user
namespace of the containing mount namespace to check if we should shift
UIDs/GIDs from/to virtual <=> on-disk view.
If a filesystem was mounted with "vfs_shift_uids" and "vfs_shift_gids"
options, and if it shows up inside a mount namespace that supports VFS
UIDs/GIDs shifts then during each access we will remap UID/GID either
to virtual or to on-disk view using simple helper functions to allow the
access. In case the mount or current mount namespace do not support VFS
UID/GID shifts, we fallback to the old behaviour, no shift is performed." [1]

[1] https://lkml.org/lkml/2016/5/4/411



-- 
tixxdz


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-07 Thread Djalal Harouni
On Tue, Feb 7, 2017 at 7:20 PM, James Bottomley
 wrote:
> On Tue, 2017-02-07 at 19:59 +0200, Amir Goldstein wrote:
>> On Tue, Feb 7, 2017 at 6:37 PM, James Bottomley
>>  wrote:
>> > On Tue, 2017-02-07 at 01:19 -0800, Christoph Hellwig wrote:
>> > > On Sat, Feb 04, 2017 at 11:19:32AM -0800, James Bottomley wrote:
>> > > > This allows any subtree to be uid/gid shifted and bound
>> > > > elsewhere.
>> > > >  It does this by operating simlarly to overlayfs.  Its primary
>> > > > use
>> > > > is for shifting the underlying uids of filesystems used to
>> > > > support
>> > > > unpriviliged (uid shifted) containers.  The usual use case here
>> > > > is
>> > > > that the container is operating with an uid shifted
>> > > > unprivileged
>> > > > root but sometimes needs to make use of or work with a
>> > > > filesystem
>> > > > image that has root at real uid 0.
>> > > >
>> > > > The mechanism is to allow any subordinate mount namespace to
>> > > > mount
>> > > > a shiftfs filesystem (by marking it FS_USERNS_MOUNT) but only
>> > > > allowing it to mount marked subtrees (using the -o mark option
>> > > > as
>> > > > root).  Once mounted, the subtree is mapped via the super block
>> > > > user namespace so that the interior ids of the mounting user
>> > > > namespace are the ids written to the filesystem.
>> > >
>> > > Please move this into VFS instead of a stackable fs.  We might
>> > > need
>> > > addtional parameters to getattr/setattr to specify the ID
>> > > translation, but that's why better than a horrible hack like
>> > > this.
>> >
>> > I would need a lot more than that: getattr controls the cosmetic
>> > permission display to the user, but enforcement is done in the core
>> > permission checks which are inode based.  To make this a real bind
>> > mount, the core permission checks will have to become subtree aware
>> > because knowledge of whether we need a uid shift in the permission
>> > check becomes a subtree property.  Effectively inode_permission
>> > would
>> > become dentry_permission and generic_permission would take a dentry
>> > instead of an inode.  This will be a huge amount of VFS and
>> > underlying
>> > filesystem churn, since the permissions calls are threaded through
>> > a
>> > huge chunk of code.
>> >
>>
>> I am not even sure that would be enough.
>> dentry does not contain information about the mount user came from,
>> and sb contains only information about the user ns of the mounter of
>> the file system, not the mounter of the bind mount, right?
>> I think I am missing some big pieces of the big picture.
>> Would love to hear what Eric has to say.
>
> I'm not really sure until it gets prototyped, but I think the
> filesystem user namespace would also have to become a subtree property.

Sorry I don't want to derail the thread, but that was already prototyped

> The whole reason for shiftfs being a properly mounted filesystem is
> because it needs a super block to capture the namespace it's being
> mounted in.
>
> However, when you have a container that you want remapping inside, you
> must have a user namespace which owns a mount namespace, so we can
> deduce the information from the mount namespace.  All we probably need
> the subtree to tell us is if we're shifting or not.

That's one of the use cases that you will definitely end up with... if
anyone did read that incomplete VFS RFC proposal:

"2) The solution is based on VFS and mount namespaces, we use the user
namespace of the containing mount namespace to check if we should shift
UIDs/GIDs from/to virtual <=> on-disk view.
If a filesystem was mounted with "vfs_shift_uids" and "vfs_shift_gids"
options, and if it shows up inside a mount namespace that supports VFS
UIDs/GIDs shifts then during each access we will remap UID/GID either
to virtual or to on-disk view using simple helper functions to allow the
access. In case the mount or current mount namespace do not support VFS
UID/GID shifts, we fallback to the old behaviour, no shift is performed." [1]

[1] https://lkml.org/lkml/2016/5/4/411



-- 
tixxdz


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-07 Thread James Bottomley
On Tue, 2017-02-07 at 10:10 -0800, Christoph Hellwig wrote:
> On Tue, Feb 07, 2017 at 07:59:00PM +0200, Amir Goldstein wrote:
> > I am not even sure that would be enough.
> > dentry does not contain information about the mount user came from,
> > and sb contains only information about the user ns of the mounter
> > of
> > the file system, not the mounter of the bind mount, right?
> > I think I am missing some big pieces of the big picture.
> > Would love to hear what Eric has to say.
> 
> IFF we want to do what shiftfs does properly we need vfsmount + 
> inode, no need for the dentry.

Yes, sorry ... I was thinking the dentry contained the mnt, but it
doesn't, that's the path.  However, threading the mnt through looks
substantially harder.

> But maybe we need to go back and decice if we want to allow uid/gid
> remapping for arbitrary subtrees anyway.

So those were the original patches Djalal was referring to.  The
problem there is that a lot of orchestration systems don't store images
they want to bind mount into containers on separately mounted
filesystems, which is what's needed to avoid this being per-subtree. 
 However, the clinching argument for me is that the canonical container
image *is* a subtree (unlike a vm image which has to be mounted).  If
we don't make this work on subtrees people go back to daft stacks for
containers like copying the image subtree into a loopback mounted
filesystem just to make this all work (and then complain about
performance and caching and so on).

>   Another option would be to require something like a project as used 
> for project quotas as the root.  This would also be conveniant as it 
> could storge the used remapping tables.

So this would be like the current project quota except set on a
subtree?  I could see it being done that way but I don't see what
advantage it has over using flags in the subtree itself (the mapping is
known based on the mount namespace, so there's really only a single bit
of information to store).

James



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-07 Thread James Bottomley
On Tue, 2017-02-07 at 10:10 -0800, Christoph Hellwig wrote:
> On Tue, Feb 07, 2017 at 07:59:00PM +0200, Amir Goldstein wrote:
> > I am not even sure that would be enough.
> > dentry does not contain information about the mount user came from,
> > and sb contains only information about the user ns of the mounter
> > of
> > the file system, not the mounter of the bind mount, right?
> > I think I am missing some big pieces of the big picture.
> > Would love to hear what Eric has to say.
> 
> IFF we want to do what shiftfs does properly we need vfsmount + 
> inode, no need for the dentry.

Yes, sorry ... I was thinking the dentry contained the mnt, but it
doesn't, that's the path.  However, threading the mnt through looks
substantially harder.

> But maybe we need to go back and decice if we want to allow uid/gid
> remapping for arbitrary subtrees anyway.

So those were the original patches Djalal was referring to.  The
problem there is that a lot of orchestration systems don't store images
they want to bind mount into containers on separately mounted
filesystems, which is what's needed to avoid this being per-subtree. 
 However, the clinching argument for me is that the canonical container
image *is* a subtree (unlike a vm image which has to be mounted).  If
we don't make this work on subtrees people go back to daft stacks for
containers like copying the image subtree into a loopback mounted
filesystem just to make this all work (and then complain about
performance and caching and so on).

>   Another option would be to require something like a project as used 
> for project quotas as the root.  This would also be conveniant as it 
> could storge the used remapping tables.

So this would be like the current project quota except set on a
subtree?  I could see it being done that way but I don't see what
advantage it has over using flags in the subtree itself (the mapping is
known based on the mount namespace, so there's really only a single bit
of information to store).

James



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-07 Thread James Bottomley
On Mon, 2017-02-06 at 20:35 -0500, J. Bruce Fields wrote:
> On Mon, Feb 06, 2017 at 04:10:11PM -0800, James Bottomley wrote:
> > On Mon, 2017-02-06 at 16:52 -0500, J. Bruce Fields wrote:
> > > On Mon, Feb 06, 2017 at 07:18:16AM -0800, James Bottomley wrote:
> > > > On Mon, 2017-02-06 at 09:50 -0500, Theodore Ts'o wrote:
> > > > > On Sun, Feb 05, 2017 at 10:46:23PM -0800, James Bottomley
> > > > > wrote:
> > > > > > Yes, I know the problem.  However, I believe most current
> > > > > > linux
> > > > > > filesystems no longer guarantee stable, for the lifetime of
> > > > > > the
> > > > > > file, inode numbers.  The usual docker container root is 
> > > > > > overlayfs, which, similarly doesn't support stable inode 
> > > > > > numbers.  I see the odd complaint about docker with
> > > > > > overlayfs 
> > > > > > having unstable inode numbers, but none seems to have any
> > > > > > serious repercussions.
> > > > > 
> > > > > Um, no.  Most current linux file systems *do* guarantee
> > > > > stable 
> > > > > inode numbers.  For one thing, NFS would break horribly if
> > > > > you 
> > > > > didn't have stable inode numbers.  Never mind applications
> > > > > which 
> > > > > depend on POSIX semantics.  And you wouldn't be able to save 
> > > > > games in rogue or nethack, either.  :-)
> > > > 
> > > > I believe that's why we have the superblock export operations
> > > > to
> > > > manufacture unique filehandles in the absence of inode number
> > > > stability.
> > > 
> > > Where did you hear that?
> > > 
> > > I'd expect an NFS client to handle non-unique filehandles
> > > better than non-unique inode numbers.  I believe our client will 
> > > -EIO 
> > > on encountering an inode number change (see
> > > nfs_check_inode_attributes().)
> > > 
> > > See also https://tools.ietf.org/html/rfc5661#section-10.3.4.
> > 
> > Could you clarify your point a bit further, please?  Both the
> > check_inode_attributes() code and section 10.3.4 are talking about
> > fileids, which are the things that are constructed in the
> > export_ops
> 
> No, the filehandle structure isn't discussed in the rfc at all,
> that's
> opaque to clients, and the "fileid" you see in the export code isn't
> what's discussed here.
> 
> The "fileid" here is an NFS attribute, really just the NFS protocol's
> name for the inode number.  The server code that returns fileid's:
> 
>   if (bmval0 & FATTR4_WORD0_FILEID) {
>   p = xdr_reserve_space(xdr, 8);
>   if (!p)
>   goto out_resource;
>   p = xdr_encode_hyper(p, stat.ino);
>   }
> 
> The client getattr code:
> 
>   stat->ino = nfs_compat_user_ino64(NFS_FILEID(inode));

OK, I now believe we may be talking about different things.  When I
said

> I believe that's why we have the superblock export operations to
> manufacture unique filehandles in the absence of inode number
> stability. 

I was talking about inode stability in the filesystem underlying the
export.  I believe you're talking about inode number stability
guarantees of the nfs client code itself, which are unrelated to the
inode number guarantees of the exported filesystem?

James



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-07 Thread James Bottomley
On Mon, 2017-02-06 at 20:35 -0500, J. Bruce Fields wrote:
> On Mon, Feb 06, 2017 at 04:10:11PM -0800, James Bottomley wrote:
> > On Mon, 2017-02-06 at 16:52 -0500, J. Bruce Fields wrote:
> > > On Mon, Feb 06, 2017 at 07:18:16AM -0800, James Bottomley wrote:
> > > > On Mon, 2017-02-06 at 09:50 -0500, Theodore Ts'o wrote:
> > > > > On Sun, Feb 05, 2017 at 10:46:23PM -0800, James Bottomley
> > > > > wrote:
> > > > > > Yes, I know the problem.  However, I believe most current
> > > > > > linux
> > > > > > filesystems no longer guarantee stable, for the lifetime of
> > > > > > the
> > > > > > file, inode numbers.  The usual docker container root is 
> > > > > > overlayfs, which, similarly doesn't support stable inode 
> > > > > > numbers.  I see the odd complaint about docker with
> > > > > > overlayfs 
> > > > > > having unstable inode numbers, but none seems to have any
> > > > > > serious repercussions.
> > > > > 
> > > > > Um, no.  Most current linux file systems *do* guarantee
> > > > > stable 
> > > > > inode numbers.  For one thing, NFS would break horribly if
> > > > > you 
> > > > > didn't have stable inode numbers.  Never mind applications
> > > > > which 
> > > > > depend on POSIX semantics.  And you wouldn't be able to save 
> > > > > games in rogue or nethack, either.  :-)
> > > > 
> > > > I believe that's why we have the superblock export operations
> > > > to
> > > > manufacture unique filehandles in the absence of inode number
> > > > stability.
> > > 
> > > Where did you hear that?
> > > 
> > > I'd expect an NFS client to handle non-unique filehandles
> > > better than non-unique inode numbers.  I believe our client will 
> > > -EIO 
> > > on encountering an inode number change (see
> > > nfs_check_inode_attributes().)
> > > 
> > > See also https://tools.ietf.org/html/rfc5661#section-10.3.4.
> > 
> > Could you clarify your point a bit further, please?  Both the
> > check_inode_attributes() code and section 10.3.4 are talking about
> > fileids, which are the things that are constructed in the
> > export_ops
> 
> No, the filehandle structure isn't discussed in the rfc at all,
> that's
> opaque to clients, and the "fileid" you see in the export code isn't
> what's discussed here.
> 
> The "fileid" here is an NFS attribute, really just the NFS protocol's
> name for the inode number.  The server code that returns fileid's:
> 
>   if (bmval0 & FATTR4_WORD0_FILEID) {
>   p = xdr_reserve_space(xdr, 8);
>   if (!p)
>   goto out_resource;
>   p = xdr_encode_hyper(p, stat.ino);
>   }
> 
> The client getattr code:
> 
>   stat->ino = nfs_compat_user_ino64(NFS_FILEID(inode));

OK, I now believe we may be talking about different things.  When I
said

> I believe that's why we have the superblock export operations to
> manufacture unique filehandles in the absence of inode number
> stability. 

I was talking about inode stability in the filesystem underlying the
export.  I believe you're talking about inode number stability
guarantees of the nfs client code itself, which are unrelated to the
inode number guarantees of the exported filesystem?

James



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-07 Thread James Bottomley
On Tue, 2017-02-07 at 19:59 +0200, Amir Goldstein wrote:
> On Tue, Feb 7, 2017 at 6:37 PM, James Bottomley
>  wrote:
> > On Tue, 2017-02-07 at 01:19 -0800, Christoph Hellwig wrote:
> > > On Sat, Feb 04, 2017 at 11:19:32AM -0800, James Bottomley wrote:
> > > > This allows any subtree to be uid/gid shifted and bound
> > > > elsewhere.
> > > >  It does this by operating simlarly to overlayfs.  Its primary
> > > > use
> > > > is for shifting the underlying uids of filesystems used to
> > > > support
> > > > unpriviliged (uid shifted) containers.  The usual use case here
> > > > is
> > > > that the container is operating with an uid shifted
> > > > unprivileged
> > > > root but sometimes needs to make use of or work with a
> > > > filesystem
> > > > image that has root at real uid 0.
> > > > 
> > > > The mechanism is to allow any subordinate mount namespace to
> > > > mount
> > > > a shiftfs filesystem (by marking it FS_USERNS_MOUNT) but only
> > > > allowing it to mount marked subtrees (using the -o mark option
> > > > as
> > > > root).  Once mounted, the subtree is mapped via the super block
> > > > user namespace so that the interior ids of the mounting user
> > > > namespace are the ids written to the filesystem.
> > > 
> > > Please move this into VFS instead of a stackable fs.  We might
> > > need
> > > addtional parameters to getattr/setattr to specify the ID
> > > translation, but that's why better than a horrible hack like
> > > this.
> > 
> > I would need a lot more than that: getattr controls the cosmetic
> > permission display to the user, but enforcement is done in the core
> > permission checks which are inode based.  To make this a real bind
> > mount, the core permission checks will have to become subtree aware
> > because knowledge of whether we need a uid shift in the permission
> > check becomes a subtree property.  Effectively inode_permission
> > would
> > become dentry_permission and generic_permission would take a dentry
> > instead of an inode.  This will be a huge amount of VFS and
> > underlying
> > filesystem churn, since the permissions calls are threaded through
> > a
> > huge chunk of code.
> > 
> 
> I am not even sure that would be enough.
> dentry does not contain information about the mount user came from,
> and sb contains only information about the user ns of the mounter of
> the file system, not the mounter of the bind mount, right?
> I think I am missing some big pieces of the big picture.
> Would love to hear what Eric has to say.

I'm not really sure until it gets prototyped, but I think the
filesystem user namespace would also have to become a subtree property.

The whole reason for shiftfs being a properly mounted filesystem is
because it needs a super block to capture the namespace it's being
mounted in.

However, when you have a container that you want remapping inside, you
must have a user namespace which owns a mount namespace, so we can
deduce the information from the mount namespace.  All we probably need
the subtree to tell us is if we're shifting or not.

James



Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-07 Thread James Bottomley
On Tue, 2017-02-07 at 19:59 +0200, Amir Goldstein wrote:
> On Tue, Feb 7, 2017 at 6:37 PM, James Bottomley
>  wrote:
> > On Tue, 2017-02-07 at 01:19 -0800, Christoph Hellwig wrote:
> > > On Sat, Feb 04, 2017 at 11:19:32AM -0800, James Bottomley wrote:
> > > > This allows any subtree to be uid/gid shifted and bound
> > > > elsewhere.
> > > >  It does this by operating simlarly to overlayfs.  Its primary
> > > > use
> > > > is for shifting the underlying uids of filesystems used to
> > > > support
> > > > unpriviliged (uid shifted) containers.  The usual use case here
> > > > is
> > > > that the container is operating with an uid shifted
> > > > unprivileged
> > > > root but sometimes needs to make use of or work with a
> > > > filesystem
> > > > image that has root at real uid 0.
> > > > 
> > > > The mechanism is to allow any subordinate mount namespace to
> > > > mount
> > > > a shiftfs filesystem (by marking it FS_USERNS_MOUNT) but only
> > > > allowing it to mount marked subtrees (using the -o mark option
> > > > as
> > > > root).  Once mounted, the subtree is mapped via the super block
> > > > user namespace so that the interior ids of the mounting user
> > > > namespace are the ids written to the filesystem.
> > > 
> > > Please move this into VFS instead of a stackable fs.  We might
> > > need
> > > addtional parameters to getattr/setattr to specify the ID
> > > translation, but that's why better than a horrible hack like
> > > this.
> > 
> > I would need a lot more than that: getattr controls the cosmetic
> > permission display to the user, but enforcement is done in the core
> > permission checks which are inode based.  To make this a real bind
> > mount, the core permission checks will have to become subtree aware
> > because knowledge of whether we need a uid shift in the permission
> > check becomes a subtree property.  Effectively inode_permission
> > would
> > become dentry_permission and generic_permission would take a dentry
> > instead of an inode.  This will be a huge amount of VFS and
> > underlying
> > filesystem churn, since the permissions calls are threaded through
> > a
> > huge chunk of code.
> > 
> 
> I am not even sure that would be enough.
> dentry does not contain information about the mount user came from,
> and sb contains only information about the user ns of the mounter of
> the file system, not the mounter of the bind mount, right?
> I think I am missing some big pieces of the big picture.
> Would love to hear what Eric has to say.

I'm not really sure until it gets prototyped, but I think the
filesystem user namespace would also have to become a subtree property.

The whole reason for shiftfs being a properly mounted filesystem is
because it needs a super block to capture the namespace it's being
mounted in.

However, when you have a container that you want remapping inside, you
must have a user namespace which owns a mount namespace, so we can
deduce the information from the mount namespace.  All we probably need
the subtree to tell us is if we're shifting or not.

James



  1   2   >