Re: [git pull] mount API series

2018-12-21 Thread Eric W. Biederman
Al Viro  writes:

> On Mon, Nov 12, 2018 at 08:54:39PM +, Al Viro wrote:
>> On Sun, Nov 11, 2018 at 08:07:20PM -0600, Eric W. Biederman wrote:
>> > Steven Whitehouse  writes:
>> 
>> > > Can you share some details of what this NULL dereference is? David and
>> > > Al have been working on the changes as requested by Linus later in
>> > > this thread, and they'd like to tidy up this issue too at the same
>> > > time if possible. We are not asking you to actually provide a fix, in
>> > > case you are too busy to do so, however it would be good to know what
>> > > the issue is so that we can make sure that it is resolved in the next
>> > > round of patches,
>> > 
>> > https://lore.kernel.org/lkml/87bm7n5k1r@xmission.com/
>> 
>> Thought it had been dealt with, but you are right - oops is still there
>> and obviously needs fixing.  However, looking at that place in mainline...
>> How does that thing manage to work?  Look:
>> /* Notice when we are propagating across user namespaces */
>> if (m->mnt_ns->user_ns != user_ns)
>> type |= CL_UNPRIVILEGED;
>> child = copy_tree(last_source, last_source->mnt.mnt_root, type);
>> if (IS_ERR(child))
>> return PTR_ERR(child);
>> child->mnt.mnt_flags &= ~MNT_LOCKED;
>> mnt_set_mountpoint(m, mp, child);
>> last_dest = m;
>> last_source = child;

[Moved this question up as it's answer is a good foundation for the
rest]

> FWIW, I don't believe that CL_UNPRIVELEGED approach is workable; the logics
> we want is, AFAICS, "when destination is not yours, whatever you attach to
> it should be all locked".  Correct?

Kinda sorta.

There is a tree of privilege.  Mount namespaces owned by a more
privileged user_ns can propagate to slaves in a mount namespace owned by
less privileged user_ns.  Because the required permission for the making
the initial mount are insufficient a mount can not propagate from a less
privileged mount namespace to a more privileged mount namespace.

When propagating from a more privileged mount namespace to a less
privileged mount namespace we want to maintain some properties.

1) That mounts that propogate together are locked together.
2) That the mount flags readonly, nodev, nosuid, and noexec
   when set in a more privileged mount namespace are not changeable
   in a less privileged mount namespace.

>> OK, we'd created the copy to be attached to the next candidate mountpoint.
>> If we have e.g. a 4-element peer group, we'll start with what we'd been
>> asked to mount, then call that sucker three times, getting a copy for
>> each of those mountpoints.  Right?  Now, what happens if the 1st, 3rd and
>> 4th members live in your namespace, with the second one being
>> elsewhere?

As I understand the question we have a bug in the mount propagation
tree.  All peers need to be in mount namespaces that share a user
namespace.

Further copy_tree fundamentally needs to copy from either a peer node to
a peer node or a master node to a slave node to achieve the correct
results when setting up the propagation tree.

So I don't currently see how the logic in propgate_mnt could
possibly get into a problematic situation.

>> We have
>>  source_mnt: that'll go on top of the 1st mountpoint
>>  copy of source_mnt: that'll go on top of the 2nd mountpoint
>>  copy of copy of source_mnt: that'll go on top of the 3rd mountpoint
>>  copy of copy of copy of source_mnt: that'll go on top of the 4th one
>> And AFAICS your logics there has just made sure that everything except the
>> source_mnt will have MNT_LOCK_... all over the place.  IOW, the effect of
>> CL_UNPRIVELEGED is cumulative.
>> 
>> How the hell does that code avoid this randomness?  Note had the members of
>> that peer group been in a different order, you would've gotten a different 
>> result.
>> What am I missing here?

I believe it is that members of a peer group are guaranteed to share
a user namepace, or else they can't be peers.

>> Oops is a separate story, and a regression in its own right; it needs to be
>> fixed.  But I would really like to sort out the semantics of the existing
>> code in that area, so that we don't end up with patch conflicts.
>
> Ping?

Apologies for the delay I have been on my own deep dive and I initially
misunderstood the question.  For some reason I thought you were implying
that f2ebb3a921c1 ("smarter propagate_mnt()") changed the logic and
introduced the bug.  So I didn't understand why you were asking me.

> If so, the natural place to deal with that would be in commit_tree() after
> propagate_mnt() has created all copies, not while creating those copies.
> That, after all, is where we mark all those struct mount as belonging to
> namespace...

Alternatively we could add a new mount propagation flag call it
MNT_PRIV_BORDER that would mean that we don't need to look at mount
namespaces when considering the propgation tree.  We would just need
to notice that the slave we

Re: [git pull] mount API series

2018-12-17 Thread Al Viro
On Mon, Nov 12, 2018 at 08:54:39PM +, Al Viro wrote:
> On Sun, Nov 11, 2018 at 08:07:20PM -0600, Eric W. Biederman wrote:
> > Steven Whitehouse  writes:
> 
> > > Can you share some details of what this NULL dereference is? David and
> > > Al have been working on the changes as requested by Linus later in
> > > this thread, and they'd like to tidy up this issue too at the same
> > > time if possible. We are not asking you to actually provide a fix, in
> > > case you are too busy to do so, however it would be good to know what
> > > the issue is so that we can make sure that it is resolved in the next
> > > round of patches,
> > 
> > https://lore.kernel.org/lkml/87bm7n5k1r@xmission.com/
> 
> Thought it had been dealt with, but you are right - oops is still there
> and obviously needs fixing.  However, looking at that place in mainline...
> How does that thing manage to work?  Look:
> /* Notice when we are propagating across user namespaces */
> if (m->mnt_ns->user_ns != user_ns)
> type |= CL_UNPRIVILEGED;
> child = copy_tree(last_source, last_source->mnt.mnt_root, type);
> if (IS_ERR(child))
> return PTR_ERR(child);
> child->mnt.mnt_flags &= ~MNT_LOCKED;
> mnt_set_mountpoint(m, mp, child);
> last_dest = m;
> last_source = child;
> OK, we'd created the copy to be attached to the next candidate mountpoint.
> If we have e.g. a 4-element peer group, we'll start with what we'd been
> asked to mount, then call that sucker three times, getting a copy for
> each of those mountpoints.  Right?  Now, what happens if the 1st, 3rd and
> 4th members live in your namespace, with the second one being elsewhere?
> We have
>   source_mnt: that'll go on top of the 1st mountpoint
>   copy of source_mnt: that'll go on top of the 2nd mountpoint
>   copy of copy of source_mnt: that'll go on top of the 3rd mountpoint
>   copy of copy of copy of source_mnt: that'll go on top of the 4th one
> And AFAICS your logics there has just made sure that everything except the
> source_mnt will have MNT_LOCK_... all over the place.  IOW, the effect of
> CL_UNPRIVELEGED is cumulative.
> 
> How the hell does that code avoid this randomness?  Note had the members of
> that peer group been in a different order, you would've gotten a different 
> result.
> What am I missing here?
> 
> Oops is a separate story, and a regression in its own right; it needs to be
> fixed.  But I would really like to sort out the semantics of the existing
> code in that area, so that we don't end up with patch conflicts.

Ping?

FWIW, I don't believe that CL_UNPRIVELEGED approach is workable; the logics
we want is, AFAICS, "when destination is not yours, whatever you attach to
it should be all locked".  Correct?

If so, the natural place to deal with that would be in commit_tree() after
propagate_mnt() has created all copies, not while creating those copies.
That, after all, is where we mark all those struct mount as belonging to
namespace...

Again, this is quite independent from the oops you've reported (and that,
BTW, can be triggered without any userns involvement - commit_tree() itself
will oops just fine if parent's ->mnt_ns is NULL, userns or no userns).
I think I understand how to deal with that thing, but it's a separate
story; handling of MNT_LOCK... is a problem that exists in the mainline
and whatever fix we come up with for this one will need to be backportable.

Al "resurfaced from long and thoroughly nasty dive through the LSM gutter
and finally getting around to more pleasant stuff" Viro


Re: [git pull] mount API series

2018-11-12 Thread Al Viro
On Sun, Nov 11, 2018 at 08:07:20PM -0600, Eric W. Biederman wrote:
> Steven Whitehouse  writes:

> > Can you share some details of what this NULL dereference is? David and
> > Al have been working on the changes as requested by Linus later in
> > this thread, and they'd like to tidy up this issue too at the same
> > time if possible. We are not asking you to actually provide a fix, in
> > case you are too busy to do so, however it would be good to know what
> > the issue is so that we can make sure that it is resolved in the next
> > round of patches,
> 
> https://lore.kernel.org/lkml/87bm7n5k1r@xmission.com/

Thought it had been dealt with, but you are right - oops is still there
and obviously needs fixing.  However, looking at that place in mainline...
How does that thing manage to work?  Look:
/* Notice when we are propagating across user namespaces */
if (m->mnt_ns->user_ns != user_ns)
type |= CL_UNPRIVILEGED;
child = copy_tree(last_source, last_source->mnt.mnt_root, type);
if (IS_ERR(child))
return PTR_ERR(child);
child->mnt.mnt_flags &= ~MNT_LOCKED;
mnt_set_mountpoint(m, mp, child);
last_dest = m;
last_source = child;
OK, we'd created the copy to be attached to the next candidate mountpoint.
If we have e.g. a 4-element peer group, we'll start with what we'd been
asked to mount, then call that sucker three times, getting a copy for
each of those mountpoints.  Right?  Now, what happens if the 1st, 3rd and
4th members live in your namespace, with the second one being elsewhere?
We have
source_mnt: that'll go on top of the 1st mountpoint
copy of source_mnt: that'll go on top of the 2nd mountpoint
copy of copy of source_mnt: that'll go on top of the 3rd mountpoint
copy of copy of copy of source_mnt: that'll go on top of the 4th one
And AFAICS your logics there has just made sure that everything except the
source_mnt will have MNT_LOCK_... all over the place.  IOW, the effect of
CL_UNPRIVELEGED is cumulative.

How the hell does that code avoid this randomness?  Note had the members of
that peer group been in a different order, you would've gotten a different 
result.
What am I missing here?

Oops is a separate story, and a regression in its own right; it needs to be
fixed.  But I would really like to sort out the semantics of the existing
code in that area, so that we don't end up with patch conflicts.


Re: [git pull] mount API series

2018-11-11 Thread Eric W. Biederman
Steven Whitehouse  writes:

> Hi,
>
>
> On 31/10/18 15:38, Eric W. Biederman wrote:
>> Al Viro  writes:
>>
>>> mount API series from David Howells.  Last cycle's objections
>>> had been of the "I'd do it differently" variety and with no such
>>> differently done variants having ever materialized over several
>>> cycles...
>> Absolutely not.
>>
>> My objections fundamentally is that I can find real problems when I look
>> at the code.  Further the changes have not been incremental changes that
>> have evolved the code from one state to another but complete
>> replacements of code that make code review very difficult and bisection
>> completely inapplicable.
>>
>> I also object that this series completely fails to fix the worst but I
>> have ever seen in the mount API.  Whit no real intrest shown in working
>> to fix it.
>>
>> A couple of bugs that I can see quickly.  Several of which I have
>> previously reported:
>>
>> - There is an easily triggered NULL pointer deference with open_tree
>>and mount propagation.
>>
>>
> Can you share some details of what this NULL dereference is? David and
> Al have been working on the changes as requested by Linus later in
> this thread, and they'd like to tidy up this issue too at the same
> time if possible. We are not asking you to actually provide a fix, in
> case you are too busy to do so, however it would be good to know what
> the issue is so that we can make sure that it is resolved in the next
> round of patches,

https://lore.kernel.org/lkml/87bm7n5k1r@xmission.com/

Eric


Re: [git pull] mount API series

2018-11-10 Thread Steven Whitehouse

Hi,


On 31/10/18 15:38, Eric W. Biederman wrote:

Al Viro  writes:


mount API series from David Howells.  Last cycle's objections
had been of the "I'd do it differently" variety and with no such
differently done variants having ever materialized over several
cycles...

Absolutely not.

My objections fundamentally is that I can find real problems when I look
at the code.  Further the changes have not been incremental changes that
have evolved the code from one state to another but complete
replacements of code that make code review very difficult and bisection
completely inapplicable.

I also object that this series completely fails to fix the worst but I
have ever seen in the mount API.  Whit no real intrest shown in working
to fix it.

A couple of bugs that I can see quickly.  Several of which I have
previously reported:

- There is an easily triggered NULL pointer deference with open_tree
   and mount propagation.


Can you share some details of what this NULL dereference is? David and 
Al have been working on the changes as requested by Linus later in this 
thread, and they'd like to tidy up this issue too at the same time if 
possible. We are not asking you to actually provide a fix, in case you 
are too busy to do so, however it would be good to know what the issue 
is so that we can make sure that it is resolved in the next round of 
patches,


Steve.



Re: [git pull] mount API series

2018-11-02 Thread Gao Xiang
Hi Al,

On 2018/11/2 12:07, Al Viro wrote:
> On Thu, Nov 01, 2018 at 11:59:23PM +, David Howells wrote:
> 
>>  (*) mount-api-core.  These are the internal-only patches that add the
>>  fs_context, the legacy wrapper and the security hooks and make certain
>>  filesystems make use of it.
> 
> FWIW, while rereading that series I'd spotted something very odd in erofs.
> It's orthogonal to everything else, but just to make sure it doesn't get
> lost:
>   * sbi->dev_name thing in erofs is used only for debugging printks,
> basically.  Just use sb->s_id[] and be done with that.
>   * dump struct erofs_mount_private - you don't need dev_name in
> your erofs_fill_super().  Just use mount_bdev() in usual fashion.

OK, these two points are the same, the original alternative patch to fixup it 
is to use bdevname(),
However I saw what is done in drivers/usb/gadget/function/f_fs.c, therefore I 
fixed in as what I saw in f_fs.c.

Refer:
https://lists.ozlabs.org/pipermail/linux-erofs/2018-September/000548.html
https://lists.ozlabs.org/pipermail/linux-erofs/2018-September/000551.html

I could remove erofs_mount_private entirely if you want. :)

>   * what the hell are you doing with ->s_root???  Why would you
> possibly want it hashed and what kind of dcache lookup could find it?
> That d_rehash() looks deeply confused; what are you trying to do there?
Thanks for pointing out. After I think into this piece of code, I also think 
that is redundant.
I will fix it immediately, thanks again for pointing out.

Thanks,
Gao Xiang

> 


Re: [git pull] mount API series

2018-11-02 Thread Gao Xiang
Hi Al,

On 2018/11/3 3:42, Al Viro wrote:
> On Fri, Nov 02, 2018 at 04:07:01AM +, Al Viro wrote:
>> On Thu, Nov 01, 2018 at 11:59:23PM +, David Howells wrote:
>>
>>>  (*) mount-api-core.  These are the internal-only patches that add the
>>>  fs_context, the legacy wrapper and the security hooks and make certain
>>>  filesystems make use of it.
>>
>> FWIW, while rereading that series I'd spotted something very odd in erofs.
>> It's orthogonal to everything else, but just to make sure it doesn't get
>> lost:
>>  * sbi->dev_name thing in erofs is used only for debugging printks,
>> basically.  Just use sb->s_id[] and be done with that.
>>  * dump struct erofs_mount_private - you don't need dev_name in
>> your erofs_fill_super().  Just use mount_bdev() in usual fashion.
>>  * what the hell are you doing with ->s_root???  Why would you
>> possibly want it hashed and what kind of dcache lookup could find it?
>> That d_rehash() looks deeply confused; what are you trying to do there?
> 
> ... and while we are at it, what happens to
> unsigned int nameoff = le16_to_cpu(de[mid].nameoff);
> unsigned int matched = min(startprfx, endprfx);
> 
> struct qstr dname = QSTR_INIT(data + nameoff,
> unlikely(mid >= ndirents - 1) ?
> maxsize - nameoff :
> le16_to_cpu(de[mid + 1].nameoff) - nameoff);
> 
> /* string comparison without already matched prefix */
> int ret = dirnamecmp(name, &dname, &matched);
> if le16_to_cpu(de[...].nameoff) is not monotonically increasing?  I.e.
> what's to prevent e.g. (unsigned)-1 ending up in dname.len?
> 
> Corrupted fs image shouldn't oops the kernel...

Yes, thanks for pointing out. :)
I will add more boundary check later before moving into fs/ directory...
erofs now is under dm-verity for our HUAWEI mobile phone, so it doesn't be 
corruptted.

I will add more checks and meta checksum later after EROFS productization 
successfully... :)

Thanks,
Gao Xiang
> 


Re: [git pull] mount API series

2018-11-02 Thread Al Viro
On Fri, Nov 02, 2018 at 04:07:01AM +, Al Viro wrote:
> On Thu, Nov 01, 2018 at 11:59:23PM +, David Howells wrote:
> 
> >  (*) mount-api-core.  These are the internal-only patches that add the
> >  fs_context, the legacy wrapper and the security hooks and make certain
> >  filesystems make use of it.
> 
> FWIW, while rereading that series I'd spotted something very odd in erofs.
> It's orthogonal to everything else, but just to make sure it doesn't get
> lost:
>   * sbi->dev_name thing in erofs is used only for debugging printks,
> basically.  Just use sb->s_id[] and be done with that.
>   * dump struct erofs_mount_private - you don't need dev_name in
> your erofs_fill_super().  Just use mount_bdev() in usual fashion.
>   * what the hell are you doing with ->s_root???  Why would you
> possibly want it hashed and what kind of dcache lookup could find it?
> That d_rehash() looks deeply confused; what are you trying to do there?

... and while we are at it, what happens to
unsigned int nameoff = le16_to_cpu(de[mid].nameoff);
unsigned int matched = min(startprfx, endprfx);

struct qstr dname = QSTR_INIT(data + nameoff,
unlikely(mid >= ndirents - 1) ?
maxsize - nameoff :
le16_to_cpu(de[mid + 1].nameoff) - nameoff);

/* string comparison without already matched prefix */
int ret = dirnamecmp(name, &dname, &matched);
if le16_to_cpu(de[...].nameoff) is not monotonically increasing?  I.e.
what's to prevent e.g. (unsigned)-1 ending up in dname.len?

Corrupted fs image shouldn't oops the kernel...


Re: [git pull] mount API series

2018-11-01 Thread Al Viro
On Thu, Nov 01, 2018 at 11:59:23PM +, David Howells wrote:

>  (*) mount-api-core.  These are the internal-only patches that add the
>  fs_context, the legacy wrapper and the security hooks and make certain
>  filesystems make use of it.

FWIW, while rereading that series I'd spotted something very odd in erofs.
It's orthogonal to everything else, but just to make sure it doesn't get
lost:
* sbi->dev_name thing in erofs is used only for debugging printks,
basically.  Just use sb->s_id[] and be done with that.
* dump struct erofs_mount_private - you don't need dev_name in
your erofs_fill_super().  Just use mount_bdev() in usual fashion.
* what the hell are you doing with ->s_root???  Why would you
possibly want it hashed and what kind of dcache lookup could find it?
That d_rehash() looks deeply confused; what are you trying to do there?


Re: [git pull] mount API series

2018-11-01 Thread David Howells
Linus Torvalds  wrote:

> So if the patch series can be split up into a prep-phase that doesn't
> change any user-visible semantics (including the security side), but
> that uses the fs_context internally and allows the filesystems to be
> converted to the new world order, than that would make merging the
> early work much easier (and then my worry about the later phases would
> probably be much less too).

As a first go, I've rebased the patches to v4.19 (which required no other
changes), folded in some small bugfixes (fix error handling in remount, fix
incorrect user_ns in proc and mqueue) and split the set up.

There are now three branches in my git tree:

 (*) mount-api-core.  These are the internal-only patches that add the
 fs_context, the legacy wrapper and the security hooks and make certain
 filesystems make use of it.

 (*) mount-api-uapi.  This is mount-api-core with the UAPI-visible patches
 stacked thereon.

 (*) mount-api.  This is the original patchset.

"git diff mount-api mount-api-uapi" shows no differences.

Note that the commit "vfs: Implement logging through fs_context" appears in
both sets.  I was just going to leave it as macros that just wrap pr_notice(),
but I think it might be wiser to pull it out of line (as will be required
later) and make it produce messages at different levels.

The git tree in question is at:

https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git

David


Re: [git pull] mount API series

2018-11-01 Thread Linus Torvalds
On Thu, Nov 1, 2018 at 3:05 PM Al Viro  wrote:
>
>  Do you mind if we end up with work.mount rebased?
> The usual objections re testing in -next do not apply in this case,
> AFAICS...

I was assuming that the work.mount branch would be entirely re-done, yes.

  Linus


Re: [git pull] mount API series

2018-11-01 Thread Al Viro
On Thu, Nov 01, 2018 at 11:33:31AM -0700, Linus Torvalds wrote:

> Al - can I ask you to look at helping David with something like that?
> You tend to be very good at generating those patch-series with
> "obviously no changes" for the individual patches, but the end result
> ends up being totally different from the starting point (I'm thinking
> of all the locking and dentry refcounting series).

I'll try.  Before we go there, I'd like to get the rest of vfs.git off
my hands - AFS series and misc pile.  Will send pull requests shortly,
then - this stuff.  Do you mind if we end up with work.mount rebased?
The usual objections re testing in -next do not apply in this case,
AFAICS...


Re: [git pull] mount API series

2018-11-01 Thread Linus Torvalds
On Thu, Nov 1, 2018 at 10:18 AM David Howells  wrote:
>
> No.  What we have upstream now is most definitely *not* atomic.  That said,
> some of the steps of the sequence are atomic in their own right:

It's more that what we have traditionally is atomic wrt user space.
Use space does *one* indivisible operation. There's no way for
user-space to do a half-way mount and say "ok, that's it, I'll just
exit now" (or skip a phase and fool the kernel into handling a
half-way setup issue).

So the new model does require a bit more care in tear-down etc. I'm
not so worried about this, actually, but it *is* different.

Afaik, some of the bugs were exactly in this half-way state bits.

[ And yes, looking back at some the reports I found, it was Alan
Jenkins and mainly the open-tree thing. ]

> Would you be willing to take just the *internal* fs_context changes and leave
> the UAPI for the next window?

Hmm. I had to think about that, but the more I thought about it, the
more I liked it.

Yes. Depending on how that is done, that would make a lot of my
worries go away. *Particularly* if it then allows us to do the
per-filesystem conversion one by one.

So if the patch series can be split up into a prep-phase that doesn't
change any user-visible semantics (including the security side), but
that uses the fs_context internally and allows the filesystems to be
converted to the new world order, than that would make merging the
early work much easier (and then my worry about the later phases would
probably be much less too).

It would be _really_ nice to see that prep-phase be done in a way
where each individual patch very obviously doesn't change semantics.
If it's obvious, then I'm happy to consider that pure prep-work and
willing to merge it after the merge window in order to make the _next_
merge window easier.

Al - can I ask you to look at helping David with something like that?
You tend to be very good at generating those patch-series with
"obviously no changes" for the individual patches, but the end result
ends up being totally different from the starting point (I'm thinking
of all the locking and dentry refcounting series).

  Linus


Re: [git pull] mount API series

2018-11-01 Thread David Howells
Linus Torvalds  wrote:

> Exactly because it splits up what used to be an atomic sequence,

No.  What we have upstream now is most definitely *not* atomic.  That said,
some of the steps of the sequence are atomic in their own right:

 - Creation and initialisation of the superblock
 - Creation of the vfsmount
 - Attachment of a new vfsmount

and these units have not changed.  What I have done is to extract the
parameter processing and move it to the front and provided a way to provide
more exotic parameters more easily.  This means that all parse errors are
found before we start creating objects and applying parameters (which is a bug
in some of our existing remount operations that this fixes).

Also, as the parameters are passed one at a time, I've also inserted a
validation step so that the parameter set as a whole can be checked for
inter-parameter consistency before any object creation is performed.

Note further that upstream, remount/reconfiguration is *not* atomic, but with
these changes that becomes closer to being realisable.  I've looked into how
to do it for btrfs using RCU and a CoW'd struct with the settings in, but
that's a much bigger, more intrusive change that doesn't need to be coupled to
the mount API changes.

> I worry about the intermediate states having issues (the refcounting things
> we've already seen, for example), but I also worry about the fact that it
> completely changes the model, and that that makes things like security hooks
> fundamentally different.

A lot of the recent bugs aren't from fsopen()/fsconfig()/fsmount() at all, but
rather from the open_tree() syscall and, to a lesser extent, the move_mount()
syscall.

Would you be willing to take just the *internal* fs_context changes and leave
the UAPI for the next window?  I have patches that make NFS and BTRFS use it,
which would then allow me to remove nearly 1000 lines of LSM code[1], but I
can't reasonably ask Trond and Chris to take them unless there's some schedule
to get the core dependencies in.

[1] See the "security: Remove the set_mnt_opts and clone_mnt_opts ops" commit
here:


https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=btrfs-mount-api

> The latter may not be a "bug" in the sense that it's all intentional,
> but it does mean that I see *one* mount-time security hook now having
> been replaced by *five* security hooks.

That's not really accurate.  I'm adding:

 (1) security_fs_context_parse_param()
 (2) security_fs_context_validate()
 (3) security_sb_get_tree()
 (4) security_sb_reconfigure()
 (5) security_sb_mountpoint()
 (6) security_move_mount()

Hook (1) allows individual mount parameters to examined/removed and then (2)
allows the whole set to be checked for consistency.  This is just parameter
parsing and checking.

Then (3) allows super_block::security to be initialised and (4) allows an
LSM's sb parameters to be reconfigured by remount (or changes to be
prohibited).  These use the context set up by (1) and checked by (2).

Finally, (5) is used to check that a mountpoint may be used and (6) to see if
a mount may be moved.

Further, I'm actually removing more than one:

 (1) security_sb_kern_mount()
 (2) security_sb_remount()
 (3) security_sb_copy_data()
 (4) security_sb_set_mnt_opts()
 (5) security_sb_clone_mnt_opts()
 (6) security_sb_parse_opts_str()

But four of them are only applicable to NFS and BTRFS, and can only be removed
after those have been dealt with.

> And that's ignoring the alloc/dup/free ones.

Yes, for the record, I'm adding:

 (7) security_fs_context_alloc()
 (8) security_fs_context_dup()
 (9) security_fs_context_free()

to allow the LSM to store data in the fs_context.

> As far as I can tell, the patch-series simply added the hooks. It made
> no attempt at making sure that previous hooks had sane semantics.

Ummm...  I can't say that what's upstream has sane semantics - that's not
really the focus of these patches.  What I've tried to do is to retain the
behaviour where possible.

> Do they?  So now a system that has an old mount hook can be bypassed by
> simplky using the new model instead.

Not so easily.  The LSM is *still* being consulted and I think both in SELinux
and Smack the same rules will still be applied.

There's a case in AppArmor that requires a rejigging of the FSM compiler to
make it work, and I discussed this with John Johansen at the LSS in Edinburgh
last week, but we can simply disallow the use of the new API with AppArmor.

> I dunno. The patches are illegible in this regard (and I don't blame
> the fsmount ones, I blame the security subsystem that just is full of
> random indirection to random sub-security systems, which in turn just
> have hash lookups for data structures set up by other operations
> entirely).
> 
> Eric was pointing out bugs as late as the weekend before the merge
> window opened. That, to me, does not say "ready for the merge window".

Actually, Alan Jenkins has been mostly the one pointing out the

Re: [git pull] mount API series

2018-11-01 Thread Al Viro
On Wed, Oct 31, 2018 at 04:36:01PM +, Al Viro wrote:
> On Wed, Oct 31, 2018 at 10:38:17AM -0500, Eric W. Biederman wrote:
> > A couple of bugs that I can see quickly.  Several of which I have
> > previously reported:
> > 
> > - There is an easily triggered NULL pointer deference with open_tree
> >   and mount propagation.
> 
> What the hell?  If the fixes that went in do not handle something,
> especially if you have testcases, where the fuck have you been and
> where _are_ those testcases, while we are at it?
> 
> Eric, this is bloody ridiculous - "I have an easily triggered NULL pointer
> dereference in..., here it is" is hard to miscommunicate.  Sure, any such
> needs to be fixed.  For crying out loud, that thing has not been hidden -
> it sat in -next, it's been reposted several times...

Again, would you mind telling what exactly does the above refer to and whether
it is still true?  I'm not asking to put details in every time you mention
something of that sort, but generally one is expected to come up with those
on demand, especially if the bug _is_ easily triggered.

You, IIRC, had been Cc'd on the threads where open_tree breakage was
dealt with.  I can certainly believe that there might be something else
in that area (or any other, for that matter).  I *am* interested in finding
and fixing that and I would rather appreciate the details of what you
are seeing.


Re: [git pull] mount API series

2018-11-01 Thread Linus Torvalds
On Thu, Nov 1, 2018 at 3:53 AM Steven Whitehouse  wrote:
>
> When I look at the discussions I'm seeing two main issues (please
> correct me if you think I'm wrong about this) which are (a) whether the
> design is correct and (b) whether there are still bugs in the current
> patch set.
>
> Which of these are you most concerned about?

I'm most worried about bugs _due_ to the new design.

Exactly because it splits up what used to be an atomic sequence, I
worry about the intermediate states having issues (the refcounting
things we've already seen, for example), but I also worry about the
fact that it completely changes the model, and that that makes things
like security hooks fundamentally different.

The latter may not be a "bug" in the sense that it's all intentional,
but it does mean that I see *one* mount-time security hook now having
been replaced by *five* security hooks.

And that's ignoring the alloc/dup/free ones.

As far as I can tell, the patch-series simply added the hooks. It made
no attempt at making sure that previous hooks had sane semantics. Do
they?  So now a system that has an old mount hook can be bypassed by
simplky using the new model instead.

I dunno. The patches are illegible in this regard (and I don't blame
the fsmount ones, I blame the security subsystem that just is full of
random indirection to random sub-security systems, which in turn just
have hash lookups for data structures set up by other operations
entirely).

Eric was pointing out bugs as late as the weekend before the merge
window opened. That, to me, does not say "ready for the merge window".

   Linus


Re: [git pull] mount API series

2018-11-01 Thread Steven Whitehouse

Hi,


On 31/10/18 16:18, Linus Torvalds wrote:

On Tue, Oct 30, 2018 at 10:34 PM Al Viro  wrote:

 mount API series from David Howells.  Last cycle's objections
had been of the "I'd do it differently" variety and with no such
differently done variants having ever materialized over several
cycles...

Having just lurked on the discussions about the bugs in here, I don't
think this is ready. Maybe they got fixed, but if so, it was recent
and it was pretty fundamental.

The stated aim of the series is to make the mount API _better_, not
worse. And right now it looks like a "two steps back, one theoretical
step forwards" kind of better.

   Linus


The design of the new mount API has been under discussion for some time, 
both on the mailing lists, and also at LSF/MM too. Al and David (and 
others) have put a lot of work into getting to the current position, and 
have specifically requested input from Eric about his concerns over past 
cycles.


When I look at the discussions I'm seeing two main issues (please 
correct me if you think I'm wrong about this) which are (a) whether the 
design is correct and (b) whether there are still bugs in the current 
patch set.


Which of these are you most concerned about? It seems to me that there 
is not a lot of point in spending a large amount of time in additional 
review/testing of the current patch set if the overall design is set to 
be rejected. If your concerns are only with the robustness/stability of 
the patch set, then that provides a clear route to resolving the current 
impasse, at least assuming that Eric is able to enumerate the issues 
that he has discovered.


It looks like David has already provided a fix for one of the issues 
which Eric mentioned in his recent email. Eric it would be good if you 
could confirm that this addresses that particular concern,


Steve.



Re: [git pull] mount API series

2018-10-31 Thread Miklos Szeredi
On Wed, Oct 31, 2018 at 7:39 PM, David Howells  wrote:
>> My objections fundamentally is that I can find real problems when I look
>> at the code.

I think the big risk with such a change is not that there are bugs,
but that we get the API wrong, and have to keep supporting a broken
API forever.

Were any of your objections of that type?

There's the argument about sharing of super blocks doing weird things,
but we can't get rid of that one due to having to support the old
mount(2) API forever anyway.  That's not a new burden, and the new API
allows fixing this incrementally by adding better models later.

Thanks,
Miklos


Re: [git pull] mount API series

2018-10-31 Thread David Howells
> My objections fundamentally is that I can find real problems when I look
> at the code.

Eric.

You have repeatedly stated that there are "thinkos, typos and bugs" in the
code, but you have not been very forthcoming in actually disclosing *what*
those things are.

You had a go at rewriting it for yourself, with different philosophical aims,
and at one point, you posted a partially reworked git branch my way.  However,
you changed the patches in various ways that rendered it incomparable with
mine, making it very hard to actually separate bug fixes from your other
changes.  No useful notes were given.

> Further the changes have not been incremental changes that have evolved the
> code from one state to another but complete replacements of code that make
> code review very difficult and bisection completely inapplicable.

... therefore I find this a little hypocritical.

Further, your rewrite breaks NFS and other things, such as mis-renumbering the
tokens in SELinux.  Splitting the lookup-and-create operation into separate
lookup-no-create and create-or-fail won't work for NFS.  Despite your
protestation, this *has* been discussed publically - and you even joined in
the discussion.

> A couple of bugs that I can see quickly.  Several of which I have
> previously reported:
> 
> - There is an easily triggered NULL pointer deference with open_tree
>   and mount propagation.

As far as I know, these have all been cleared up.  I think that I've cleared
up all the bugs that Alan Jenkins found (many thanks to him for that!).

> - Bisection will not work with the cpuset filesystem patch.

It won't?  I don't recall this been mentioned before (but I may have missed
it).

>   At least cpuset looks like it may be mountable now.

It does mount now.  Thanks for pointing that out.

> - The setting of fc->user_ns on proc remains broken.  In particular if you
>   create a child user namespace and attempt to mount proc it will succeed
>   instead of fail.

You did mention this before - but you haven't been clear on how to fix it.  I
tried to work this out from your branch, but it's mixed in with loads of other
changes.

> - The mqueue filesystem has the same issue as proc.  fc->user_ns is misset.

I have a patch that I think should fix both of these.  I'll follow up this
message with it, if you could check it please.

David


Re: [git pull] mount API series

2018-10-31 Thread Al Viro
On Wed, Oct 31, 2018 at 10:38:17AM -0500, Eric W. Biederman wrote:
> A couple of bugs that I can see quickly.  Several of which I have
> previously reported:
> 
> - There is an easily triggered NULL pointer deference with open_tree
>   and mount propagation.

What the hell?  If the fixes that went in do not handle something,
especially if you have testcases, where the fuck have you been and
where _are_ those testcases, while we are at it?

Eric, this is bloody ridiculous - "I have an easily triggered NULL pointer
dereference in..., here it is" is hard to miscommunicate.  Sure, any such
needs to be fixed.  For crying out loud, that thing has not been hidden -
it sat in -next, it's been reposted several times...

I'm glad that you can see bugs.  How about _showing_ them?  Because I don't
see anything that would prevent this reply of yours from being posted
pretty much verbatim cycle after cycle, no matter what gets done.


Re: [git pull] mount API series

2018-10-31 Thread Eric W. Biederman
ebied...@xmission.com (Eric W. Biederman) writes:

> I am going to stop there.  I believe there are more issues in the code.
> I am relieved that I am not seeing the loss of some of the security
> hooks that I thought I saw last time I looked at the code.

Bah.  Now I see the missing security hook.

There are a set of security hooks that allow security modules to parse
mount options.

On a good day they look like:

security_mnt_opts opts;
char *secdata;

secdata = alloc_secdata();
security_sb_copy_data("a,mount,options,string", secdata);

security_init_mnt_opts(&opts);
security_parse_opts_str(secdata, &opts);
security_set_mnt_opts(sb, &opts, 0, NULL);
security_free_mnt_opts(&opts);

In practice however things are not that explicit.  With
security_sb_kern_mount performing all of the mnt_opts work.

However after the rewrite in the patchset.

The function sb_kern_mount no longer exists and it's replacement
sb_get_tree out of necessity does not call parse_opts_str.  This is
because the mount options can no longer be passed as a string.

The legacy compatibility code also does not call sb_parse_opts_str.

The result is using the existing apis all of the security module command
line parsing except for (btrfs and nfs) no longer works.


The changes are not structured in a way that makes any of this easy to
find.  Which is why I have been saying I wouldn't do it that way.  It
also is the case that this pattern repeats through out the patches.
Replacing code with something brand new, instead of evolving what is
there.  That makes it easy for this kind of thing to slip through.

Eric


Re: [git pull] mount API series

2018-10-31 Thread Linus Torvalds
On Tue, Oct 30, 2018 at 10:34 PM Al Viro  wrote:
>
> mount API series from David Howells.  Last cycle's objections
> had been of the "I'd do it differently" variety and with no such
> differently done variants having ever materialized over several
> cycles...

Having just lurked on the discussions about the bugs in here, I don't
think this is ready. Maybe they got fixed, but if so, it was recent
and it was pretty fundamental.

The stated aim of the series is to make the mount API _better_, not
worse. And right now it looks like a "two steps back, one theoretical
step forwards" kind of better.

  Linus


Re: [git pull] mount API series

2018-10-31 Thread Eric W. Biederman
Al Viro  writes:

>   mount API series from David Howells.  Last cycle's objections
> had been of the "I'd do it differently" variety and with no such
> differently done variants having ever materialized over several
> cycles...

Absolutely not.

My objections fundamentally is that I can find real problems when I look
at the code.  Further the changes have not been incremental changes that
have evolved the code from one state to another but complete
replacements of code that make code review very difficult and bisection
completely inapplicable.

I also object that this series completely fails to fix the worst but I
have ever seen in the mount API.  Whit no real intrest shown in working
to fix it.

A couple of bugs that I can see quickly.  Several of which I have
previously reported:

- There is an easily triggered NULL pointer deference with open_tree
  and mount propagation.

- Bisection will not work with the cpuset filesystem patch.  At least
  cpuset looks like it may be mountable now.

- The setting of fc->user_ns on proc remains broken.  In particular if you
  create a child user namespace and attempt to mount proc it will succeed
  instead of fail.

- The mqueue filesystem has the same issue as proc.  fc->user_ns is misset.

I suspect I didn't report it well but I reported both the proc and the
mqueue filesystem weeks before the merge window opened.


I am going to stop there.  I believe there are more issues in the code.
I am relieved that I am not seeing the loss of some of the security
hooks that I thought I saw last time I looked at the code.


Given both that I have reported bugs that remain unfixed and it's
non-evolutionary nature that makes this patchset hard to review I have
no confidence in this set of patches.

I think the scope has been too large for anyone to properly review the
code and it should be broken into smaller pieces.  That there were
significant open_tree and move_mount issues being found just before the
merge window opened seems a testament to that.  (AKA the first 3 or so
patches in the patchset and fundamental to the rest).


My apologies that we have not been able communication better and that I
have not been able to clearly point out all of the bugs.   My apologies
I have not yet been able to give more constructive feedback.

Still at the end of the day there remain regressions of existing
functionality in that tree.

Eric


[git pull] mount API series

2018-10-30 Thread Al Viro
mount API series from David Howells.  Last cycle's objections
had been of the "I'd do it differently" variety and with no such
differently done variants having ever materialized over several
cycles...

Conflicts: two trivial ones in
drivers/infiniband/Kconfig (removal of select ANON_INODES)
fs/f2fs/super.c (->remount signature change)
and a non-trivial one in fs/proc/inode.c - there we have
mainline adding
/* procfs dentries and inodes don't require IO to create */
s->s_shrink.seeks = 0;
to proc_fill_super() (in 4b85afbdacd2 "mm: zero-seek shrinkers")
while that series moves the sucker to fs/proc/root.c.  Resolved by
removing the old copy from fs/proc/inode.c and adding the same lines
into the new copy in fs/proc/root.c.
I'd put a variant of resolution into #proposed-merge.

David's cover letter follows; it's obviously over the top for commit
message of the merge.  Where to cut it is up to you...

=

Here are a set of patches to create a filesystem context prior to setting
up a new mount, populating it with the parsed options/binary data, looking
up/creating the superblock, querying it and then effecting the mount.  This
is also used for remount since much of the parsing stuff is common in many
filesystems.

This allows namespaces and other information to be conveyed through the
mount procedure.  This is done with something like:

fd = fsopen("nfs", 0);
fsconfig(fd, FSCONFIG_SET_STRING, "option", "val", 0);
fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
struct fsinfo_statfs statfs;
fsinfo(fd, NULL, NULL, &statfs, sizeof(statfs));
mfd = fsmount(fd, MS_NODEV);
move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);

The new move_mount() syscall can also be used simply to move mounts around:

move_mount(AT_FDCWD, "/mnt", AT_FDCWD, "/mnt2", 0);

And, in conjunction with the open_tree() syscall, can be used to clone
mounts:

fd = open_tree(AT_FDCWD, "/mnt", AT_RECURSIVE | OPEN_TREE_CLONE);
move_mount(mfd, "", AT_FDCWD, "/mnt2", MOVE_MOUNT_F_EMPTY_PATH);

File descriptors can be used as mountpoint references:

fd = open_tree(AT_FDCWD, "/mnt", 0);
move_mount(mfd, "", AT_FDCWD, "/mnt2", MOVE_MOUNT_F_EMPTY_PATH);
move_mount(mfd, "", AT_FDCWD, "/mnt3", MOVE_MOUNT_F_EMPTY_PATH);

which, in this example, will *move* the mount at /mnt to /mnt2 and thence
to /mnt3.

Superblocks can be picked and reconfigured:

fd = fspick(AT_FDCWD, "/mnt", 0)
fsconfig(fd, FSCONFIG_SET_STRING, "option", "other-val", 0);
fsconfig(fd, FSCONFIG_SET_STRING, "option2", "true", 0);
fsconfig(fd, FSCONFIG_SET_STRING, "option3", "1234", 0);
fsconfig(fd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0);

Filesystem parameters and other attributes can also be queried from the fd
returned by fsopen() or fspick():

fd = fspick(AT_FDCWD, "/mnt", 0)
struct fsinfo_params params = {
.request= FSINFO_ATTR_PARAMETER,
.Nth= 1,
.Mth= 3,
};
char param_buf[4096];
fsinfo(fd, NULL, ¶ms, param_buf, sizeof(param_buf));

which will retrieve the 4th value of the 2nd parameter (0 being first) as a
printable string.

Parameters and attributes can also be queried by path or on an ordinary fd:

struct fsinfo_params params = {
.request= FSINFO_ATTR_VOLUME_NAME,
};
char param_buf[4096];
fsinfo(AT_FDCWD, "/etc/passwd", ¶ms, param_buf,
   sizeof(param_buf));

The details of a filesystem's parser can also be queried:

fd = fsopen("ext4", 0);
struct fsinfo_params params = {
.request= FSINFO_ATTR_PARAM_NAME,
.Nth= 1,
};
char param_buf[4096];
fsinfo(fd, NULL, ¶ms, param_buf, sizeof(param_buf));

which, in this instance, will retrieve the name of parameter #1.

I have implemented filesystem context handling for procfs, nfs, mqueue,
cpuset, kernfs, sysfs, cgroup and afs filesystems.  Unconverted filesystems
are handled by a legacy filesystem wrapper for the moment.

Note that I didn't use netlink as that would make the core kernel depend on
CONFIG_NET and CONFIG_NETLINK and would introduce network namespacing
issues.



WHY DO WE WANT THIS?


Firstly, there's a bunch of problems with the mount(2) syscall:

 (1) It's actually six or seven different interfaces rolled into one and
 weird combinations of flags make it do different things beyond the
 original specification of the syscall.

 (2) It produces a particularly large and diverse set of errors, which have
 to be mapped back to a small error code.  Yes, there's dmesg - if you
 have it configured - but you can't necessarily see that if you're